fgbio

Tools for working with genomic and high-throughput sequencing data.

CorrectUmis

Overview

Group: Unique Molecular Identifiers (UMIs)

Corrects UMIs stored in BAM files when a set of fixed UMIs is in use. If the set of UMIs used in an experiment is known and is a subset of the possible randomers of the same length, it is possible to error-correct UMIs prior to grouping reads by UMI. This tool takes an input BAM with UMIs in a tag (RX by default) and a set of known UMIs (either on the command line or in a file) and produces:

  1. A new BAM with corrected UMIs in the same tag the UMIs were found in
  2. Optionally a set of metrics about the representation of each UMI in the set
  3. Optionally a second BAM file of reads whose UMIs could not be corrected within the specified parameters

All of the fixed UMIs must be of the same length, and all UMIs in the BAM file must also have the same length. Multiple UMIs that are concatenated with hyphens (e.g. AACCAGT-AGGTAGA) are split apart, corrected individually and then re-assembled. A read is accepted only if all the UMIs can be corrected.
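The split, correct and re-assemble behaviour described above can be sketched in Python. This is an illustrative sketch, not fgbio's implementation: `correct_umi_tag`, `correct_one` and the toy lookup table are hypothetical names, and the per-UMI corrector is assumed to return `None` when a UMI cannot be corrected.

```python
from typing import Callable, Optional


def correct_umi_tag(
    tag_value: str,
    correct_one: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Correct each hyphen-separated UMI; return None if any part fails.

    `correct_one` is a hypothetical per-UMI corrector that returns the
    matching fixed UMI, or None when no acceptable match exists.
    """
    corrected = []
    for umi in tag_value.split("-"):
        fixed = correct_one(umi)
        if fixed is None:
            # One uncorrectable UMI rejects the whole read.
            return None
        corrected.append(fixed)
    return "-".join(corrected)


# Toy corrector for illustration only: a lookup from observed to fixed UMI.
toy_table = {
    "AACCAGT": "AACCAGT",
    "AACCAGA": "AACCAGT",
    "AGGTAGA": "AGGTAGA",
}
toy_correct = toy_table.get  # returns None for unknown UMIs
```

With the toy table, `correct_umi_tag("AACCAGA-AGGTAGA", toy_correct)` re-assembles the two corrected halves, while a tag containing any UMI missing from the table yields `None`.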

Correction is controlled by two parameters that are applied per-UMI:

  1. --max-mismatches controls how many mismatches (no-calls are counted as mismatches) are tolerated between a UMI as read and a fixed UMI.
  2. --min-distance controls how many more mismatches the next best hit must have than the best hit.

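For example, with two fixed UMIs AAAAA and CCCCC and --max-mismatches=3 and --min-distance=2 the following would happen:

  1. AAAAA matches AAAAA exactly (zero mismatches, versus five to CCCCC).
  2. AAGTG is corrected to AAAAA: it has three mismatches to AAAAA and five to CCCCC, and 5 >= 3 + 2.
  3. AACCA is rejected: it has two mismatches to AAAAA and three to CCCCC, and 3 < 2 + 2.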
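The two-parameter rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than fgbio's actual code: the function names are hypothetical, and the acceptance test assumes that the next best hit must have at least `min_distance` more mismatches than the best hit.

```python
from typing import Optional, Sequence


def count_mismatches(observed: str, expected: str) -> int:
    """Count positions that differ; an N never equals A, C, G or T,
    so no-calls are counted as mismatches."""
    if len(observed) != len(expected):
        raise ValueError("UMIs must have the same length")
    return sum(1 for o, e in zip(observed, expected) if o != e)


def find_best_match(
    observed: str,
    fixed_umis: Sequence[str],
    max_mismatches: int,
    min_distance: int,
) -> Optional[str]:
    """Return the fixed UMI to correct to, or None if the UMI is rejected."""
    if not fixed_umis:
        raise ValueError("at least one fixed UMI is required")
    scored = sorted((count_mismatches(observed, u), u) for u in fixed_umis)
    best_mm, best_umi = scored[0]
    # With a single fixed UMI there is no competitor to compare against.
    next_mm = scored[1][0] if len(scored) > 1 else float("inf")
    # Assumption: "at least min_distance more mismatches" means >=.
    if best_mm <= max_mismatches and next_mm - best_mm >= min_distance:
        return best_umi
    return None


fixed = ["AAAAA", "CCCCC"]
print(find_best_match("AAAAA", fixed, 3, 2))  # AAAAA
print(find_best_match("AAGTG", fixed, 3, 2))  # AAAAA
print(find_best_match("AACCA", fixed, 3, 2))  # None
```

AAGTG sits exactly on the boundary (three mismatches to AAAAA, five to CCCCC), so its outcome depends on the `>=` assumption noted in the code.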

The set of fixed UMIs may be specified on the command line using --umis umi1 umi2 ... or via one or more files of UMIs with a single sequence per line using --umi-files umis.txt more_umis.txt. If there are multiple UMIs per template, leading to hyphenated UMI tags, the values for the fixed UMIs should be single, non-hyphenated UMIs (e.g. if a record has RX:Z:ACGT-GGCA, you would use --umis ACGT GGCA).
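As a rough illustration of how the two sources of fixed UMIs combine, the Python sketch below merges sequences given directly with sequences read from files (one per line) and applies the constraints stated in this document: same length, no hyphens. The function name `load_fixed_umis` is hypothetical, and skipping blank lines is an assumption of this sketch, not documented fgbio behaviour.

```python
from pathlib import Path
from typing import Iterable, Set


def load_fixed_umis(
    umis: Iterable[str] = (),
    umi_files: Iterable[str] = (),
) -> Set[str]:
    """Merge UMIs given directly with UMIs listed one per line in files."""
    collected = {u.strip() for u in umis if u.strip()}
    for path in umi_files:
        for line in Path(path).read_text().splitlines():
            if line.strip():  # assumption: blank lines are ignored
                collected.add(line.strip())
    if not collected:
        raise ValueError("no fixed UMIs were provided")
    hyphenated = sorted(u for u in collected if "-" in u)
    if hyphenated:
        # For RX:Z:ACGT-GGCA the fixed UMIs are ACGT and GGCA, not ACGT-GGCA.
        raise ValueError(f"fixed UMIs must not be hyphenated: {hyphenated}")
    if len({len(u) for u in collected}) != 1:
        raise ValueError("all fixed UMIs must have the same length")
    return collected
```

For a record with RX:Z:ACGT-GGCA, calling `load_fixed_umis(["ACGT", "GGCA"])` returns both single UMIs, whereas passing `"ACGT-GGCA"` raises an error.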

Records which have their UMIs corrected (i.e. the UMI is not identical to one of the expected UMIs but is close enough to be corrected) will by default have their original UMI stored in the OX tag. This can be disabled with the --dont-store-original-umis option.
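The tag bookkeeping can be sketched in Python with a plain dictionary standing in for a record's tags. This is only a sketch: `apply_correction` is a hypothetical helper, real code would go through a BAM library, and storing the entire original tag value in OX (rather than only the changed part of a hyphenated UMI) is an assumption.

```python
from typing import Dict


def apply_correction(
    tags: Dict[str, str],
    corrected: str,
    umi_tag: str = "RX",
    store_original: bool = True,
) -> None:
    """Write the corrected UMI back and keep the original in OX if it changed."""
    original = tags[umi_tag]
    if corrected == original:
        return  # exact match to an expected UMI: nothing to record
    if store_original:
        tags["OX"] = original  # assumption: the full original value is kept
    tags[umi_tag] = corrected


record_tags = {"RX": "AAGTG"}
apply_correction(record_tags, "AAAAA")
print(record_tags)  # {'RX': 'AAAAA', 'OX': 'AAGTG'}
```

Passing `store_original=False` mirrors the effect of --dont-store-original-umis: the RX tag is updated and no OX tag is written.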

For a large number of input UMIs, the --cache-size option may be used to speed up the tool. To disable the cache, set the value to 0.
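The idea behind the cache can be illustrated in Python: the same uncorrected UMI tends to recur many times, so remembering its result avoids comparing it against every fixed UMI again. This sketch uses `functools.lru_cache`; the helper name `make_cached` is hypothetical, and the least-recently-used eviction policy is an assumption, since this document does not say how fgbio evicts entries.

```python
from functools import lru_cache
from typing import Callable, Optional

Corrector = Callable[[str], Optional[str]]


def make_cached(correct_one: Corrector, cache_size: int = 100_000) -> Corrector:
    """Wrap a per-UMI corrector in a bounded cache; zero disables caching."""
    if cache_size == 0:
        return correct_one
    return lru_cache(maxsize=cache_size)(correct_one)


calls = []


def slow_correct(umi: str) -> Optional[str]:
    """Stand-in for an expensive comparison against every fixed UMI."""
    calls.append(umi)
    return "AAAAA" if umi.count("A") >= 3 else None


cached = make_cached(slow_correct, cache_size=2)
cached("AAGAA")
cached("AAGAA")  # served from the cache; slow_correct is not called again
print(len(calls))  # 1
```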

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
| --- | --- | --- | --- | --- | --- | --- |
| input | i | PathToBam | Input SAM or BAM file. | Required | 1 | |
| output | o | PathToBam | Output SAM or BAM file. | Required | 1 | |
| rejects | r | PathToBam | Reject BAM file to save unassigned reads. | Optional | 1 | |
| metrics | M | FilePath | Metrics file to write. | Optional | 1 | |
| max-mismatches | m | Int | Maximum number of mismatches between a UMI and an expected UMI. | Required | 1 | |
| min-distance | d | Int | Minimum distance (in mismatches) to next best UMI. | Required | 1 | |
| umis | u | String | Expected UMI sequences. | Optional | Unlimited | |
| umi-files | U | FilePath | File of UMI sequences, one per line. | Optional | Unlimited | |
| umi-tag | t | String | Tag in which UMIs are stored. | Optional | 1 | RX |
| dont-store-original-umis | x | Boolean | Don’t store original UMIs upon correction. | Optional | 1 | false |
| cache-size | | Int | The number of uncorrected UMIs to cache; zero will disable the cache. | Optional | 1 | 100000 |
| min-corrected | | Double | The minimum ratio of kept UMIs to accept. A ratio below this will cause a failure (but all files will still be written). | Optional | 1 | |
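To show how the arguments fit together, the Python sketch below assembles a command line from the long option names listed above. The helper `build_correct_umis_command` is hypothetical, and invoking the tool as `fgbio CorrectUmis` is an assumption about the installation (some setups launch the jar through `java` instead). The sketch only builds the argument list and does not run anything.

```python
from typing import List, Optional, Sequence


def build_correct_umis_command(
    input_bam: str,
    output_bam: str,
    max_mismatches: int,
    min_distance: int,
    umis: Sequence[str] = (),
    umi_files: Sequence[str] = (),
    rejects: Optional[str] = None,
    metrics: Optional[str] = None,
    executable: str = "fgbio",  # assumption: a launcher named fgbio is on PATH
) -> List[str]:
    """Assemble an argument list using the option names documented above."""
    cmd = [
        executable,
        "CorrectUmis",
        f"--input={input_bam}",
        f"--output={output_bam}",
        f"--max-mismatches={max_mismatches}",
        f"--min-distance={min_distance}",
    ]
    if umis:
        cmd += ["--umis", *umis]  # multiple values follow a single flag
    if umi_files:
        cmd += ["--umi-files", *umi_files]
    if rejects:
        cmd.append(f"--rejects={rejects}")
    if metrics:
        cmd.append(f"--metrics={metrics}")
    return cmd


command = build_correct_umis_command(
    "in.bam", "corrected.bam", 3, 2,
    umis=["AAAAA", "CCCCC"], metrics="umi_metrics.txt",
)
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` once the executable name has been adjusted to match the local installation.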