Tools for working with genomic and high throughput sequencing data.
Group: Unique Molecular Identifiers (UMIs)
Corrects UMIs stored in BAM files when a set of fixed UMIs is in use. If the set of UMIs used in
an experiment is known and is a subset of the possible randomers of the same length, it is possible
to error-correct UMIs prior to grouping reads by UMI. This tool takes an input BAM with UMIs in a
tag (`RX` by default) and a set of known UMIs (either on the command line or in a file) and produces
an output BAM and, optionally, a metrics file and a reject BAM of unassigned reads.
All of the fixed UMIs must be of the same length, and all UMIs in the BAM file must also have the same
length. Multiple UMIs that are concatenated with hyphens (e.g. `AACCAGT-AGGTAGA`) are split apart,
corrected individually and then re-assembled. A read is accepted only if all the UMIs can be corrected.
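The split, correct, and re-assemble step can be sketched in Python as follows. This is a minimal illustration, not the tool's implementation: the function name `correct_hyphenated` and the `correct_one` callback are hypothetical, and the exact-match corrector in the usage lines stands in for whatever per-UMI correction is actually applied.

```python
from typing import Callable, Optional


def correct_hyphenated(
    umi_tag: str, correct_one: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Split a hyphenated UMI, correct each part, and re-join the parts.

    Returns None (read rejected) as soon as any single part cannot be corrected.
    """
    corrected = []
    for part in umi_tag.split("-"):
        fixed = correct_one(part)
        if fixed is None:
            return None
        corrected.append(fixed)
    return "-".join(corrected)


# Hypothetical per-UMI corrector: accepts only exact matches to the known set.
known = {"AACCAGT", "AGGTAGA"}


def exact(umi: str) -> Optional[str]:
    return umi if umi in known else None


print(correct_hyphenated("AACCAGT-AGGTAGA", exact))  # AACCAGT-AGGTAGA
print(correct_hyphenated("AACCAGT-TTTTTTT", exact))  # None
```

Passing the corrector in as a callback keeps the hyphen handling separate from the mismatch rules, so the same wrapper works with any per-UMI strategy.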
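Correction is controlled by two parameters that are applied per-UMI: `--max-mismatches`, the maximum
number of mismatches between a UMI and an expected UMI, and `--min-distance`, the minimum distance
(in mismatches) to the next best UMI.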
For example, with two fixed UMIs `AAAAA` and `CCCCC` and `--max-mismatches=3` and `--min-distance=2` the
following would happen:
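One way to read these two parameters is sketched below in Python. It assumes a UMI is accepted when its best match has at most `max_mismatches` mismatches and the next-best match is at least `min_distance` mismatches further away. That acceptance rule, and the names `mismatches` and `find_best_match`, are illustrative assumptions rather than the tool's confirmed logic.

```python
from typing import Optional, Sequence


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_best_match(
    umi: str, fixed_umis: Sequence[str], max_mismatches: int, min_distance: int
) -> Optional[str]:
    """Return the fixed UMI that `umi` corrects to, or None if it is rejected."""
    if not fixed_umis:
        raise ValueError("at least one fixed UMI is required")
    scored = sorted((mismatches(umi, fixed), fixed) for fixed in fixed_umis)
    best_mm, best = scored[0]
    # With a single fixed UMI there is no competitor (assumption: treat as infinitely far).
    next_mm = scored[1][0] if len(scored) > 1 else float("inf")
    if best_mm <= max_mismatches and next_mm - best_mm >= min_distance:
        return best
    return None


fixed = ["AAAAA", "CCCCC"]
for observed in ["AAAAA", "AAGTG", "AACCA", "GGGGG"]:
    print(observed, find_best_match(observed, fixed, max_mismatches=3, min_distance=2))
# AAAAA AAAAA  (0 mismatches; CCCCC is 5 away)
# AAGTG AAAAA  (3 mismatches; CCCCC is 5 away, a gap of 2)
# AACCA None   (2 mismatches to AAAAA, 3 to CCCCC, a gap of only 1)
# GGGGG None   (5 mismatches to both, above the maximum of 3)
```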
The set of fixed UMIs may be specified on the command line using `--umis umi1 umi2 ...` or via one or
more files of UMIs with a single sequence per line using `--umi-files umis.txt more_umis.txt`. If there
are multiple UMIs per template, leading to hyphenated UMI tags, the values for the fixed UMIs should
be single, non-hyphenated UMIs (e.g. if a record has `RX:Z:ACGT-GGCA`, you would use `--umis ACGT GGCA`).
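As a rough sketch of how the two sources of fixed UMIs could be combined and validated, the Python below merges command-line values with one-per-line files and enforces the same-length and non-hyphenated requirements. The function `load_fixed_umis` is hypothetical, and skipping blank lines and de-duplicating are assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def load_fixed_umis(
    umis: Iterable[str] = (), umi_files: Iterable[str] = ()
) -> List[str]:
    """Merge UMIs given directly with UMIs read from files (one sequence per line)."""
    collected = list(umis)
    for path in umi_files:
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if line:  # assumption: blank lines are ignored
                collected.append(line)
    if not collected:
        raise ValueError("no fixed UMIs were provided")
    if any("-" in umi for umi in collected):
        raise ValueError("fixed UMIs must be single, non-hyphenated sequences")
    if len({len(umi) for umi in collected}) != 1:
        raise ValueError("all fixed UMIs must be the same length")
    return sorted(set(collected))


# Usage: two UMIs given directly plus a temporary file standing in for umis.txt.
with tempfile.TemporaryDirectory() as tmp:
    umi_file = Path(tmp) / "umis.txt"
    umi_file.write_text("GGCA\nTTAG\n")
    fixed = load_fixed_umis(umis=["ACGT", "GGCA"], umi_files=[str(umi_file)])

print(fixed)  # ['ACGT', 'GGCA', 'TTAG']
```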
Records which have their UMIs corrected (i.e. the UMI is not identical to one of the expected UMIs but is
close enough to be corrected) will by default have their original UMI stored in the `OX` tag. This can be
disabled with the `--dont-store-original-umis` option.
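The tag handling described here can be illustrated with a small Python sketch that models a record's tags as a plain dictionary. The function `apply_correction` and its `store_original` argument are hypothetical stand-ins for the tool's behaviour and the `--dont-store-original-umis` switch.

```python
from typing import Dict


def apply_correction(
    tags: Dict[str, str],
    corrected: str,
    umi_tag: str = "RX",
    store_original: bool = True,
) -> Dict[str, str]:
    """Return a copy of `tags` with the UMI replaced by its corrected value.

    The original UMI is kept in OX only when the value actually changed
    and storing originals has not been switched off.
    """
    original = tags[umi_tag]
    updated = dict(tags)
    if corrected != original:
        updated[umi_tag] = corrected
        if store_original:
            updated["OX"] = original
    return updated


print(apply_correction({"RX": "AAAAC"}, "AAAAA"))
# {'RX': 'AAAAA', 'OX': 'AAAAC'}
print(apply_correction({"RX": "AAAAA"}, "AAAAA"))
# {'RX': 'AAAAA'}
print(apply_correction({"RX": "AAAAC"}, "AAAAA", store_original=False))
# {'RX': 'AAAAA'}
```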
For a large number of input UMIs, the `--cache-size` option may be used to speed up the tool. To disable
using a cache, set the value to `0`.
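To show why a cache helps, the Python sketch below wraps a per-UMI corrector so that repeated observed UMIs are looked up rather than re-matched against every fixed UMI. The tool's actual cache is not described here; the least-recently-used policy from `functools.lru_cache`, the helper `with_cache`, and the toy `slow_correct` function are all assumptions for illustration.

```python
from functools import lru_cache
from typing import Callable, List, Optional


def with_cache(
    correct_one: Callable[[str], Optional[str]], cache_size: int = 100000
) -> Callable[[str], Optional[str]]:
    """Wrap a corrector in a bounded cache; a size of zero returns it unwrapped."""
    if cache_size <= 0:
        return correct_one
    return lru_cache(maxsize=cache_size)(correct_one)


calls: List[str] = []  # records every time the underlying corrector really runs


def slow_correct(umi: str) -> Optional[str]:
    """Toy corrector standing in for a comparison against every fixed UMI."""
    calls.append(umi)
    return "AAAAA" if umi.startswith("AAAA") else None


cached = with_cache(slow_correct, cache_size=2)
results = [cached("AAAAC"), cached("AAAAC"), cached("GGGGG")]

print(results)     # ['AAAAA', 'AAAAA', None]
print(len(calls))  # 2
```

The second lookup of `AAAAC` is served from the cache, so the underlying corrector runs only twice for three reads.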
Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
---|---|---|---|---|---|---|
input | i | PathToBam | Input SAM or BAM file. | Required | 1 | |
output | o | PathToBam | Output SAM or BAM file. | Required | 1 | |
rejects | r | PathToBam | Reject BAM file to save unassigned reads. | Optional | 1 | |
metrics | M | FilePath | Metrics file to write. | Optional | 1 | |
max-mismatches | m | Int | Maximum number of mismatches between a UMI and an expected UMI. | Required | 1 | |
min-distance | d | Int | Minimum distance (in mismatches) to next best UMI. | Required | 1 | |
umis | u | String | Expected UMI sequences. | Optional | Unlimited | |
umi-files | U | FilePath | File of UMI sequences, one per line. | Optional | Unlimited | |
umi-tag | t | String | Tag in which UMIs are stored. | Optional | 1 | RX |
dont-store-original-umis | x | Boolean | Don’t store original UMIs upon correction. | Optional | 1 | false |
cache-size | | Int | The number of uncorrected UMIs to cache; zero will disable the cache. | Optional | 1 | 100000 |
min-corrected | | Double | The minimum ratio of kept UMIs to accept. A ratio below this will cause a failure (but all files will still be written). | Optional | 1 | |