Tools for working with genomic and high throughput sequencing data.
Group: Unique Molecular Identifiers (UMIs)
Calls duplex consensus sequences from reads generated from the same double-stranded source molecule. Prior to running this tool, reads must have been grouped with GroupReadsByUmi using the paired strategy. Doing so will apply (by default) MI tags to all reads of the form */A and */B, where the /A and /B suffixes with the same identifier denote reads that are derived from opposite strands of the same source duplex molecule.
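As an illustration of the MI tag convention described above, the following is a minimal sketch that splits an MI value into its identifier and strand suffix. The names `MolecularId`, `parse_duplex_mi` and `same_duplex_opposite_strands` are hypothetical helpers for this example and are not part of fgbio.

```python
from typing import NamedTuple


class MolecularId(NamedTuple):
    molecule: str  # identifier shared by both strands of one duplex molecule
    strand: str    # "A" or "B"


def parse_duplex_mi(mi: str) -> MolecularId:
    """Split an MI value such as '42/A' into (molecule, strand)."""
    molecule, sep, strand = mi.rpartition("/")
    if not sep or not molecule or strand not in ("A", "B"):
        raise ValueError(f"not an MI value of the form */A or */B: {mi!r}")
    return MolecularId(molecule, strand)


def same_duplex_opposite_strands(mi1: str, mi2: str) -> bool:
    """True if the two MI values share an identifier but differ in strand."""
    a, b = parse_duplex_mi(mi1), parse_duplex_mi(mi2)
    return a.molecule == b.molecule and a.strand != b.strand
```

For example, `same_duplex_opposite_strands("42/A", "42/B")` is true, while `"42/A"` and `"43/B"` belong to different source molecules.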
Reads from the same unique molecule are first partitioned by source strand and assembled into single strand consensus molecules as described by CallMolecularConsensusReads. Subsequently, for molecules that have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence from the two single strand consensus reads.
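The partitioning step can be pictured with a small sketch. It assumes reads are represented as simple `(read_name, mi_value)` pairs and only models the grouping and the "at least one observation of each strand" check, not the consensus calling itself. The function names are hypothetical and do not reflect fgbio's internals.

```python
from collections import defaultdict


def partition_by_strand(reads):
    """Group (read_name, mi_value) pairs by source molecule and strand.

    Returns a dict of the form {molecule: {"A": [names], "B": [names]}}.
    """
    molecules = defaultdict(lambda: {"A": [], "B": []})
    for name, mi in reads:
        molecule, _, strand = mi.rpartition("/")
        molecules[molecule][strand].append(name)
    return dict(molecules)


def duplex_eligible(strands):
    """A duplex consensus needs at least one observation of each strand."""
    return len(strands["A"]) >= 1 and len(strands["B"]) >= 1


reads = [("r1", "1/A"), ("r2", "1/B"), ("r3", "1/A"), ("r4", "2/A")]
groups = partition_by_strand(reads)
eligible = [m for m, strands in groups.items() if duplex_eligible(strands)]
```

In this toy input, molecule `1` has reads from both strands and is eligible, whereas molecule `2` has only an A-strand read.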
Because of the nature of duplex sequencing, this tool does not support fragment reads - if found in the input they are ignored. Similarly, read pairs for which consensus reads cannot be generated for one or other read (R1 or R2) are omitted from the output.
The consensus reads produced are unaligned, because inferring the consensus alignment is difficult and error-prone. Consensus reads should therefore be aligned afterwards, which should not be too expensive as there are likely far fewer consensus reads than input raw reads. Please see how best to use this tool within the best-practice pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md
Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a), second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are one per read and lower case for values that are one per base.
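The naming pattern can be made concrete with a short decoder. This is only a sketch of the convention as stated in this document; `describe_tag` is a hypothetical helper, and the letter meanings for `c` (bases) and `q` (qualities) are taken from the per-base tags listed further down.

```python
SOURCES = {
    "a": "first single-strand consensus",
    "b": "second single-strand consensus",
    "c": "final duplex consensus",
}
MEANINGS = {
    "d": "depth",
    "m": "min depth",
    "e": "errors/error-rate",
    "c": "bases",
    "q": "qualities",
}


def describe_tag(tag: str) -> dict:
    """Decode a two-letter consensus tag such as 'aD' or 'be'."""
    if len(tag) != 2 or tag[0] not in SOURCES or tag[1].lower() not in MEANINGS:
        raise ValueError(f"not a recognised consensus tag: {tag!r}")
    return {
        "applies_to": SOURCES[tag[0]],
        "meaning": MEANINGS[tag[1].lower()],
        # Upper case second letter: one value per read; lower case: one per base.
        "per_base": tag[1].islower(),
    }
```

For instance, `describe_tag("cD")` reports a per-read depth on the final duplex consensus, and `describe_tag("be")` reports per-base errors on the second single-strand consensus.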
The tags break down into those that are single-valued per read:
consensus depth [aD,bD,cD] (int) : the maximum depth of raw reads at any point in the consensus reads
consensus min depth [aM,bM,cM] (int) : the minimum depth of raw reads at any point in the consensus reads
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls
And those that have a value per base (duplex values are not generated, but can be generated by summing):
consensus depth [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
consensus bases [ac,bc] (string): the single-strand consensus bases
consensus qualities [aq,bq] (string): the single-strand consensus qualities
The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the min-input-base-quality are not counted in tag value calculations.
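Since per-base duplex values are not written but can be obtained by summing, a sketch of that summation may help. It assumes the two single-strand arrays have already been read from the `ad`/`bd` (or `ae`/`be`) tags and have equal length. The per-read summary shown is one plausible reading of the tag definitions above, not a statement of how fgbio computes its own per-read tags. Both function names are hypothetical.

```python
def duplex_per_base(a_values, b_values):
    """Sum per-base values (depths or errors) of the two single-strand consensuses."""
    if len(a_values) != len(b_values):
        raise ValueError("per-base arrays must have the same length")
    return [a + b for a, b in zip(a_values, b_values)]


def per_read_summary(depths, errors):
    """Derive max depth, min depth and error rate from per-base arrays."""
    total = sum(depths)
    return {
        "max_depth": max(depths),
        "min_depth": min(depths),
        "error_rate": sum(errors) / total if total else 0.0,
    }


ad, bd = [3, 4, 5], [1, 2, 3]   # per-base depths for each strand
ae, be = [0, 1, 0], [0, 0, 1]   # per-base errors for each strand
duplex_depth = duplex_per_base(ad, bd)
duplex_errors = duplex_per_base(ae, be)
summary = per_read_summary(duplex_depth, duplex_errors)
```

Note that the stored single-strand values are capped at 32,767, so a sum computed this way can understate the true depth for extremely deep families.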
The --min-reads option can take 1-3 values, similar to FilterConsensusReads. For example:
CallDuplexConsensusReads ... --min-reads 10 5 3
If fewer than three values are supplied, the last value is repeated (i.e. 5 4 -> 5 4 4 and 1 -> 1 1 1). The first value applies to the final consensus read, the second value to one single-strand consensus, and the last value to the other single-strand consensus. It is required that if values two and three differ, the more stringent value comes earlier.
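The padding and ordering rule can be sketched as follows. This is an illustration of the rule as described here, under the assumption that "more stringent" means a higher minimum read count. `expand_min_reads` is a hypothetical name, not fgbio's implementation.

```python
def expand_min_reads(values):
    """Expand 1-3 min-reads values to (duplex, single-strand, single-strand)."""
    if not 1 <= len(values) <= 3:
        raise ValueError("min-reads takes between one and three values")
    # Repeat the last value until there are exactly three.
    padded = list(values) + [values[-1]] * (3 - len(values))
    _, ss_first, ss_second = padded
    # Assumption: a higher minimum is the more stringent one, so it must come first.
    if ss_first < ss_second:
        raise ValueError("the more stringent single-strand value must come earlier")
    return tuple(padded)
```

With this sketch, `expand_min_reads([5, 4])` gives `(5, 4, 4)`, `expand_min_reads([1])` gives `(1, 1, 1)`, and `expand_min_reads([10, 3, 5])` is rejected.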
Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
---|---|---|---|---|---|---|
input | i | PathToBam | The input SAM or BAM file. | Required | 1 | |
output | o | PathToBam | Output SAM or BAM file to write consensus reads. | Required | 1 | |
read-name-prefix | p | String | The prefix for all consensus read names. | Optional | 1 | |
read-group-id | R | String | The new read group ID for all the consensus reads. | Optional | 1 | A |
error-rate-pre-umi | 1 | PhredScore | The Phred-scaled error rate for an error prior to the UMIs being integrated. | Optional | 1 | 45 |
error-rate-post-umi | 2 | PhredScore | The Phred-scaled error rate for an error after the UMIs have been integrated. | Optional | 1 | 40 |
min-input-base-quality | m | PhredScore | Ignore bases in raw reads that have Q below this value. | Optional | 1 | 10 |
trim | t | Boolean | If true, quality trim input reads in addition to masking low Q bases. | Optional | 1 | false |
sort-order | S | SamOrder | The sort order of the output, the same as the input if not given. | Optional | 1 | |
min-reads | M | Int | The minimum number of input reads to a consensus read. | Required | 3 | 1 |
max-reads-per-strand | | Int | The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads. | Optional | 1 | |
threads | | Int | The number of threads to use while consensus calling. | Optional | 1 | 1 |
consensus-call-overlapping-bases | | Boolean | Consensus call overlapping bases in mapped paired end reads | Optional | 1 | true |
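The PhredScore options in the table are Phred-scaled, meaning a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below converts the default values to probabilities using that standard definition. The function name is hypothetical.

```python
def phred_to_probability(q: float) -> float:
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Defaults from the options table.
pre_umi = phred_to_probability(45)    # error-rate-pre-umi, about 3.16e-05
post_umi = phred_to_probability(40)   # error-rate-post-umi, about 1e-04
min_base = phred_to_probability(10)   # min-input-base-quality, about 0.1
```

So the default min-input-base-quality of 10 ignores raw bases with an estimated error probability above roughly 1 in 10.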