fgbio

Tools for working with genomic and high throughput sequencing data.

CallDuplexConsensusReads

Overview

Group: Unique Molecular Identifiers (UMIs)

Calls duplex consensus sequences from reads generated from the same double-stranded source molecule. Prior to running this tool, reads must have been grouped with GroupReadsByUmi using the paired strategy. Doing so will apply (by default) MI tags of the form */A and */B to all reads, where /A and /B suffixes on the same identifier denote reads derived from opposite strands of the same source duplex molecule.
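To make the MI tag convention concrete, here is a minimal Python sketch that splits a tag value into its molecule identifier and strand suffix. The function names split_mi and opposite_strands_of_same_molecule are hypothetical and are not part of fgbio; the sketch only assumes the */A and */B form described above.

```python
from typing import Tuple


def split_mi(mi: str) -> Tuple[str, str]:
    """Split an MI value such as '12/A' into (molecule_id, strand)."""
    molecule_id, sep, strand = mi.rpartition("/")
    if not sep or not molecule_id or strand not in ("A", "B"):
        raise ValueError(f"MI value lacks a /A or /B suffix: {mi!r}")
    return molecule_id, strand


def opposite_strands_of_same_molecule(mi1: str, mi2: str) -> bool:
    """True when two MI values share an identifier but differ in strand."""
    id1, strand1 = split_mi(mi1)
    id2, strand2 = split_mi(mi2)
    return id1 == id2 and strand1 != strand2
```

For example, opposite_strands_of_same_molecule("12/A", "12/B") is true, while the same call with "12/A" and "13/B" is false.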

Reads from the same unique molecule are first partitioned by source strand and assembled into single strand consensus molecules as described by CallMolecularConsensusReads. Subsequently, for molecules that have at least one observation of each strand, duplex consensus reads are assembled by combining the evidence from the two single strand consensus reads.
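The partitioning step can be pictured with a short Python sketch. It groups read names by molecule and strand and reports which molecules have at least one observation of each strand. This illustrates the grouping logic only, not fgbio's implementation; the function names and the (read_name, MI value) input shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def partition_by_strand(
    reads: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Group (read_name, mi) pairs into {molecule_id: {'A': [...], 'B': [...]}}."""
    families: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"A": [], "B": []}
    )
    for name, mi in reads:
        molecule_id, _, strand = mi.rpartition("/")
        if strand not in ("A", "B"):
            raise ValueError(f"MI value lacks a /A or /B suffix: {mi!r}")
        families[molecule_id][strand].append(name)
    return dict(families)


def molecules_with_both_strands(
    families: Dict[str, Dict[str, List[str]]]
) -> List[str]:
    """Return the molecule ids observed on both the A and the B strand."""
    return sorted(m for m, s in families.items() if s["A"] and s["B"])
```

Molecules seen on only one strand are absent from the list returned by molecules_with_both_strands, which mirrors the requirement that a duplex consensus needs evidence from both strands.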

Because of the nature of duplex sequencing, this tool does not support fragment reads; if found in the input, they are ignored. Similarly, read pairs for which a consensus read cannot be generated for one or the other read (R1 or R2) are omitted from the output.

The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus alignment. Consensus reads should therefore be aligned afterwards, which should not be too expensive as there are likely far fewer consensus reads than input raw reads. Please see how best to use this tool within the best-practice pipeline: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md

Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a), second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are one per read and lower case for values that are one per base.
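The naming pattern can be decoded mechanically, as in the Python sketch below. The lookup tables cover only the letters named in this documentation (d, m and e, plus c and q for the bases and qualities tags), and describe_tag is a hypothetical helper. It decodes the pattern but does not check that a given tag is actually emitted by the tool.

```python
from typing import Tuple

# First letter: which consensus the tag describes.
SOURCES = {
    "a": "first single-strand consensus",
    "b": "second single-strand consensus",
    "c": "duplex consensus",
}

# Second letter (case-insensitive): what the tag measures.
MEANINGS = {
    "d": "depth",
    "m": "min depth",
    "e": "errors/error-rate",
    "c": "bases",
    "q": "qualities",
}


def describe_tag(tag: str) -> Tuple[str, str, str]:
    """Decode a two-letter consensus tag into (source, meaning, scope)."""
    if len(tag) != 2 or tag[0] not in SOURCES or tag[1].lower() not in MEANINGS:
        raise ValueError(f"not a recognised consensus tag pattern: {tag!r}")
    # Upper case second letter: one value per read; lower case: one per base.
    scope = "per read" if tag[1].isupper() else "per base"
    return SOURCES[tag[0]], MEANINGS[tag[1].lower()], scope
```

For instance, describe_tag("aD") yields the first single-strand consensus, depth, per read, while describe_tag("be") yields the second single-strand consensus, errors/error-rate, per base.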

The tags break down into those that are single-valued per read:

consensus depth      [aD,bD,cD] (int)  : the maximum depth of raw reads at any point in the consensus reads
consensus min depth  [aM,bM,cM] (int)  : the minimum depth of raw reads at any point in the consensus reads
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls
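As a worked illustration of these three definitions, the Python sketch below derives a maximum depth, a minimum depth and an error rate from per-base depth and error counts for a single consensus read. It assumes the error rate is total disagreeing bases divided by total contributing bases, which matches the wording above but is not confirmed here as the exact calculation fgbio performs. The function name summarize_per_read is hypothetical.

```python
from typing import Dict, Sequence, Union


def summarize_per_read(
    depths: Sequence[int], errors: Sequence[int]
) -> Dict[str, Union[int, float]]:
    """Summarise per-base depths and error counts into per-read values."""
    if not depths or len(depths) != len(errors):
        raise ValueError("depths and errors must be non-empty and equal length")
    total_bases = sum(depths)
    return {
        "max_depth": max(depths),  # analogous to the D tags
        "min_depth": min(depths),  # analogous to the M tags
        # analogous to the E tags; 0.0 when no bases contributed at all
        "error_rate": sum(errors) / total_bases if total_bases else 0.0,
    }
```

With depths [4, 5, 3] and errors [0, 1, 0], the sketch gives a maximum depth of 5, a minimum depth of 3 and an error rate of 1/12.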

And those that have a value per base (duplex values are not generated, but can be generated by summing):

consensus depth  [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
consensus bases  [ac,bc] (string): the single-strand consensus bases
consensus quals  [aq,bq] (string): the single-strand consensus qualities
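Since per-base duplex values are not written, a downstream script can sum the two single-strand arrays position by position. The Python sketch below does this for plain lists of integers. It assumes the two arrays have equal length, and how the values are read from the BAM file is left out. The function name duplex_per_base is hypothetical.

```python
from typing import List, Sequence


def duplex_per_base(a_values: Sequence[int], b_values: Sequence[int]) -> List[int]:
    """Sum per-base values from the a and b single-strand consensus reads."""
    if len(a_values) != len(b_values):
        raise ValueError("single-strand arrays must have the same length")
    return [a + b for a, b in zip(a_values, b_values)]
```

For example, summing an ad array of [3, 3, 2] with a bd array of [1, 2, 2] gives a per-base duplex depth of [4, 5, 4]. The same function applies to the ae and be arrays.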

The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the min-input-base-quality are not counted in tag value calculations.

The --min-reads option can take 1-3 values, similar to FilterConsensusReads. For example:

CallDuplexConsensusReads ... --min-reads 10 5 3

If fewer than three values are supplied, the last value is repeated (i.e. 5 4 -> 5 4 4 and 1 -> 1 1 1). The first value applies to the final consensus read, the second value to one single-strand consensus, and the last value to the other single-strand consensus. It is required that if values two and three differ, the more stringent value comes earlier.
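The expansion rule can be written out as a small Python sketch. It repeats the last value until three are present and rejects input where the third value is more stringent (higher) than the second. The function name expand_min_reads is hypothetical and the error messages are illustrative, not the tool's own.

```python
from typing import Sequence, Tuple


def expand_min_reads(values: Sequence[int]) -> Tuple[int, int, int]:
    """Expand 1-3 --min-reads values into (duplex, strand_one, strand_two)."""
    if not 1 <= len(values) <= 3:
        raise ValueError("expected between one and three values")
    # Repeat the last supplied value until three values are present.
    expanded = list(values) + [values[-1]] * (3 - len(values))
    duplex, strand_one, strand_two = expanded
    # The more stringent (larger) single-strand minimum must come first.
    if strand_one < strand_two:
        raise ValueError("the second value must be >= the third value")
    return duplex, strand_one, strand_two
```

Under this sketch, [5, 4] expands to (5, 4, 4), [1] expands to (1, 1, 1), and [10, 3, 5] is rejected.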

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|---|---|---|---|---|---|---|
| input | i | PathToBam | The input SAM or BAM file. | Required | 1 | |
| output | o | PathToBam | Output SAM or BAM file to write consensus reads. | Required | 1 | |
| read-name-prefix | p | String | The prefix for all consensus read names. | Optional | 1 | |
| read-group-id | R | String | The new read group ID for all the consensus reads. | Optional | 1 | A |
| error-rate-pre-umi | 1 | PhredScore | The Phred-scaled error rate for an error prior to the UMIs being integrated. | Optional | 1 | 45 |
| error-rate-post-umi | 2 | PhredScore | The Phred-scaled error rate for an error after the UMIs have been integrated. | Optional | 1 | 40 |
| min-input-base-quality | m | PhredScore | Ignore bases in raw reads that have Q below this value. | Optional | 1 | 10 |
| trim | t | Boolean | If true, quality trim input reads in addition to masking low Q bases. | Optional | 1 | false |
| sort-order | S | SamOrder | The sort order of the output, the same as the input if not given. | Optional | 1 | |
| min-reads | M | Int | The minimum number of input reads to a consensus read. | Required | 3 | 1 |
| max-reads-per-strand | | Int | The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-reads reads. | Optional | 1 | |
| threads | | Int | The number of threads to use while consensus calling. | Optional | 1 | 1 |
| consensus-call-overlapping-bases | | Boolean | Consensus call overlapping bases in mapped paired end reads. | Optional | 1 | true |