fgbio

Tools for working with genomic and high throughput sequencing data.


CallCodecConsensusReads

Overview

Group: Unique Molecular Identifiers (UMIs)

Calls consensus sequences from reads generated by the CODEC protocol. For more information on the CODEC sequencing protocol and the resulting data, please refer to Bae et al. 2023 [1].

Prior to running this tool, reads must have been grouped with GroupReadsByUmi using the adjacency or identity strategy (NOT paired).

Reads from the same original duplex are collected, and the R1s and R2s are assembled into single-strand consensus reads as described by CallMolecularConsensusReads. Subsequently, a single consensus read is generated that includes both the single-strand regions, if any, and the double-stranded region of the template.

The consensus reads produced are unaligned, because inferring a consensus alignment is difficult and error-prone. Consensus reads should therefore be aligned after consensus calling; this is usually inexpensive, as there are likely to be significantly fewer consensus reads than raw input reads. For how best to use this tool within the best-practice pipeline, see: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md
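For orientation, the consensus-calling step might be launched from a script roughly as follows. This is a minimal sketch, not an official recipe: it assumes the launcher is available on the PATH as `fgbio`, that arguments are passed in long form as `--name value` using the names from the Arguments table, and that the file names are hypothetical placeholders. The helper `build_codec_consensus_command` is invented for illustration and only assembles the argument list; it does not run anything.

```python
from typing import List, Optional


def build_codec_consensus_command(
    input_bam: str,
    output_bam: str,
    min_read_pairs: int = 1,
    min_duplex_length: int = 1,
    threads: int = 1,
    rejects_bam: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for CallCodecConsensusReads (hypothetical helper)."""
    command = [
        "fgbio",  # assumption: name of the fgbio launcher on the PATH
        "CallCodecConsensusReads",
        "--input", input_bam,      # must already be grouped by GroupReadsByUmi
        "--output", output_bam,    # consensus reads are written unaligned
        "--min-read-pairs", str(min_read_pairs),
        "--min-duplex-length", str(min_duplex_length),
        "--threads", str(threads),
    ]
    if rejects_bam is not None:
        # Optional file receiving the reads that were not used.
        command += ["--rejects", rejects_bam]
    return command


# Hypothetical file names, for illustration only.
example_command = build_codec_consensus_command(
    "grouped.bam", "consensus.unmapped.bam", min_read_pairs=2
)
print(" ".join(example_command))
```

The resulting list could be handed to `subprocess.run(example_command, check=True)`; the unaligned output would then go on to the alignment step described in the best-practice pipeline.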

Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow a pattern where the first letter (a, b or c) denotes that the tag applies to the first single strand consensus (a), second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are one per read and lower case for values that are one per base.
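The naming pattern can be made concrete with a small decoder. The sketch below is illustrative only: `describe_tag` and `TagInfo` are hypothetical names, the letter meanings are taken from this page, and the function decodes the naming pattern without checking whether the tool actually emits a given combination.

```python
from typing import NamedTuple

# First letter: which consensus the tag describes.
SOURCES = {
    "a": "first single-strand consensus",
    "b": "second single-strand consensus",
    "c": "final duplex consensus",
}

# Second letter (case-insensitive): what the tag measures.
MEANINGS = {
    "d": "depth",
    "m": "min depth",
    "e": "errors/error-rate",
    "c": "consensus bases",
    "q": "consensus qualities",
}


class TagInfo(NamedTuple):
    source: str
    meaning: str
    per_base: bool  # True for lower-case second letters (one value per base)


def describe_tag(tag: str) -> TagInfo:
    """Decode a two-letter consensus tag name according to the naming pattern."""
    if len(tag) != 2 or tag[0] not in SOURCES or tag[1].lower() not in MEANINGS:
        raise ValueError(f"not a recognised consensus tag name: {tag!r}")
    return TagInfo(SOURCES[tag[0]], MEANINGS[tag[1].lower()], tag[1].islower())


for name in ("aD", "bd", "cE"):
    print(name, describe_tag(name))
```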

The tags break down into those that are single-valued per read:

consensus depth      [aD,bD,cD] (int)  : the maximum depth of raw reads at any point in the consensus reads
consensus min depth  [aM,bM,cM] (int)  : the minimum depth of raw reads at any point in the consensus reads
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls

And those that have a value per base (duplex values are not generated, but for depths and error counts they can be obtained by summing the two single-strand values):

consensus depth  [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
consensus bases  [ac,bc] (string) : the single-strand consensus bases
consensus quals  [aq,bq] (string) : the single-strand consensus qualities

The per-base depths and error counts are both capped at 32,767. In all cases no-calls (Ns) and bases below the min-input-base-quality are not counted in tag value calculations.
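Summing the single-strand per-base values to obtain duplex values could look like the sketch below. It assumes the `ad`/`bd` and `ae`/`be` arrays have already been read from a consensus record and that both strands' arrays cover the same positions of the consensus read; the numbers are hypothetical, and the helper names are invented for illustration.

```python
from typing import List, Sequence


def sum_per_base(a_values: Sequence[int], b_values: Sequence[int]) -> List[int]:
    """Add the per-base values of the two single-strand consensuses."""
    if len(a_values) != len(b_values):
        raise ValueError("per-base arrays must have the same length")
    return [a + b for a, b in zip(a_values, b_values)]


def error_rate(errors: Sequence[int], depths: Sequence[int]) -> float:
    """Fraction of contributing raw bases that disagree with the consensus."""
    total_depth = sum(depths)
    return sum(errors) / total_depth if total_depth > 0 else 0.0


# Hypothetical tag values for one consensus read of length 4.
ad, bd = [3, 3, 2, 0], [2, 3, 3, 1]   # per-base depths
ae, be = [0, 1, 0, 0], [0, 0, 0, 0]   # per-base error counts

duplex_depth = sum_per_base(ad, bd)
duplex_errors = sum_per_base(ae, be)

print(duplex_depth)
print(duplex_errors)
print(error_rate(duplex_errors, duplex_depth))
```

Because each single-strand value is capped at 32,767, a summed value at a very deep position is a lower bound rather than an exact count.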

[1] https://doi.org/10.1038/s41588-023-01376-0

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|---|---|---|---|---|---|---|
| input | i | PathToBam | The input SAM or BAM file. | Required | 1 | |
| output | o | PathToBam | Output SAM or BAM file to write consensus reads. | Required | 1 | |
| rejects | r | PathToBam | Optional output SAM or BAM file to write reads not used. | Optional | 1 | |
| stats | s | FilePath | Optional output text file of key consensus calling statistics. | Optional | 1 | |
| read-name-prefix | p | String | The prefix for all consensus read names. | Optional | 1 | |
| read-group-id | R | String | The new read group ID for all the consensus reads. | Optional | 1 | A |
| error-rate-pre-umi | 1 | PhredScore | The Phred-scaled error rate for an error prior to the UMIs being integrated. | Optional | 1 | 45 |
| error-rate-post-umi | 2 | PhredScore | The Phred-scaled error rate for an error after the UMIs have been integrated. | Optional | 1 | 40 |
| min-input-base-quality | m | PhredScore | Ignore bases in raw reads that have Q below this value. | Optional | 1 | 10 |
| sort-order | S | SamOrder | The sort order of the output; the same as the input if not given. | Optional | 1 | |
| min-read-pairs | M | Int | The minimum number of codec read pairs to form a consensus read. | Optional | 1 | 1 |
| max-read-pairs | | Int | The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-read-pairs reads. | Optional | 1 | |
| min-duplex-length | d | Int | Minimum length of the duplex region (where R1 and R2 overlap). | Optional | 1 | 1 |
| single-strand-qual | q | PhredScore | Reduce quality scores in single-stranded regions of the consensus read to the given quality. | Optional | 1 | |
| outer-bases-qual | Q | PhredScore | Reduce the first and last outer-bases-length bases to the given quality. | Optional | 1 | |
| outer-bases-length | O | Int | The number of bases at the start and end of the read to reduce quality over if outer-bases-qual is specified. | Optional | 1 | 5 |
| max-duplex-disagreement-rate | x | Double | Discard consensus reads where greater than this fraction of duplex bases disagree. | Optional | 1 | 1.0 |
| max-duplex-disagreements | X | Int | Discard consensus reads where greater than this number of duplex bases disagree. | Optional | 1 | 2147483647 |
| threads | | Int | The number of threads to use while consensus calling. | Optional | 1 | 1 |
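Several arguments above take a PhredScore. As a reminder of what those values mean, the standard Phred relationship is p = 10^(-Q/10), where p is the error probability. The sketch below applies it to the default values listed in the table; the function name `phred_to_probability` is a hypothetical helper, not part of fgbio.

```python
def phred_to_probability(phred: float) -> float:
    """Convert a Phred-scaled score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-phred / 10.0)


# Default values taken from the Arguments table.
defaults = {
    "error-rate-pre-umi": 45,
    "error-rate-post-umi": 40,
    "min-input-base-quality": 10,
}

for name, phred in defaults.items():
    print(f"{name}: Q{phred} -> p = {phred_to_probability(phred):.2e}")
```

So the default pre-UMI error rate of Q45 corresponds to roughly 3 errors per 100,000 bases, the post-UMI default of Q40 to 1 in 10,000, and the default input base quality threshold of Q10 to 1 in 10.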