CallCodecConsensusReads

Overview

Group: Unique Molecular Identifiers (UMIs)

Calls consensus sequences from reads generated by the CODEC protocol. For more information on the CODEC sequencing protocol and the resulting data, please refer to Bae et al. 2023 [1].

Prior to running this tool, reads must have been grouped with GroupReadsByUmi using the adjacency or identity strategy (NOT paired).

Reads from the same original duplex are collected, and the R1s and R2s are assembled into single-strand consensus reads as described by CallMolecularConsensusReads. Subsequently, a single consensus read is generated that includes any single-strand regions as well as the double-stranded region of the template.

The consensus reads produced are unaligned, because inferring the consensus alignment is difficult and error-prone. Consensus reads should therefore be aligned afterwards, which should not be too expensive, as there are likely to be significantly fewer consensus reads than raw input reads. For how best to use this tool within the best-practice pipeline, please see: https://github.com/fulcrumgenomics/fgbio/blob/main/docs/best-practice-consensus-pipeline.md

Consensus reads have a number of additional optional tags set in the resulting BAM file. The tag names follow a pattern where the first letter (a, b or c) denotes whether the tag applies to the first single-strand consensus (a), the second single-strand consensus (b) or the final duplex consensus (c). The second letter is intended to capture the meaning of the tag (e.g. d=depth, m=min depth, e=errors/error-rate) and is upper case for values that are one per read and lower case for values that are one per base.
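The naming pattern above can be illustrated with a small decoder. This is only a sketch: `describe_tag` and its lookup tables are hypothetical names introduced here, not part of fgbio, and the `c` and `q` second letters are taken from the per-base tag list in this document. The function interprets the naming pattern only; it does not check that the tool actually emits a given tag.

```python
# Hypothetical helper (not part of fgbio): decode a two-letter consensus tag.
STRANDS = {
    "a": "first single-strand consensus",
    "b": "second single-strand consensus",
    "c": "duplex consensus",
}
MEANINGS = {
    "d": "depth",
    "m": "min depth",
    "e": "errors/error-rate",
    "c": "consensus bases",      # from the per-base tag list (ac, bc)
    "q": "consensus qualities",  # from the per-base tag list (aq, bq)
}


def describe_tag(tag: str) -> dict:
    """Interpret a tag name according to the documented naming pattern."""
    if len(tag) != 2:
        raise ValueError(f"expected a two-letter tag, got {tag!r}")
    strand, kind = tag[0], tag[1]
    if strand not in STRANDS or kind.lower() not in MEANINGS:
        raise ValueError(f"tag {tag!r} does not follow the naming pattern")
    return {
        "applies_to": STRANDS[strand],
        "meaning": MEANINGS[kind.lower()],
        # Upper case second letter: one value per read; lower case: one per base.
        "scope": "per read" if kind.isupper() else "per base",
    }


for name in ("aD", "be", "cE"):
    print(name, describe_tag(name))
```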

The tags break down into those that are single-valued per read:

consensus depth      [aD,bD,cD] (int)  : the maximum depth of raw reads at any point in the consensus reads
consensus min depth  [aM,bM,cM] (int)  : the minimum depth of raw reads at any point in the consensus reads
consensus error rate [aE,bE,cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls

And those that have a value per base (duplex values are not generated, but can be generated by summing):

consensus depth  [ad,bd] (short[]): the count of bases contributing to each single-strand consensus read at each position
consensus errors [ae,be] (short[]): the count of bases from raw reads disagreeing with the final single-strand consensus base
consensus bases  [ac,bc] (string) : the single-strand consensus bases
consensus quals  [aq,bq] (string) : the single-strand consensus qualities

The per base depths and error counts are both capped at 32,767. In all cases no-calls (Ns) and bases below the min-input-base-quality are not counted in tag value calculations.
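Since per-base duplex values are not written but can be obtained by summing, the following sketch shows one way to do that with plain Python lists. It assumes the `ad`/`bd` and `ae`/`be` arrays have already been read from a record, have equal length, and refer to the same consensus positions; the function names and example values are hypothetical. The summary it derives is an illustration and is not necessarily how the tool computes `cD`, `cM` or `cE`.

```python
from typing import Dict, List, Sequence

MAX_PER_BASE = 32767  # cap on stored per-base depths and error counts


def sum_per_base(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Add the a-strand and b-strand values position by position."""
    if len(a) != len(b):
        raise ValueError("per-base arrays must have the same length")
    # The sums are plain Python ints, so they are not capped here.
    return [x + y for x, y in zip(a, b)]


def summarize(depths: Sequence[int], errors: Sequence[int]) -> Dict[str, float]:
    """Derive per-read style summaries from per-base depths and errors."""
    if not depths:
        raise ValueError("no per-base values to summarize")
    total_depth = sum(depths)
    return {
        "max_depth": max(depths),
        "min_depth": min(depths),
        "error_rate": sum(errors) / total_depth if total_depth else 0.0,
    }


# Made-up example values for one consensus read of four bases.
ad, bd = [3, 3, 2, 0], [0, 2, 3, 3]
ae, be = [0, 1, 0, 0], [0, 0, 1, 0]

duplex_depth = sum_per_base(ad, bd)
duplex_errors = sum_per_base(ae, be)
summary = summarize(duplex_depth, duplex_errors)
print(duplex_depth, duplex_errors, summary)
```

Because stored values are capped at 32,767, a summed value built from a capped input is a lower bound rather than an exact count.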

[1] https://doi.org/10.1038/s41588-023-01376-0

Arguments

Name	Flag	Type	Description	Required?	Max # of Values	Default Value(s)
input	i	PathToBam	The input SAM or BAM file.	Required	1
output	o	PathToBam	Output SAM or BAM file to write consensus reads.	Required	1
rejects	r	PathToBam	Optional output SAM or BAM file to write reads not used.	Optional	1
stats	s	FilePath	Optional output text file of key consensus calling statistics.	Optional	1
read-name-prefix	p	String	The prefix for all consensus read names.	Optional	1
read-group-id	R	String	The new read group ID for all the consensus reads.	Optional	1	A
error-rate-pre-umi	1	PhredScore	The Phred-scaled error rate for an error prior to the UMIs being integrated.	Optional	1	45
error-rate-post-umi	2	PhredScore	The Phred-scaled error rate for an error post the UMIs have been integrated.	Optional	1	40
min-input-base-quality	m	PhredScore	Ignore bases in raw reads that have Q below this value.	Optional	1	10
sort-order	S	SamOrder	The sort order of the output, the same as the input if not given.	Optional	1
min-read-pairs	M	Int	The minimum number of codec read pairs to form a consensus read.	Optional	1	1
max-read-pairs		Int	The maximum number of reads to use when building a single-strand consensus. If more than this many reads are present in a tag family, the family is randomly downsampled to exactly max-read-pairs reads.	Optional	1
min-duplex-length	d	Int	Minimum length of the duplex region (where R1 and R2 overlap).	Optional	1	1
single-strand-qual	q	PhredScore	Reduce quality scores in single stranded regions of the consensus read to the given quality.	Optional	1
outer-bases-qual	Q	PhredScore	Reduce the first and last `outer-bases-length` bases to the given quality.	Optional	1
outer-bases-length	O	Int	The number of bases at the start and end of the read to reduce quality over if `outer-bases-qual` is specified.	Optional	1	5
max-duplex-disagreement-rate	x	Double	Discard consensus reads where greater than this fraction of duplex bases disagree.	Optional	1	1.0
max-duplex-disagreements	X	Int	Discard consensus reads where greater than this number of duplex bases disagree.	Optional	1	2147483647
cell-tag	c	String	Tag containing the cellular barcodes.	Optional	1	CB
threads		Int	The number of threads to use while consensus calling.	Optional	1	1
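As an illustration of how the arguments in the table combine, the sketch below assembles a command line in Python. It is a sketch under assumptions: `build_command` is a hypothetical helper, the file names are placeholders, and the `fgbio` executable name assumes a wrapper script on the `PATH` (installations that run the jar directly will need a different prefix). Only long option names listed in the table are used.

```python
import shlex
from typing import List, Optional


def build_command(
    input_bam: str,
    output_bam: str,
    *,
    min_read_pairs: int = 1,
    min_duplex_length: int = 1,
    max_duplex_disagreement_rate: Optional[float] = None,
    threads: int = 1,
    executable: str = "fgbio",  # assumption: wrapper script available on PATH
) -> List[str]:
    """Return an argument list for CallCodecConsensusReads."""
    cmd = [
        executable,
        "CallCodecConsensusReads",
        f"--input={input_bam}",
        f"--output={output_bam}",
        f"--min-read-pairs={min_read_pairs}",
        f"--min-duplex-length={min_duplex_length}",
        f"--threads={threads}",
    ]
    if max_duplex_disagreement_rate is not None:
        # The option is a fraction of duplex bases, so keep it within [0, 1].
        if not 0.0 <= max_duplex_disagreement_rate <= 1.0:
            raise ValueError("max_duplex_disagreement_rate must be in [0, 1]")
        cmd.append(f"--max-duplex-disagreement-rate={max_duplex_disagreement_rate}")
    return cmd


# Placeholder file names; the input is assumed to be GroupReadsByUmi output.
command = build_command(
    "grouped.bam",
    "consensus.unmapped.bam",
    min_duplex_length=20,
    max_duplex_disagreement_rate=0.05,
)
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)`; since the output is unaligned, an alignment step would follow.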