fgbio

Tools for working with genomic and high throughput sequencing data.

FilterConsensusReads

Overview

Group: Unique Molecular Identifiers (UMIs)

Filters consensus reads generated by CallMolecularConsensusReads or CallDuplexConsensusReads. Two kinds of filtering are performed:

  1. Masking/filtering of individual bases in reads
  2. Filtering out of reads (i.e. not writing them to the output file)

Base-level filtering/masking is only applied if per-base tags are present (see CallDuplexConsensusReads and CallMolecularConsensusReads for descriptions of these tags). Read-level filtering is always applied. When filtering reads, secondary alignments and supplementary records may be removed independently if they fail one or more filters; if either R1 or R2 primary alignments fail a filter then all records for the template will be filtered out.
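
The template-level rule above can be illustrated with a minimal Python sketch. This is not fgbio's implementation; the `Record` class, its fields, and `filter_template` are hypothetical names introduced only for illustration, and each record is assumed to already carry a pass/fail verdict from the read-level filters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    label: str        # hypothetical label, e.g. "R1", "R2", "R1-supplementary"
    is_primary: bool  # False for secondary alignments and supplementary records
    passes: bool      # assumed to be the combined verdict of the read-level filters


def filter_template(records: List[Record]) -> List[Record]:
    """Return the records of one template that would be kept."""
    # If any primary alignment (R1 or R2) fails, the whole template is dropped.
    if any(r.is_primary and not r.passes for r in records):
        return []
    # Otherwise non-primary records are dropped individually when they fail.
    return [r for r in records if r.passes]


template = [
    Record("R1", True, True),
    Record("R2", True, True),
    Record("R1-supplementary", False, False),
]
kept = [r.label for r in filter_template(template)]
```

Here `kept` contains only `R1` and `R2`: the failing supplementary record is removed on its own, while a failing primary would have emptied the list.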

The filters applied are as follows:

  1. Reads with fewer than min-reads contributing reads are filtered out
  2. Reads with an average consensus error rate higher than max-read-error-rate are filtered out
  3. Reads whose mean base quality, computed prior to any masking, is less than min-mean-base-quality are filtered out (if specified)
  4. Bases with quality scores below min-base-quality are masked to Ns
  5. Bases with fewer than min-reads contributing raw reads are masked to Ns
  6. Bases with a consensus error rate (defined as the fraction of contributing reads that voted for a different base than the consensus call) higher than max-base-error-rate are masked to Ns
  7. For duplex reads, if require-single-strand-agreement is provided, bases that were observed in both single-strand consensus reads but on which the two reads did not agree are masked to Ns
  8. Reads with a proportion of Ns higher than max-no-call-fraction after per-base filtering are filtered out
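
As a rough illustration of the per-base masking rules (4 to 6) and the no-call filter (8), the following Python sketch applies them to a toy read. It is a simplified model rather than fgbio's code: the `ConsensusRead` class and its fields are hypothetical stand-ins for the per-base tags, and a single threshold is used for each parameter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConsensusRead:
    bases: str
    quals: List[int]   # per-base consensus quality
    depths: List[int]  # per-base number of contributing raw reads
    errors: List[int]  # per-base number of raw reads disagreeing with the consensus


def mask_bases(read: ConsensusRead, min_reads: int,
               max_base_error_rate: float, min_base_quality: int) -> str:
    """Mask to N every base that fails any of the per-base thresholds."""
    masked = []
    for base, qual, depth, err in zip(read.bases, read.quals, read.depths, read.errors):
        # Assumption: a base with zero depth is treated as having no measurable error.
        error_rate = err / depth if depth > 0 else 0.0
        fails = (qual < min_base_quality
                 or depth < min_reads
                 or error_rate > max_base_error_rate)
        masked.append("N" if fails else base)
    return "".join(masked)


def passes_no_call_filter(bases: str, max_no_call_fraction: float) -> bool:
    """A read is kept unless its proportion of Ns is higher than the limit."""
    if not bases:
        return False
    return bases.count("N") / len(bases) <= max_no_call_fraction


read = ConsensusRead("ACGTA", [40, 10, 40, 40, 40], [5, 5, 2, 5, 10], [0, 0, 0, 1, 0])
masked = mask_bases(read, min_reads=3, max_base_error_rate=0.1, min_base_quality=20)
keep = passes_no_call_filter(masked, max_no_call_fraction=0.2)
```

In this toy case the second base fails on quality, the third on depth, and the fourth on error rate, so `masked` is `ANNNA` and the read is then rejected because three of its five bases are no-calls.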

When filtering single-UMI consensus reads generated by CallMolecularConsensusReads, a single value should be supplied for each of --min-reads, --max-read-error-rate, and --max-base-error-rate.

When filtering duplex consensus reads generated by CallDuplexConsensusReads each of the three parameters may independently take 1-3 values. For example:

FilterConsensusReads ... --min-reads 10 5 3 --max-base-error-rate 0.1

In each case, if fewer than three values are supplied, the last value is repeated (e.g. 80 40 -> 80 40 40 and 0.1 -> 0.1 0.1 0.1). The first value applies to the final consensus read, the second value to one single-strand consensus, and the last value to the other single-strand consensus. It is required that if values two and three differ, the more stringent value comes earlier.
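
The value-expansion rule can be sketched in Python as follows. The function names are hypothetical and the checks cover only what is stated above (the ordering of values two and three), under the assumption that "more stringent" means a higher value for --min-reads and a lower value for the error-rate parameters.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T", int, float)


def expand_to_three(values: Sequence[T]) -> List[T]:
    """Repeat the last value until there are exactly three."""
    if not 1 <= len(values) <= 3:
        raise ValueError("expected between one and three values")
    expanded = list(values)
    while len(expanded) < 3:
        expanded.append(expanded[-1])
    return expanded


def check_min_reads(values: Sequence[int]) -> List[int]:
    """For min-reads, a higher value is assumed to be the more stringent one."""
    final, first_strand, second_strand = expand_to_three(values)
    if first_strand < second_strand:
        raise ValueError("the more stringent (higher) min-reads value must come earlier")
    return [final, first_strand, second_strand]


def check_max_error_rate(values: Sequence[float]) -> List[float]:
    """For error rates, a lower value is assumed to be the more stringent one."""
    final, first_strand, second_strand = expand_to_three(values)
    if first_strand > second_strand:
        raise ValueError("the more stringent (lower) error rate must come earlier")
    return [final, first_strand, second_strand]


min_reads = check_min_reads([80, 40])         # expands to [80, 40, 40]
base_error = check_max_error_rate([0.1])      # expands to [0.1, 0.1, 0.1]
```

With this sketch, `check_min_reads([10, 3, 5])` raises a `ValueError` because the third value is more stringent than the second.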

In order to correctly filter reads in or out by template, the input BAM must be either queryname sorted or query grouped. If your BAM is not already in an appropriate order, it can be sorted in streaming fashion with:

samtools sort -n -u in.bam | fgbio FilterConsensusReads -i /dev/stdin ...
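
To make the ordering requirement concrete, here is a small Python sketch of what "query grouped" means: all records sharing a read name are adjacent, even if the names themselves are not sorted. The function is a hypothetical helper operating on a plain list of read names, not part of fgbio or samtools.

```python
from typing import Iterable


def is_query_grouped(names: Iterable[str]) -> bool:
    """Return True if every read name appears in one contiguous run."""
    seen = set()
    previous = None
    for name in names:
        if name != previous:
            # A name that reappears after a different name breaks the grouping.
            if name in seen:
                return False
            seen.add(name)
            previous = name
    return True


grouped = is_query_grouped(["q2", "q2", "q1", "q1", "q1"])
scattered = is_query_grouped(["q1", "q2", "q1"])
```

The first list is grouped although not sorted by name, whereas the second splits the records of `q1` and would prevent template-level filtering from seeing all of them together.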

The output sort order may be specified with --sort-order. If not given, the output will be in the same order as the input when the input is queryname sorted or query grouped, and in queryname order otherwise.

The --reverse-per-base-tags option controls whether per-base tags should be reversed before being used on reads marked as being mapped to the negative strand. This is necessary if the reads have been mapped and the bases/quals reversed but the consensus tags have not. If true, the tags written to the output BAM will be reversed where necessary in order to line up with the bases and quals.
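
The effect of this option on a numeric per-base tag can be sketched in Python as below. The function and its parameters are hypothetical, and the sketch covers only plain reversal of numeric values such as per-base depths.

```python
from typing import List


def align_per_base_tag(values: List[int], is_negative_strand: bool,
                       reverse_per_base_tags: bool) -> List[int]:
    """Return per-base tag values in the same orientation as the stored bases."""
    if reverse_per_base_tags and is_negative_strand:
        # Bases and quals were reversed by the aligner, so the tag must follow.
        return list(reversed(values))
    return list(values)


depths = [3, 5, 9]
aligned = align_per_base_tag(depths, is_negative_strand=True, reverse_per_base_tags=True)
untouched = align_per_base_tag(depths, is_negative_strand=False, reverse_per_base_tags=True)
```

Only reads on the negative strand are affected: `aligned` is `[9, 5, 3]`, while `untouched` keeps the original order.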

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|---|---|---|---|---|---|---|
| input | i | PathToBam | The input SAM or BAM file of consensus reads. | Required | 1 | |
| output | o | PathToBam | Output SAM or BAM file. | Required | 1 | |
| ref | r | PathToFasta | Reference fasta file. | Required | 1 | |
| reverse-per-base-tags | R | Boolean | Reverse [complement] per base tags on reverse strand reads. | Optional | 1 | false |
| min-reads | M | Int | The minimum number of reads supporting a consensus base/read. | Required | 3 | |
| max-read-error-rate | E | Double | The maximum raw-read error rate across the entire consensus read. | Required | 3 | 0.025 |
| max-base-error-rate | e | Double | The maximum error rate for a single consensus base. | Required | 3 | 0.1 |
| min-base-quality | N | PhredScore | Mask (make N) consensus bases with quality less than this threshold. | Required | 1 | |
| max-no-call-fraction | n | Double | Maximum fraction of no-calls in the read after filtering. | Optional | 1 | 0.2 |
| min-mean-base-quality | q | PhredScore | The minimum mean base quality across the consensus read. | Optional | 1 | |
| require-single-strand-agreement | s | Boolean | Mask (make N) consensus bases where the AB and BA consensus reads disagree (for duplex-sequencing only). | Optional | 1 | false |
| sort-order | S | SamOrder | The sort order of the output. If not given, output will be in the same order as input if the input is query name sorted or query grouped, otherwise queryname order. | Optional | 1 | |