fgbio

Tools for working with genomic and high throughput sequencing data.

CollectDuplexSeqMetrics

Overview

Group: Unique Molecular Identifiers (UMIs)

Collects a suite of metrics to QC duplex sequencing data.

Inputs

The input to this tool must be a BAM file that is either:

  1. The exact BAM output by the GroupReadsByUmi tool (in the sort-order it was produced in)
  2. A BAM file that has MI tags present on all reads (usually set by GroupReadsByUmi) and has been sorted with SortBam into TemplateCoordinate order.

Calculation of metrics may be restricted to a set of regions using the --intervals parameter. This can significantly affect results as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment.

Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of “duplex” is controlled by the --min-ab-reads and --min-ba-reads parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting --min-ab-reads=3 --min-ba-reads=3. If different thresholds are used then --min-ab-reads must be the higher value.
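
The thresholding described above can be sketched in Python as follows. This is an illustration only, not fgbio's implementation: the function name `is_duplex` is hypothetical, and it assumes that the strand with more reads is compared against --min-ab-reads and the other strand against --min-ba-reads, which would be consistent with the requirement that --min-ab-reads be the higher value.

```python
def is_duplex(ab_reads: int, ba_reads: int, min_ab: int = 1, min_ba: int = 1) -> bool:
    """Return True if a tag family counts as a duplex under the given thresholds.

    Assumption (not confirmed by the documentation): the strand with more
    reads is checked against min_ab and the other strand against min_ba.
    """
    if min_ab < min_ba:
        raise ValueError("min_ab must be greater than or equal to min_ba")
    larger = max(ab_reads, ba_reads)
    smaller = min(ab_reads, ba_reads)
    return larger >= min_ab and smaller >= min_ba


# Default thresholds: at least one read from each strand.
print(is_duplex(5, 1))                      # True
print(is_duplex(5, 0))                      # False
# Stricter thresholds, as with --min-ab-reads=3 --min-ba-reads=3.
print(is_duplex(5, 1, min_ab=3, min_ba=3))  # False
```

With the defaults a family with five reads on one strand and one on the other is a duplex; raising both thresholds to 3 excludes it.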

Outputs

The following output files are produced:

  1. **.family_sizes.txt**: metrics on the frequency of different types of families of different sizes
  2. **.duplex_family_sizes.txt**: metrics on the frequency of duplex tag families by the number of observations from each strand
  3. **.duplex_yield_metrics.txt**: summary QC metrics produced using 5%, 10%, 15%...100% of the data
  4. **.umi_counts.txt**: metrics on the frequency of observations of UMIs within reads and tag families
  5. **.duplex_qc.pdf**: a series of plots generated from the preceding metrics files for visualization
  6. **.duplex_umi_counts.txt**: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced _if_ the `--duplex-umi-counts` option is used as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
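
When scripting around this tool it can be handy to check that the files listed above were written. The following is a minimal sketch, assuming each suffix is appended directly to the value passed as the output prefix; the helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of fgbio.

```python
from pathlib import Path
from typing import List

# Suffixes taken from the list of output files above.
ALWAYS_WRITTEN = [
    ".family_sizes.txt",
    ".duplex_family_sizes.txt",
    ".duplex_yield_metrics.txt",
    ".umi_counts.txt",
    ".duplex_qc.pdf",
]
# Only produced when --duplex-umi-counts is used.
OPTIONAL = ".duplex_umi_counts.txt"


def expected_outputs(prefix: str, duplex_umi_counts: bool = False) -> List[Path]:
    """Build the output paths expected for a given output prefix."""
    suffixes = list(ALWAYS_WRITTEN)
    if duplex_umi_counts:
        suffixes.append(OPTIONAL)
    # Assumption: the suffix is concatenated onto the prefix as-is.
    return [Path(prefix + suffix) for suffix in suffixes]


def missing_outputs(prefix: str, duplex_umi_counts: bool = False) -> List[Path]:
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_outputs(prefix, duplex_umi_counts) if not p.exists()]


for path in expected_outputs("out/sample", duplex_umi_counts=True):
    print(path)
```

Note that the PDF depends on a working R installation, so a missing `.duplex_qc.pdf` may point to the R setup and not to the metrics themselves.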

Within the metrics files the prefixes CS, SS and DS are used to mean:

Requirements

For plots to be generated R must be installed and the ggplot2 package installed with suggested dependencies. Successfully executing the following in R will ensure a working installation:

install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE)

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|---|---|---|---|---|---|---|
| input | i | PathToBam | Input BAM file generated by GroupReadsByUmi. | Required | 1 | |
| output | o | PathPrefix | Prefix of output files to write. | Required | 1 | |
| intervals | l | PathToIntervals | Optional set of intervals over which to restrict analysis. | Optional | 1 | |
| description | d | String | Description of data set used to label plots. Defaults to sample/library. | Optional | 1 | |
| duplex-umi-counts | u | Boolean | If true, produce the .duplex_umi_counts.txt file with counts of duplex UMI observations. | Optional | 1 | false |
| min-ab-reads | a | Int | Minimum AB reads to call a tag family a ‘duplex’. | Optional | 1 | 1 |
| min-ba-reads | b | Int | Minimum BA reads to call a tag family a ‘duplex’. | Optional | 1 | 1 |
| umi-tag | t | String | The tag containing the raw UMI. | Optional | 1 | RX |
| mi-tag | T | String | The output tag for UMI grouping. | Optional | 1 | MI |
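
As an illustration of how the arguments fit together, the sketch below assembles a command line in Python. The long option names come from the argument list above and the `--name=value` form follows the `--min-ab-reads=3` example given earlier. The `fgbio` launcher name, the `=true` form for the Boolean option, and the `build_command` helper itself are assumptions; adjust them to match how fgbio is installed and invoked on your system.

```python
import shlex
from typing import List, Optional


def build_command(
    input_bam: str,
    output_prefix: str,
    intervals: Optional[str] = None,
    min_ab_reads: int = 1,
    min_ba_reads: int = 1,
    duplex_umi_counts: bool = False,
) -> List[str]:
    """Assemble an argument list for CollectDuplexSeqMetrics (hypothetical helper)."""
    if min_ab_reads < min_ba_reads:
        # The documentation requires --min-ab-reads to be the higher value.
        raise ValueError("min_ab_reads must be >= min_ba_reads")
    cmd = [
        "fgbio",  # assumed launcher name; may be `java -jar fgbio.jar` instead
        "CollectDuplexSeqMetrics",
        f"--input={input_bam}",
        f"--output={output_prefix}",
        f"--min-ab-reads={min_ab_reads}",
        f"--min-ba-reads={min_ba_reads}",
    ]
    if intervals is not None:
        cmd.append(f"--intervals={intervals}")
    if duplex_umi_counts:
        cmd.append("--duplex-umi-counts=true")  # assumed Boolean syntax
    return cmd


cmd = build_command("grouped.bam", "out/sample", min_ab_reads=3, min_ba_reads=3)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the launcher and paths have been confirmed.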