CollectDuplexSeqMetrics

Overview

Group: Unique Molecular Identifiers (UMIs)

Collects a suite of metrics to QC duplex sequencing data.

Inputs

The input to this tool must be a BAM file that is either:

The exact BAM output by the GroupReadsByUmi tool (in the sort-order it was produced in)
A BAM file that has MI tags present on all reads (usually set by GroupReadsByUmi) and has been sorted with SortBam into TemplateCoordinate order.

Calculation of metrics may be restricted to a set of regions using the --intervals parameter. This can significantly affect results as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment.

Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of “duplex” is controlled by the --min-ab-reads and --min-ba-reads parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting --min-ab-reads=3 --min-ba-reads=3. If different thresholds are used then --min-ab-reads must be the higher value.
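The duplex rule described above can be sketched in Python. This is a minimal illustration and not the tool's own implementation: the function name `is_duplex` is hypothetical, and the sketch assumes that the strand with more reads is the one labelled AB, which would be consistent with the requirement that --min-ab-reads be the higher value.

```python
def is_duplex(strand1_reads: int, strand2_reads: int,
              min_ab_reads: int = 1, min_ba_reads: int = 1) -> bool:
    """Return True if a tag family counts as a duplex under the thresholds.

    Assumption (not confirmed by the text above): the strand with more
    reads is labelled AB and the strand with fewer reads is labelled BA.
    """
    if min_ab_reads < min_ba_reads:
        raise ValueError("min_ab_reads must be >= min_ba_reads")
    ab = max(strand1_reads, strand2_reads)  # larger single-strand family
    ba = min(strand1_reads, strand2_reads)  # smaller single-strand family
    return ab >= min_ab_reads and ba >= min_ba_reads
```

With the default thresholds, a family with one read on each strand is a duplex, while a family with five reads on one strand and none on the other is not.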

Outputs

The following output files are produced:

**.family_sizes.txt**: metrics on the frequency of different types of families of different sizes
**.duplex_family_sizes.txt**: metrics on the frequency of duplex tag families by the number of observations from each strand
**.duplex_yield_metrics.txt**: summary QC metrics produced using 5%, 10%, 15%...100% of the data
**.umi_counts.txt**: metrics on the frequency of observations of UMIs within reads and tag families
**.duplex_qc.pdf**: a series of plots generated from the preceding metrics files for visualization
**.duplex_umi_counts.txt**: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced _if_ the `--duplex-umi-counts` option is used as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
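As a quick way to check that a run produced everything listed above, the expected file names can be derived from the output prefix. The helper below is a hypothetical sketch; it assumes each suffix is appended directly to the value given as the output prefix.

```python
from typing import List

def expected_outputs(prefix: str, duplex_umi_counts: bool = False) -> List[str]:
    """List the output paths described above for a given output prefix.

    Assumption: each suffix is appended directly to the prefix.
    """
    suffixes = [
        ".family_sizes.txt",
        ".duplex_family_sizes.txt",
        ".duplex_yield_metrics.txt",
        ".umi_counts.txt",
        ".duplex_qc.pdf",  # only produced when R and ggplot2 are available
    ]
    if duplex_umi_counts:
        # Optional file, written only when --duplex-umi-counts is used.
        suffixes.append(".duplex_umi_counts.txt")
    return [prefix + suffix for suffix in suffixes]
```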

Within the metrics files the prefixes CS, SS and DS are used to mean:

CS: tag families where membership is defined solely on matching genome coordinates and strand
SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI; i.e. 50/A and 50/B are considered different tag families
DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule; i.e. 50/A and 50/B become one family
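The three family types can be illustrated with a small Python sketch over toy reads. This is not how the tool is implemented. It assumes each read is reduced to a pair of values: an opaque string standing in for genome coordinates and strand, and an MI tag value of the form `50/A`, where the part before the slash identifies the source molecule. The name `family_sizes` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Toy read: (coordinates-and-strand key, MI tag value such as "50/A").
Read = Tuple[str, str]

def family_sizes(reads: Iterable[Read]) -> Dict[str, Counter]:
    """Count reads per CS, SS and DS family for a collection of toy reads."""
    cs: Counter = Counter()
    ss: Counter = Counter()
    ds: Counter = Counter()
    for coord_key, mi in reads:
        molecule_id = mi.split("/")[0]  # "50/A" -> "50"
        cs[coord_key] += 1    # CS: coordinates and strand only
        ss[mi] += 1           # SS: 50/A and 50/B stay separate
        ds[molecule_id] += 1  # DS: 50/A and 50/B collapse into 50
    return {"CS": cs, "SS": ss, "DS": ds}

reads = [
    ("chr1:100-250", "50/A"),
    ("chr1:100-250", "50/A"),
    ("chr1:100-250", "50/B"),
    ("chr1:100-250", "51/A"),
]
sizes = family_sizes(reads)
```

In this toy input all four reads share one CS family, the SS families are 50/A (two reads), 50/B (one read) and 51/A (one read), and the DS families are 50 (three reads) and 51 (one read).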

Requirements

For plots to be generated, R must be installed along with the ggplot2 package and its suggested dependencies. Running the following in R without errors will ensure a working installation:

install.packages("ggplot2", repos="http://cran.us.r-project.org", dependencies=TRUE)

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|---|---|---|---|---|---|---|
| input | i | PathToBam | Input BAM file generated by `GroupReadsByUmi`. | Required | 1 | |
| output | o | PathPrefix | Prefix of output files to write. | Required | 1 | |
| intervals | l | PathToIntervals | Optional set of intervals over which to restrict analysis. | Optional | 1 | |
| description | d | String | Description of data set used to label plots. Defaults to sample/library. | Optional | 1 | |
| duplex-umi-counts | u | Boolean | If true, produce the .duplex_umi_counts.txt file with counts of duplex UMI observations. | Optional | 1 | false |
| min-ab-reads | a | Int | Minimum AB reads to call a tag family a ‘duplex’. | Optional | 1 | 1 |
| min-ba-reads | b | Int | Minimum BA reads to call a tag family a ‘duplex’. | Optional | 1 | 1 |
| umi-tag | t | String | The tag containing the raw UMI. | Optional | 1 | RX |
| mi-tag | T | String | The output tag for UMI grouping. | Optional | 1 | MI |
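The arguments above can be assembled into a command line programmatically. The sketch below only builds the argument list and does not run anything. The launcher name `fgbio` and the `--name=value` form for the input, output and intervals arguments are assumptions (this page does not state how the tool is launched), and `build_command` is a hypothetical helper.

```python
from typing import List, Optional, Sequence

def build_command(input_bam: str,
                  output_prefix: str,
                  intervals: Optional[str] = None,
                  min_ab_reads: int = 1,
                  min_ba_reads: int = 1,
                  duplex_umi_counts: bool = False,
                  launcher: Sequence[str] = ("fgbio",)) -> List[str]:
    """Build an argument list for CollectDuplexSeqMetrics.

    Assumption: the tool is started as `<launcher> CollectDuplexSeqMetrics`;
    adjust `launcher` to match the local installation.
    """
    if min_ab_reads < min_ba_reads:
        raise ValueError("--min-ab-reads must be the higher value")
    cmd = list(launcher) + [
        "CollectDuplexSeqMetrics",
        f"--input={input_bam}",
        f"--output={output_prefix}",
        f"--min-ab-reads={min_ab_reads}",
        f"--min-ba-reads={min_ba_reads}",
    ]
    if intervals is not None:
        cmd.append(f"--intervals={intervals}")
    if duplex_umi_counts:
        cmd.append("--duplex-umi-counts")
    return cmd
```

The resulting list can be passed to `subprocess.run` once the launcher has been confirmed for the local environment.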