fgbio Metrics Descriptions

This page contains descriptions of all metrics produced by all fgbio tools. Within the descriptions the type of each field/column is given, including two commonly used types:

Count is an integer representing the count of some item
Proportion is a real number with a value between 0 and 1 representing a proportion or fraction

Metric Type	Description
Amplicon	A Locatable Amplicon class
AssessPhasingMetric	Metrics produced by `AssessPhasing` describing various statistics assessing the performance of phasing variants relative to a known set of phased variant calls
AssignPrimersMetric	Metrics produced by `AssignPrimers` that detail how many reads were assigned to a given primer and/or amplicon
CallOverlappingConsensusBasesMetric	Collects the the number of reads or bases that were examined, had overlap, and were corrected as part of the CallOverlappingConsensusBases tool
ClippingMetrics	Metrics produced by ClipBam that detail how many reads and bases are clipped respectively
ConsensusVariantReviewInfo	Detailed information produced by `ReviewConsensusVariants` on variants called in consensus reads
DuplexFamilySizeMetric	Metrics produced by `CollectDuplexSeqMetrics` to describe the distribution of double-stranded (duplex) tag families in terms of the number of reads observed on each strand
DuplexUmiMetric	Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed duplex UMI sequences and the frequency of their observations
DuplexYieldMetric	Metrics produced by `CollectDuplexSeqMetrics` that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics
ErccDetailedMetric	Metrics produced by `CollectErccMetrics` describing various per-transcript metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment
ErccSummaryMetrics	Metrics produced by `CollectErccMetrics` describing various summary metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment
ErrorRateByReadPositionMetric	Metrics produced by `ErrorRateByReadPosition` describing the number of base observations and substitution errors at each position within each sequencing read
FamilySizeMetric	Metrics produced by `CollectDuplexSeqMetrics` to quantify the distribution of different kinds of read family sizes
InsertSizeMetric	Metrics produced by `EstimateRnaSeqInsertSize` to describe the distribution of insert sizes within an RNA-seq experiment
PhaseBlockLengthMetric	Metrics produced by `AssessPhasing` describing the number of phased blocks of a given length
PoolingFractionMetric	Metrics produced by `EstimatePoolingFractions` to quantify the estimated proportion of a sample mixture that is attributable to a specific sample with a known set of genotypes
RunInfo	Stores the result of parsing the run info (RunInfo
SampleBarcodeMetric	Metrics for matching templates to sample barcodes primarily used in com
SwitchMetric	Summary metrics regarding switchback reads found
TagFamilySizeMetric	Metrics produced by `GroupReadsByUmi` to describe the distribution of tag family sizes observed during grouping
UmiCorrectionMetrics	Metrics produced by `CorrectUmis` regarding the correction of UMI sequences to a fixed set of known UMIs
UmiGroupingMetric	Metrics produced by `GroupReadsByUmi` to describe reads passed through UMI grouping
UmiMetric	Metrics produced by `CollectDuplexSeqMetrics` describing the set of observed UMI sequences and the frequency of their observations

Metric File Descriptions

Amplicon

A Locatable Amplicon class.

Column	Type	Description
chrom	String	The chromosome for the amplicon
left_start	Int	The 1-based start position of the left-most primer
left_end	Int	The 1-based end position inclusive of the left-most primer
right_start	Int	The 1-based start position of the right-most primer
right_end	Int	The 1-based end position inclusive of the right-most primer
id	Option[String]

AssessPhasingMetric

Metrics produced by AssessPhasing describing various statistics assessing the performance of phasing variants relative to a known set of phased variant calls. Included are methods for assessing sensitivity and accuracy from a number of previous papers (ex. http://dx.doi.org/10.1038%2Fng.3119).The N50, N90, and L50 statistics are defined as follows:

The N50 is the longest block length such that the bases covered by all blocks this length and longer are at least 50% of the # of bases covered by all blocks.
The N90 is the longest block length such that the bases covered by all blocks this length and longer are at least 90% of the # of bases covered by all blocks.
The L50 is the smallest number of blocks such that the sum of the lengths of the blocks is >= 50% of the sum of the lengths of all blocks.

Column	Type	Description
num_called	Long	The number of variants called.
num_phased	Long	The number of variants called with phase.
num_truth	Long	The number of variants with known truth genotypes.
num_truth_phased	Long	The number of variants with known truth genotypes with phase.
num_called_with_truth_phased	Long	The number of variants called that had a known phased genotype.
num_phased_with_truth_phased	Long	The number of variants called with phase that had a known phased genotype.
num_truth_phased_in_called_block	Long	The number of known phased variants that were in a called phased block.
num_both_phased_in_called_block	Long	The number of called phase variants that had a known phased genotype in a called phased block.
num_short_switch_errors	Long	The number of short switch errors (isolated switch errors).
num_long_switch_errors	Long	The number of long switch errors (# of runs of consecutive switch errors).
num_switch_sites	Long	The number of sites that could be (short or long) switch errors (i.e. the # of sites with both known and called phased variants).
num_illumina_point_switch_errors	Long	The number of point switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).
num_illumina_long_switch_errors	Long	The number of long switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).
num_illumina_switch_sites	Long	The number of sites that could be (point or long) switch errors (defined in http://dx.doi.org/10.1038%2Fng.3119).
frac_phased	Double	The fraction of called variants with phase.
frac_phased_with_truth_phased	Double	The fraction of known phased variants called with phase.
frac_truth_phased_in_called_block	Double	The fraction of phased known genotypes in a called phased block.
frac_phased_with_truth_phased_in_called_block	Double	The fraction of called phased variants that had a known phased genotype in a called phased block.
short_accuracy	Double	The fraction of switch sites without short switch errors (`1 - (num_short_switch_errors / num_switch_sites)`).
long_accuracy	Double	The fraction of switch sites without long switch errors (`1 - (num_long_switch_errors / num_switch_sites)`).
illumina_point_accuracy	Double	The fraction of switch sites without point switch errors according to the Illumina method defining switch sites and errors (`1 - (num_illumina_point_switch_errors / num_illumina_switch_sites )`).
illumina_long_accuracy	Double	The fraction of switch sites wihtout long switch errors according to the Illumina method defining switch sites and errors (`1 - (num_illumina_long_switch_errors / num_illumina_switch_sites )`).
mean_called_block_length	Double	The mean phased block length in the callset.
median_called_block_length	Double	The median phased block length in the callset.
stddev_called_block_length	Double	The standard deviation of the phased block length in the callset.
n50_called_block_length	Double	The N50 of the phased block length in the callset.
n90_called_block_length	Double	The N90 of the phased block length in the callset.
l50_called	Double	The L50 of the phased block length in the callset.
mean_truth_block_length	Double	The mean phased block length in the truth.
median_truth_block_length	Double	The median phased block length in the truth.
stddev_truth_block_length	Double	The standard deviation of the phased block length in the truth.
n50_truth_block_length	Double	The N50 of the phased block length in the truth.
n90_truth_block_length	Double	The N90 of the phased block length in the callset.
l50_truth	Double	The L50 of the phased block length in the callset.

AssignPrimersMetric

Metrics produced by AssignPrimers that detail how many reads were assigned to a given primer and/or amplicon.

Column	Type	Description
identifier	String	The amplicon identifier this metric collects over
left	Long	The number of reads assigned to the left primer
right	Long	The number of reads assigned to the right primer
r1s	Long	The number of R1 reads assigned to this amplicon
r2s	Long	The number of R2 reads assigned to this amplicon
pairs	Long	The number of read pairs where R1 and R2 are both assigned to the this amplicon and are in FR orientation
frac_left	Double	The fraction of reads assigned to the left primer
frac_right	Double	The fraction of reads assigned to the right primer
frac_r1s	Double	The fraction of R1s reads assigned to this amplicon
frac_r2s	Double	The fraction of R2s reads assigned to this amplicon
frac_pairs	Double	The fraction of read pairs where R1 and R2 are both assigned to the this amplicon and are in FR orientation

CallOverlappingConsensusBasesMetric

Collects the the number of reads or bases that were examined, had overlap, and were corrected as part of the CallOverlappingConsensusBases tool.

Column	Type	Description
kind	CountKind	Template if the counts are per template, bases if counts are in units of bases.
total	Long	The total number of templates (bases) examined
overlapping	Long	The total number of templates (bases) that were overlapping
corrected	Long	The total number of templates (bases) that were corrected.

ClippingMetrics

Metrics produced by ClipBam that detail how many reads and bases are clipped respectively.

Column	Type	Description
read_type	ReadType	The type of read (i.e. Fragment, ReadOne, ReadTwo).
reads	Long	The number of reads examined.
reads_unmapped	Long	The number of reads that became unmapped due to clipping.
reads_clipped_pre	Long	The number of reads with any type of clipping prior to clipping with ClipBam.
reads_clipped_post	Long	The number of reads with any type of clipping after clipping with ClipBam, including reads that became unmapped.
reads_clipped_five_prime	Long	The number of reads with the 5’ end clipped.
reads_clipped_three_prime	Long	The number of reads with the 3’ end clipped.
reads_clipped_overlapping	Long	The number of reads clipped due to overlapping reads.
reads_clipped_extending	Long	The number of reads clipped due to a read extending past its mate.
bases	Long	The number of aligned bases after clipping.
bases_clipped_pre	Long	The number of bases clipped prior to clipping with ClipBam.
bases_clipped_post	Long	The number of bases clipped after clipping with ClipBam, including bases from reads that became unmapped.
bases_clipped_five_prime	Long	The number of bases clipped on the 5’ end of the read.
bases_clipped_three_prime	Long	The number of bases clipped on the 3 end of the read.
bases_clipped_overlapping	Long	The number of bases clipped due to overlapping reads.
bases_clipped_extending	Long	The number of bases clipped due to a read extending past its mate.

ConsensusVariantReviewInfo

Detailed information produced by ReviewConsensusVariants on variants called in consensus reads. Each row contains information about a consensus read that carried a variant or non-reference allele at a particular variant site.The first 10 columns (up to N) contain information about the variant site and are repeated for each consensus read reported at that site. The remaining fields are specific to the consensus read.

Column	Type	Description
chrom	String	The chromosome on which the variant exists.
pos	Int	The position of the variant.
ref	String	The reference allele at the position.
genotype	String	The genotype of the sample in question.
filters	String	The set of filters applied to the variant in the VCF.
A	Int	The count of A observations at the variant locus across all consensus reads.
C	Int	The count of C observations at the variant locus across all consensus reads.
G	Int	The count of G observations at the variant locus across all consensus reads.
T	Int	The count of T observations at the variant locus across all consensus reads.
N	Int	The count of N observations at the variant locus across all consensus reads.
consensus_read	String	The consensus read name for which the following fields contain values.
consensus_insert	String	A description of the insert that generated the consensus read.
consensus_call	Char	The base call from the consensus read.
consensus_qual	Int	The quality score from the consensus read.
a	Int	The number of As in raw-reads contributing to the consensus base call at the variant site.
c	Int	The number of Cs in raw-reads contributing to the consensus base call at the variant site.
g	Int	The number of Gs in raw-reads contributing to the consensus base call at the variant site.
t	Int	The number of Ts in raw-reads contributing to the consensus base call at the variant site.
n	Int	The number of Ns in raw-reads contributing to the consensus base call at the variant site.

DuplexFamilySizeMetric

Metrics produced by CollectDuplexSeqMetrics to describe the distribution of double-stranded (duplex) tag families in terms of the number of reads observed on each strand.We refer to the two strands as ab and ba because we identify the two strands by observing the same pair of UMIs (A and B) in opposite order (A->B vs B->A). Which strand is ab and which is ba is largely arbitrary, so to make interpretation of the metrics simpler we use a definition here that for a given tag family ab is the sub-family with more reads and ba is the tag family with fewer reads.

Column	Type	Description
ab_size	Int	The number of reads in the `ab` sub-family (the larger sub-family) for this double-strand tag family.
ba_size	Int	The number of reads in the `ba` sub-family (the smaller sub-family) for this double-strand tag family.
count	Count	The number of families with the `ab` and `ba` single-strand families of size `ab_size` and `ba_size`.
fraction	Proportion	The fraction of all double-stranded tag families that have `ab_size` and `ba_size`.
fraction_gt_or_eq_size	Proportion	The fraction of all double-stranded tag families that have `ab reads >= ab_size` and `ba reads >= ba_size`.

DuplexUmiMetric

Metrics produced by CollectDuplexSeqMetrics describing the set of observed duplex UMI sequences and the frequency of their observations. The UMI sequences reported may have been corrected using information within a double-stranded tag family. For example if a tag family is comprised of three read pairs with UMIs ACGT-TGGT, ACGT-TGGT, and ACGT-TGGG then a consensus UMI of ACGT-TGGT will be generated.UMI pairs are normalized within a tag family so that observations are always reported as if they came from a read pair with read 1 on the positive strand (F1R2). Another way to view this is that for FR or RF read pairs, the duplex UMI reported is the UMI from the positive strand read followed by the UMI from the negative strand read. E.g. a read pair with UMI AAAA-GGGG and with R1 on the negative strand and R2 on the positive strand, will be reported as GGGG-AAAA.

Column	Type	Description
umi	String	The duplex UMI sequence, possibly-corrected.
raw_observations	Count	The number of read pairs in the input BAM that observe the duplex UMI (after correction).
raw_observations_with_errors	Count	The subset of raw observations that underwent any correction.
unique_observations	Count	The number of double-stranded tag families (i.e unique double-stranded molecules) that observed the duplex UMI.
fraction_raw_observations	Proportion	The fraction of all raw observations that the duplex UMI accounts for.
fraction_unique_observations	Proportion	The fraction of all unique observations that the duplex UMI accounts for.
fraction_unique_observations_expected	Proportion	The fraction of all unique observations that are expected to be attributed to the duplex UMI based on the `fraction_unique_observations` of the two individual UMIs.

DuplexYieldMetric

Metrics produced by CollectDuplexSeqMetrics that are sampled at various levels of coverage, via random downsampling, during the construction of duplex metrics. The downsampling is done in such a way that the fractions are approximate, and not exact, therefore the fraction field should only be interpreted as a guide and the read_pairs field used to quantify how much data was used.See FamilySizeMetric for detailed definitions of CS, SS and DS as used below.

Column	Type	Description
fraction	Proportion	The approximate fraction of the full dataset that was used to generate the remaining values.
read_pairs	Count	The number of read pairs upon which the remaining metrics are based.
cs_families	Count	The number of CS (Coordinate & Strand) families present in the data.
ss_families	Count	The number of SS (Single-Strand by UMI) families present in the data.
ds_families	Count	The number of DS (Double-Strand by UMI) families present in the data.
ds_duplexes	Count	The number of DS families that had the minimum number of observations on both strands to be called duplexes (default = 1 read on each strand).
ds_fraction_duplexes	Proportion	The fraction of DS families that are duplexes (`ds_duplexes / ds_families`).
ds_fraction_duplexes_ideal	Proportion	The fraction of DS families that should be duplexes under an idealized model where each strand, `A` and `B`, have equal probability of being sampled, given the observed distribution of DS family sizes.

ErccDetailedMetric

Metrics produced by CollectErccMetrics describing various per-transcript metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment. One metric per ERCC transcript will be present.

Column	Type	Description
name	String	The name (or ID) of the ERCC transcript.
concentration	Double	The expected concentration as input to `CollectErccMetrics`.
count	Long	The observed count of the number of read pairs (or single end reads) .
normalized_count	Double	The observed count of the number of read pairs (or single end reads) normalized by the ERCC transcript length.

ErccSummaryMetrics

Metrics produced by CollectErccMetrics describing various summary metrics related to the spike-in of ERCC (External RNA Controls Consortium) into an RNA-Seq experiment.The correlation coefficients and linear regression are calculated based on the log2 observed read pair count normalized by ERCC transcript length versus the log2 expected concentration.

Column	Type	Description
total_reads	Long	The total number of reads considered.
ercc_reads	Long	The total number of reads mapping to an ERCC transcript.
fraction_ercc_reads	Double	The fraction of total reads that map to an ERCC transcript.
ercc_templates	Long	The total number of read pairs (or single end reads) mapping to an ERCC transcript.
total_transcripts	Int	The total number of ERCC transcripts with at least one read observed.
passing_filter_transcripts	Int	The total number of ERCC transcripts with at least the user-set minimum # of reads observed.
pearsons_correlation	Option[Double]	Pearson’s correlation coefficient for correlation of concentration and normalized counts.
spearmans_correlation	Option[Double]	Spearman’s correlation coefficient for correlation of concentration and normalized counts.
intercept	Option[Double]	The intercept of the linear regression.
slope	Option[Double]	The slope of the linear regression.
r_squared	Option[Double]	The r-squared of the linear regression.

ErrorRateByReadPositionMetric

Metrics produced by ErrorRateByReadPosition describing the number of base observations and substitution errors at each position within each sequencing read. Error rates are given for the overall substitution error rate and also for each kind of substitution separately.If collapsed is true, then complementary substitutions are grouped together into the first 6 error rates. e.g. T>G substitutions are reported as A>C. Otherwise, all 12 substitution rates are reported.

Column	Type	Description
read_number	Int	The read number (0 for fragments, 1 for first of pair, 2 for second of pair).
position	Int	The position or cycle within the read (1-based).
bases_total	Count	The total number of bases observed at this position.
errors	Count	The total number of errors or non-reference basecalls observed at this position.
error_rate	Double	The overall error rate at position.
a_to_c_error_rate	Double	The rate of `A>C` (and `T>G` when collapsed) errors at the position.
a_to_g_error_rate	Double	The rate of `A>G` (and `T>C` when collapsed) errors at the position.
a_to_t_error_rate	Double	The rate of `A>T` (and `T>A` when collapsed) errors at the position.
c_to_a_error_rate	Double	The rate of `C>A` (and `G>T` when collapsed) errors at the position.
c_to_g_error_rate	Double	The rate of `C>G` (and `G>C` when collapsed) errors at the position.
c_to_t_error_rate	Double	The rate of `C>T` (and `G>A` when collapsed) errors at the position.
g_to_a_error_rate	Option[Double]	The rate of `G>A` errors at the position.
g_to_c_error_rate	Option[Double]	The rate of `G>C` errors at the position.
g_to_t_error_rate	Option[Double]	The rate of `G>T` errors at the position.
t_to_a_error_rate	Option[Double]	The rate of `T>A` errors at the position.
t_to_c_error_rate	Option[Double]	The rate of `T>C` errors at the position.
t_to_g_error_rate	Option[Double]	The rate of `T>T` errors at the position.
collapsed	Boolean

FamilySizeMetric

Metrics produced by CollectDuplexSeqMetrics to quantify the distribution of different kinds of read family sizes. Three kinds of families are described:1. CS or Coordinate & Strand: families of reads that are grouped together by their unclipped 5’ genomic positions and strands just as they are in traditional PCR duplicate marking

SS or Single Strand: single-strand families that are each subsets of a CS family create by also using the UMIs to partition the larger family, but not linking up families that are created from opposing strands of the same source molecule.
DS or Double Strand: families that are created by combining single-strand families that are from opposite strands of the same source molecule. This does not imply that all DS families are composed of reads from both strands; where only one strand of a source molecule is observed a DS family is still counted.

Column	Type	Description
family_size	Int	The family size, i.e. the number of read pairs grouped together into a family.
cs_count	Count	The count of families with `size == family_size` when grouping just by coordinates and strand information.
cs_fraction	Proportion	The fraction of all CS families where `size == family_size`.
cs_fraction_gt_or_eq_size	Proportion	The fraction of all CS families where `size >= family_size`.
ss_count	Count	The count of families with `size == family_size` when also grouping by UMI to create single-strand families.
ss_fraction	Proportion	The fraction of all SS families where `size == family_size`.
ss_fraction_gt_or_eq_size	Proportion	The fraction of all SS families where `size >= family_size`.
ds_count	Count	The count of families with `size == family_size`when also grouping by UMI and merging single-strand families from opposite strands of the same source molecule.
ds_fraction	Proportion	The fraction of all DS families where `size == family_size`.
ds_fraction_gt_or_eq_size	Proportion	The fraction of all DS families where `size >= family_size`.

InsertSizeMetric

Metrics produced by EstimateRnaSeqInsertSize to describe the distribution of insert sizes within an RNA-seq experiment. The insert sizes are computed in “transcript space”, accounting for spliced alignments, in order to get a true estimate of the size of the DNA fragment, not just it’s span on the genome.

Column	Type	Description
pair_orientation	PairOrientation	The orientation of the reads within a read-pair relative to each other. Possible values are FR, RF and TANDEM.
read_pairs	Long	The number of read pairs observed with the `pair_orientation`.
mean	Double	The mean insert size of the read pairs.
standard_deviation	Double	The standard deviation of the insert size of the read pairs.
median	Double	The median insert size of the read pairs.
min	Long	The minimum observed insert size of the read pairs.
max	Long	The maximum observed insert size of the read pairs.
median_absolute_deviation	Double	The median absolution deviation of the read pairs.

PhaseBlockLengthMetric

Metrics produced by AssessPhasing describing the number of phased blocks of a given length. The output will have multiple rows, one for each observed phased block length.

Column	Type	Description
dataset	String	The name of the dataset being assessed (i.e. “truth” or “called”).
length	Long	The length of the phased block.
count	Long	The number of phased blocks of the given length.

PoolingFractionMetric

Metrics produced by EstimatePoolingFractions to quantify the estimated proportion of a sample mixture that is attributable to a specific sample with a known set of genotypes.

Column	Type	Description
sample	String	The name of the sample within the pool being reported on.
variant_sites	Count	How many sites were examined at which the reported sample is known to be variant.
singletons	Count	How many of the variant sites were sites at which only this sample was variant.
estimated_fraction	Proportion	The estimated fraction of the pool that comes from this sample.
standard_error	Double	The standard error of the estimated fraction.
ci99_low	Proportion	The lower bound of the 99% confidence interval for the estimated fraction.
ci99_high	Proportion	The upper bound of the 99% confidence interval for the estimated fraction.

RunInfo

Stores the result of parsing the run info (RunInfo.xml) file from an Illumina run folder.

Column	Type	Description
run_barcode	String	The unique identifier for the sequencing run and flowcell, stored as “_".
flowcell_barcode	String	The flowcell barcode.
instrument_name	String	The instrument name.
run_date	Iso8601Date	The date of the sequencing run.
read_structure	ReadStructure	The description of the logical structure of cycles within the sequencing run. This will only contain template and sample barcode segments, as the RunInfo.xml does not contain information about other segments (i.e. molecular barcodes and skips).
num_lanes	Int	The number of lanes in the flowcell.

SampleBarcodeMetric

Metrics for matching templates to sample barcodes primarily used in com.fulcrumgenomics.fastq.DemuxFastqs.The number of templates will match the number of reads for an Illumina single-end sequencing run, while the number of templates will be half the number of reads for an Illumina paired-end sequencing run (i.e. R1 & R2 observe the same template).

Column	Type	Description
barcode_name	String	The name for the sample barcode, typically the sample name from the SampleSheet.
library_name	String	The name of the library, typically the library identifier from the SampleSheet.
barcode	String	The sample barcode bases. Dual index barcodes will have two sample barcode sequences delimited by a dash.
templates	Count	The total number of templates matching the given barcode.
pf_templates	Count	The total number of pass-filter templates matching the given barcode.
perfect_matches	Count	The number of templates that match perfectly the given barcode.
pf_perfect_matches	Count	The number of pass-filter templates that match perfectly the given barcode.
one_mismatch_matches	Count	The number of pass-filter templates that match the given barcode with exactly one mismatch.
pf_one_mismatch_matches	Count	The number of pass-filter templates that match the given barcode with exactly one mismatch.
q20_bases	Count	The number of bases in a template with a quality score 20 or above
q30_bases	Count	The number of bases in a template with a quality score 30 or above
total_number_of_bases	Count	The total number of bases in the templates combined
fraction_matches	Proportion	The fraction of all templates that match the given barcode.
ratio_this_barcode_to_best_barcode	Proportion	The rate of all templates matching this barcode to all template reads matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible exception of when there are more unmatched templates than for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.
pf_fraction_matches	Proportion	The fraction of all pass-filter templates that match the given barcode.
pf_ratio_this_barcode_to_best_barcode	Proportion	The rate of all pass-filter templates matching this barcode to all templates matching the most prevalent barcode. For the most prevalent barcode this will be 1, for all others it will be less than 1 (except for the possible exception of when there are more unmatched templates than for any other barcode, in which case the value may be arbitrarily large). One over the lowest number in this column gives you the fold-difference in representation between barcodes.
pf_normalized_matches	Proportion	The “normalized” matches to each barcode. This is calculated as the number of pass-filter templates matching this barcode over the mean of all pass-filter templates matching any barcode (excluding unmatched). If all barcodes are represented equally this will be
frac_q20_bases	Proportion	The fraction of bases in a template with a quality score 20 or above
frac_q30_bases	Proportion	The fraction of bases in a template with a quality score 30 or above

SwitchMetric

Summary metrics regarding switchback reads found.

Column	Type	Description
sample	String	The name of the sample sequenced.
library	String	The name of the library sequenced.
templates	Count	The total number of templates (i.e. inserts, unique read names) seen in the input.
aligned_templates	Count	The number of templates that had at least one aligned read.
switchback_templates	Count	The number of templates identified as having a switchback event.
fraction_switchbacks	Proportion	The fraction of all templates that appear to have switchbacks in them.
read_based_switchbacks	Count	The count of switchback_templates that were identified by looking for soft-clipped reverse complementary sequence at the ends of reads.
mean_length	Double	The mean length of the reverse complementary sequence in the `read_based_switchbacks`.
mean_offset	Double	The mean offset of the `read_based_switchbacks`.
tandem_based_switchbacks	Count	The count of switchback_templates that were identified because they had paired reads mapped in FF or RR orientations within the designated gap size range.
mean_gap	Double	The mean gap size for `tandem_based_switchbacks`

TagFamilySizeMetric

Metrics produced by GroupReadsByUmi to describe the distribution of tag family sizes observed during grouping.

Column	Type	Description
family_size	Int	The family size, or number of templates/read-pairs belonging to the family.
count	Count	The number of families (or source molecules) observed with `family_size` observations.
fraction	Proportion	The fraction of all families of all sizes that have this specific `family_size`.
fraction_gt_or_eq_family_size	Proportion	The fraction of all families that have `>= family_size`.

UmiCorrectionMetrics

Metrics produced by CorrectUmis regarding the correction of UMI sequences to a fixed set of known UMIs.

Column	Type	Description
umi	String	The corrected UMI sequence (or all `N`s for unmatched).
total_matches	Count	The number of UMI sequences that matched/were corrected to this UMI.
perfect_matches	Count	The number of UMI sequences that were perfect matches to this UMI.
one_mismatch_matches	Count	The number of UMI sequences that matched with a single mismatch.
two_mismatch_matches	Count	The number of UMI sequences that matched with two mismatches.
other_matches	Count	The number of UMI sequences that matched with three or more mismatches.
fraction_of_matches	Proportion	The fraction of all UMIs that matched or were corrected to this UMI.
representation	Double	The `total_matches` for this UMI divided by the mean `total_matches` for all UMIs.

UmiGroupingMetric

Metrics produced by GroupReadsByUmi to describe reads passed through UMI grouping

Column	Type	Description
accepted_sam_records	Long	The number of SAM records accepted for grouping.
discarded_non_pf	Long	The number of discarded non-PF SAM records.
discarded_poor_alignment	Long	The number of SAM records discarded for poor alignment.
discarded_ns_in_umi	Long	The number of SAM records discarded due to one or more Ns in the UMI.
discarded_umis_to_short	Long	The number of SAM records discarded due to a shorter than expected UMI.

UmiMetric

Metrics produced by CollectDuplexSeqMetrics describing the set of observed UMI sequences and the frequency of their observations. The UMI sequences reported may have been corrected using information within a double-stranded tag family. For example if a tag family is comprised of three read pairs with UMIs ACGT-TGGT, ACGT-TGGT, and ACGT-TGGG then a consensus UMI of ACGT-TGGT will be generated, and three raw observations counted for each of ACGT and TGGT, and no observations counted for TGGG.

Column	Type	Description
umi	String	The UMI sequence, possibly-corrected.
raw_observations	Count	The number of read pairs in the input BAM that observe the UMI (after correction).
raw_observations_with_errors	Count	The subset of raw-observations that underwent any correction.
unique_observations	Count	The number of double-stranded tag families (i.e unique double-stranded molecules) that observed the UMI.
fraction_raw_observations	Proportion	The fraction of all raw observations that the UMI accounts for.
fraction_unique_observations	Proportion	The fraction of all unique observations that the UMI accounts for.