fgbio

Tools for working with genomic and high throughput sequencing data.

DownsampleAndNormalizeBam

Overview

Group: SAM/BAM

Downsamples a BAM in a biased way to a uniform coverage across regions.

Attempts to downsample a BAM such that every base in the genome (or in the target regions, if provided) is covered by at least the requested number of reads (the coverage argument). When computing coverage:

Reads are first sorted into a random order (by hashing read names). Reads are then consumed one template at a time, and if any read adds coverage to a base that is under the target coverage, all reads (including secondary, unmapped, etc.) for that template are emitted into the output.
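The procedure described above can be sketched as a toy model in Python. This is an illustration under stated assumptions, not fgbio's implementation: reads are hypothetical `(name, start, end, mapq)` tuples with half-open coordinates, the seeded MD5 hash stands in for whatever hash the tool uses, and the way `downsample` counts depth (one count per read per base, with no special handling of overlapping mates or clipping) is assumed for simplicity.

```python
import hashlib
from collections import defaultdict


def downsample(reads, target, seed=42, min_map_q=0):
    """Toy greedy downsampler (illustrative only, not fgbio's code).

    reads: iterable of (name, start, end, mapq) with half-open [start, end).
    Returns (kept_template_names, per_base_depth).
    """
    # Group reads by template, i.e. by read name.
    templates = defaultdict(list)
    for name, start, end, mapq in reads:
        templates[name].append((start, end, mapq))

    # Reproducible pseudo-random order: sort names by a seeded hash.
    def order_key(name):
        return hashlib.md5(f"{seed}:{name}".encode()).hexdigest()

    depth = defaultdict(int)
    kept = []
    for name in sorted(templates, key=order_key):
        # Only reads at or above the mapping-quality threshold count as covering.
        counting = [(s, e) for s, e, q in templates[name] if q >= min_map_q]
        adds_coverage = any(
            depth[pos] < target for s, e in counting for pos in range(s, e)
        )
        if adds_coverage:
            # The whole template is kept, not just the read that helped.
            kept.append(name)
            for s, e in counting:
                for pos in range(s, e):
                    depth[pos] += 1
    return kept, depth
```

With ten identical templates over the same five bases and a target of 3, this sketch keeps exactly three templates regardless of the hash order, because the fourth and later templates no longer add coverage to any under-covered base.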

Given the procedure used for downsampling, it is likely the output BAM will have coverage up to 2X the requested coverage at regions in the input BAM that are i) well covered and ii) close to regions that are poorly covered.
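A small worked example may help show where the overshoot comes from. The sketch below is a simplified model, not the tool's code: each read is a hypothetical half-open `(start, end)` interval, one read per template, and the reads are supplied already in consumption order. Reads that straddle the edge of a saturated region keep being accepted because they still add coverage to the thinly covered bases next to it, which pushes the depth on the saturated side above the target.

```python
def greedy_depth(reads, target):
    """Accept a read if any base it spans is below target; return per-base depth."""
    depth = {}
    for start, end in reads:
        span = range(start, end)
        if any(depth.get(pos, 0) < target for pos in span):
            for pos in span:
                depth[pos] = depth.get(pos, 0) + 1
    return depth


# Two reads saturate bases 0-9; two straddling reads are still accepted
# because bases 10-14 are under target; the final read adds nothing new.
reads = [(0, 10), (0, 10), (5, 15), (5, 15), (0, 10)]
depth = greedy_depth(reads, target=2)
print(depth[0], depth[7], depth[12])  # 2 4 2
```

Here bases 5-9 end up at depth 4, twice the target of 2, while bases on either side sit exactly at the target.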

Arguments

| Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
|------|------|------|-------------|-----------|-----------------|------------------|
| input | i | PathToBam | Input SAM or BAM file. | Required | 1 | |
| output | o | PathToBam | Output SAM or BAM file. | Required | 1 | |
| coverage | c | Int | Desired minimum coverage. | Required | 1 | |
| min-map-q | m | Int | Minimum mapping quality to count a read as covering. | Optional | 1 | 0 |
| seed | s | Int | Random seed to use when randomizing order of reads/templates. | Optional | 1 | 42 |
| regions | l | PathToIntervals | Optional set of regions for coverage targeting. | Optional | 1 | |
| max-in-memory | M | Int | Maximum records to be held in memory while sorting. | Optional | 1 | 1000000 |
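As a usage sketch, the arguments above can be assembled into a command line from Python. The helper `build_command` is hypothetical, the file names are placeholders, and the launcher name `fgbio` is an assumption about how the tool is installed (some setups invoke the jar through `java -jar` instead, in which case pass a different `launcher`). Long option names are taken from the Name column.

```python
import shlex


def build_command(input_bam, output_bam, coverage, min_map_q=None, seed=None,
                  regions=None, max_in_memory=None, launcher=("fgbio",)):
    """Build an argument list for DownsampleAndNormalizeBam (hypothetical helper)."""
    cmd = [*launcher, "DownsampleAndNormalizeBam",
           "--input", str(input_bam),
           "--output", str(output_bam),
           "--coverage", str(coverage)]
    optional = {
        "--min-map-q": min_map_q,
        "--seed": seed,
        "--regions": regions,
        "--max-in-memory": max_in_memory,
    }
    # Only pass optional arguments that were set; otherwise the tool's defaults apply.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command("in.bam", "out.bam", 30, regions="targets.interval_list")
print(shlex.join(cmd))
# fgbio DownsampleAndNormalizeBam --input in.bam --output out.bam --coverage 30 --regions targets.interval_list
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` without going through a shell.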