Tools for working with genomic and high throughput sequencing data.
Group: FASTQ
Generates an unmapped BAM (or SAM or CRAM) file from fastq files. Takes in one or more fastq files (optionally gzipped), each representing a different sequencing read (e.g. R1, R2, I1 or I2) and can use a set of read structures to allocate bases in those reads to template reads, sample indices, unique molecular indices, or to designate bases to be skipped over.
Read structures are made up of <number><operator> pairs, much like the CIGAR string in BAM files. Four kinds of operators are recognized:

- T identifies a template read
- B identifies a sample barcode read
- M identifies a unique molecular index read
- S identifies a set of bases that should be skipped or ignored

The last <number><operator> pair may be specified using a + sign instead of a number to denote “all remaining bases”. This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. For example, to convert a paired-end run with an index read, where the first 5 bases of R1 are a UMI and the second five bases are monotemplate, you might specify:
--input r1.fq r2.fq i1.fq --read-structures 5M5S+T +T +B
Alternatively, if you know your reads are of fixed length, you could specify:
--input r1.fq r2.fq i1.fq --read-structures 5M5S65T 75T 8B
For more information on read structures, see the Read Structure Wiki Page.
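To make the segment rules concrete, here is a minimal Python sketch of how a read structure could be split into segments and applied to one read's bases. It is not fgbio's implementation; the names `parse_read_structure` and `split_read` are hypothetical, and the validation covers only what is described above (the four operators, and `+` allowed only on the last segment).

```python
import re
from typing import List, Optional, Tuple

# One segment: a number or '+', followed by one of the four operators.
SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure: str) -> List[Tuple[Optional[int], str]]:
    """Split e.g. '5M5S+T' into (length, operator) pairs.

    A length of None stands for '+', meaning "all remaining bases".
    """
    segments: List[Tuple[Optional[int], str]] = []
    pos = 0
    while pos < len(structure):
        match = SEGMENT.match(structure, pos)
        if match is None:
            raise ValueError(f"invalid read structure: {structure!r}")
        length = None if match.group(1) == "+" else int(match.group(1))
        segments.append((length, match.group(2)))
        pos = match.end()
    if not segments:
        raise ValueError("empty read structure")
    # '+' may only replace the number of the last pair.
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return segments


def split_read(bases: str, structure: str) -> List[Tuple[str, str]]:
    """Allocate the bases of one read to (operator, bases) segments."""
    result = []
    offset = 0
    for length, operator in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        if end > len(bases):
            raise ValueError("read is shorter than its read structure")
        result.append((operator, bases[offset:end]))
        offset = end
    return result


# 5 UMI bases, 5 skipped bases, then template for the rest of the read.
print(split_read("ACGTAGGGGGTTTTCCCC", "5M5S+T"))
```

With the `5M5S+T` structure from the example, the first five bases are assigned to the UMI, the next five are skipped, and everything left becomes template regardless of read length.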
UMIs may be extracted from the read sequences, the read names, or both. If --extract-umis-from-read-names is specified, any UMIs present in the read names are extracted; read names are expected to be colon (:) separated, with any UMIs present in the 8th field. If this option is specified, the --umi-qual-tag option may not be used, as qualities are not available for UMIs in the read name. If UMI segments are present in the read structures, those will also be extracted. If UMIs are present in both, the final UMIs are constructed by first taking the UMIs from the read names, then adding a hyphen, then the UMIs extracted from the reads.
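The combination rule can be illustrated with a short Python sketch. This is an illustration of the behaviour described above, not fgbio's code: the function names are hypothetical, the read name in the example is made up, and dropping anything after the first whitespace in the name is an assumption.

```python
from typing import Optional


def umi_from_read_name(read_name: str) -> Optional[str]:
    """Return the 8th colon-separated field of a read name, if present."""
    # Assumption: anything after the first whitespace is a comment, not the name.
    name = read_name.split()[0] if read_name.strip() else ""
    fields = name.split(":")
    if len(fields) < 8 or not fields[7]:
        return None
    return fields[7]


def combine_umis(read_name: str, umi_from_reads: Optional[str]) -> Optional[str]:
    """Join the read-name UMI and the read-derived UMI with a hyphen.

    The read-name UMI comes first; either part may be missing.
    """
    parts = [umi for umi in (umi_from_read_name(read_name), umi_from_reads) if umi]
    return "-".join(parts) if parts else None


# Hypothetical read name with a UMI in the 8th field.
print(combine_umis("INST:42:FC:1:1101:1000:2000:ACGTAC", "TTGCA"))
```

When only one source provides a UMI, the sketch returns that UMI on its own, with no hyphen.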
The same number of input files and read structures must be provided, with one exception: if supplying exactly 1 or 2 fastq files, both of which are solely template reads, no read structures need be provided.
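The counting rule can be expressed as a small pre-flight check. The sketch below is a hedged illustration only: `resolve_read_structures` is a hypothetical helper, and defaulting to `+T` for template-only inputs is an assumption about how the exception could be modelled, not a statement of what the tool does internally.

```python
from typing import List, Sequence


def resolve_read_structures(fastqs: Sequence[str],
                            read_structures: Sequence[str]) -> List[str]:
    """Check that inputs and read structures pair up one-to-one.

    If no read structures are given for 1 or 2 FASTQs, assume each file
    is solely template and model that as '+T' (an assumption).
    """
    if not fastqs:
        raise ValueError("at least one FASTQ file is required")
    if not read_structures:
        if len(fastqs) in (1, 2):
            return ["+T"] * len(fastqs)
        raise ValueError("read structures are required for more than 2 FASTQs")
    if len(read_structures) != len(fastqs):
        raise ValueError(
            f"{len(fastqs)} FASTQ file(s) but {len(read_structures)} read structure(s)"
        )
    return list(read_structures)


print(resolve_read_structures(["r1.fq", "r2.fq"], []))
print(resolve_read_structures(["r1.fq", "r2.fq", "i1.fq"], ["5M5S+T", "+T", "+B"]))
```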
The output file can be sorted by queryname using the --sort option; the default is to produce a BAM with reads in the same order as they appear in the fastq file.
Name | Flag | Type | Description | Required? | Max # of Values | Default Value(s) |
---|---|---|---|---|---|---|
input | i | PathToFastq | Fastq files corresponding to each sequencing read (e.g. R1, I1, etc.). | Required | Unlimited | |
output | o | PathToBam | The output SAM or BAM file to be written. | Required | 1 | |
read-structures | r | ReadStructure | Read structures, one for each of the FASTQs. | Optional | Unlimited | |
sort | s | Boolean | If true, queryname sort the BAM file, otherwise preserve input order. | Optional | 1 | false |
umi-tag | u | String | Tag in which to store molecular barcodes/UMIs. | Optional | 1 | RX |
umi-qual-tag | q | String | Tag in which to store molecular barcode/UMI qualities. | Optional | 1 | |
store-sample-barcode-qualities | Q | Boolean | Store the sample barcode qualities in the QT Tag. | Optional | 1 | false |
extract-umis-from-read-names | n | Boolean | Extract UMI(s) from read names and prepend to UMIs from reads. | Optional | 1 | false |
read-group-id | | String | Read group ID to use in the file header. | Optional | 1 | A |
sample | | String | The name of the sequenced sample. | Required | 1 | |
library | | String | The name/ID of the sequenced library. | Required | 1 | |
barcode | b | String | Library or Sample barcode sequence. | Optional | 1 | |
platform | | String | Sequencing Platform. | Optional | 1 | illumina |
platform-unit | | String | Platform unit (e.g. ‘…’) | Optional | 1 | |
platform-model | | String | Platform model to insert into the group header (e.g. miseq, hiseq2500, hiseqX) | Optional | 1 | |
sequencing-center | | String | The sequencing center from which the data originated | Optional | 1 | |
predicted-insert-size | | Integer | Predicted median insert size, to insert into the read group header | Optional | 1 | |
description | | String | Description of the read group. | Optional | 1 | |
comment | | String | Comment(s) to include in the output file’s header. | Optional | Unlimited | |
run-date | | Iso8601Date | Date the run was produced, to insert into the read group header | Optional | 1 | |