View the project on GitHub. jakob-schuster/matchbox
Often, we have a reference file of known sequences such as barcodes, and we want to assign each read to its barcode.
Before loading a reference file, it can be useful to quantify the barcodes that are actually present in the reads. This may be used to create a subset of the barcode reference file, as per the approach of flexiplex.
If the barcode flanks a known sequence:
primer = AGCTAGTCGATGC
# build a count matrix of all the raw sequences
# found at the expected barcode location
if read is [_ primer bc:|16| _] => bc.seq.count!()
Typically, demultiplexing involves a list of known barcodes
If your barcodes are in FASTA format, the fasta
function can be used. Then, to iterate over the barcodes, a pattern can be parameterised using for
:
primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')
if read is [_ primer bc.seq _] for bc in bcs =>
read.tag('barcode={bc.id}')
.out!('demultiplexed.fq')
Alternatively, CSV or TSV files can be used. The values returned by the csv
and tsv
functions contain fields as named in the CSV/TSV header. Hence, for the following CSV, the fields would be barcode_name
and sequence
.
barcode_name,sequence
bc1,AAAA
bc2,GGGG
primer = AGCTAGTCGATGC
bcs = csv('my_barcodes.csv')
if read is [_ primer bc.sequence _] for bc in bcs =>
read.tag('barcode={bc.barcode_name}')
.out!('demultiplexed.fq')
In the above examples, each read is tagged with the name of its barcode, and all successfully demultiplexed reads are aggregated into demultiplexed.fq
.
info |
The fasta , csv and tsv functions load the entire reference file into memory at the start of execution. matchbox currently doesn't support reference files which are too large to load into memory!
|
It can be preferable to create separate files for each barcode:
primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')
if read is [_ primer bc.seq _] for bc in bcs =>
read.out!('{bc.id}_reads.fq')
Sometimes, multiple distinct barcodes are present in a read.
If two barcodes come from separate barcode lists, two references can be loaded in. Multiple for
parameters can be given, separated by commas:
primer = AGCTAGTCGATGC
bcs1 = fasta('my_barcodes1.fa')
bcs2 = fasta('my_barcodes2.fa')
if read is [_ primer bc1.seq linker bc2.seq _]
for bc1 in bcs1, bc2 in bcs2 =>
read.tag('barcode1={bc1.name}')
.tag('barcode2={bc2.name}')
.out!('demultiplexed.fq')
If two barcodes are distinct barcodes from the same reference list, we can load the reference file once and then create two parameters over it:
primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')
if read is [_ primer bc1.seq linker bc2.seq _]
for bc1 in bcs, bc2 in bcs =>
read.tag('barcode1={bc1.name}')
.tag('barcode2={bc2.name}')
.out!('demultiplexed.fq')
If we expect to find the exact same barcode in two places within a read, we can use a single parameter and search for it in multiple places in the read pattern:
primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')
if read is [_ primer bc.seq linker bc.seq _]
for bc in bcs =>
read.tag('barcode={bc.name}')
.out!('demultiplexed.fq')
Sometimes a single reference contains multiple parts, which must all match in different places in a read.
name,part1,part2
bc1,AAAA,GGGGAC
bc2,GGGG,ACGATG
refs = csv('my_reference.csv')
if read is [_ static1 ref.part1 static2 ref.part2 _]
for ref in refs => read.out!('{ref.name}.fa')