View the project on GitHub. jakob-schuster/matchbox
Often, we have known sequences such as primers that we want to search for in our reads.
primer = AGCTAGCTGATGCTAGCTGG
if read is
[_ primer _] => count!('found primer')
[_] => count!('other')
info | The global error rate parameter will be used when searching for sequences. In this case, an error rate of 0.2 with this 20-bp primer sequence will allow a Levenshtein distance of 4 bp. |
If we also want to search for the reverse complement of the primer:
primer = AGCTAGCTGATGCTAGCTGG
if read is
[_ primer _] => count!('found primer')
[_ -primer _] => count!('found reverse primer')
[_] => count!('other')
If we want to filter our reads into different files:
primer = AGCTAGCTGATGCTAGCTGG
if read is
[_ primer _] => read.out!('primer.fq')
[_ -primer _] => read.out!('rev_primer.fq')
[_] => read.out!('other.fq')
info |
If you're expecting FASTQ output, like in the line out!('primer.fq') , then you can't provide FASTA reads as input, because they don't have a quality score. Similarly, SAM output can only be produced from SAM input, since they carry more metadata than FASTQ.
|
Often, we have multiple known sequences we want to search for.
primer1 = AGCTAGCTGATGCTAGCTGG
primer2 = AGCTAGCTGATGCTAGCTGG
if read is
[_ primer1 _ primer2 _] => count!('both primers')
[_] => count!('one or neither')
Instead of writing out patterns for each primer, we can use the describe
function to get counts for each arrangement of the primers in the data:
read.describe({
primer1 = AGCTAGCTGATGCTAGCTGG
primer2 = AGCTAGCTGATGCTAGCTGG
}).count!()
The optional arguments reverse_complement
and error
can be used to customise this search:
primers = {
primer1 = AGCTAGCTGATGCTAGCTGG
primer2 = AGCTAGCTGATGCTAGCTGG
}
# search only for exact matches, and search for both
# forward and reverse-complement sequences
read.describe(
primers, reverse_complement = true, error = 0.0
).count!()