View the project on GitHub. jakob-schuster/matchbox

Patterns

matchbox's key feature is the ability to pattern-match on sequences, while tolerating mismatches and extracting particular slices of reads.

With pattern matching, matchbox can be used to filter, trim or demultiplex reads.


Branches

Pattern-matching takes place in the branches of an if .. is statement:

# trim off polyA tails
if read is 
    [_ AAAAAAAA after:_] => after.out!('trimmed.fa')
    [_] => read.out!('no_polya.fa')

Branches are separated by semicolons or newlines. Each branch is tried in order, until one is successful. Only the first successful branch will be executed.
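The first-match rule can be modelled outside matchbox. The Python sketch below is only an illustration of the branch order in the polyA example above, not matchbox's implementation. It assumes exact matching (no error tolerance) and considers only the first occurrence of the polyA sequence. The function name `trim_polya` is hypothetical.

```python
def trim_polya(read, polya="AAAAAAAA"):
    """Return (output_file, sequence) for the first branch that matches."""
    # Branch 1: [_ AAAAAAAA after:_]
    # Assumption: exact match only, first occurrence only.
    idx = read.find(polya)
    if idx != -1:
        after = read[idx + len(polya):]
        return ("trimmed.fa", after)
    # Branch 2: [_] matches any read, so it acts as the fallback.
    # It is only reached when branch 1 did not match.
    return ("no_polya.fa", read)


print(trim_polya("GGCC" + "A" * 8 + "TTGA"))  # first branch matches
print(trim_polya("GGCCTTGA"))                 # falls through to [_]
```

Because `[_]` matches every read, placing it first would stop the polyA branch from ever running, which is why the more specific branch comes first.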

Info: When a pattern matches at multiple places in a single read (e.g. the pattern above applied to a read containing several polyA regions), the body of the branch is executed once for each instance of the match. This may cause the same read to produce multiple outputs.

If you want to trigger multiple statements from a single branch, use curly braces:

# trim off polyA tails
if read is 
    [_ AAAAAAAA after:_] => {
        after.out!('trimmed.fa')
        count!('polyA reads')
    }
    [_] => count!('other reads')

Patterns and regions

A pattern is a series of regions enclosed in square brackets []. Patterns are composed of the following basic regions:

Name Syntax Description
Wild _ Matches any number of any nucleotides.
Fixed-length |n| or |n:r| Matches exactly n nucleotides, where n is an expression of type Num. Optionally, may contain another region r inside it.
Known sequence s Matches a known sequence s, where s is an expression of type Str. This may be a sequence literal such as AAAAAAAA, or a variable name such as primer. Allows up to d bases of edit distance, where d is the length of s times the global error rate.
Named a:r Matches against the inner region r, and assigns the name a to refer to the matched slice of the read.
Grouped (r*) A sub-pattern, consisting of a series of regions.
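
The edit-distance allowance for known sequences can be worked through with a small example. The Python sketch below computes the budget as sequence length times error rate and checks a candidate against it using Levenshtein distance. It is an illustration only: the rounding rule (rounding down), the error rates used, and the function names are assumptions, not taken from matchbox.

```python
import math


def allowed_edits(seq, error_rate):
    """Edit budget d = len(seq) * error_rate.

    Assumption: fractional budgets are rounded down.
    """
    return math.floor(len(seq) * error_rate)


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def within_budget(known, candidate, error_rate):
    """True if candidate is close enough to the known sequence."""
    return edit_distance(known, candidate) <= allowed_edits(known, error_rate)


# Hypothetical error rate of 0.125: an 8 bp sequence tolerates 1 edit.
print(within_budget("AAAAAAAA", "AAATAAAA", 0.125))  # one substitution
print(within_budget("AAAAAAAA", "AATTAAAA", 0.125))  # two substitutions
```

Under these assumptions, longer known sequences tolerate more edits, so a short literal may need to match exactly while a long primer can absorb a few sequencing errors.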

The whole pattern acts like a schematic for the read, representing the whole read from left to right.

Here are some examples of patterns, and how to read them:

Example Description
[_] Any read, of any length.
[|10|] A read of exactly 10 bp.
[|10| _] A read with at least 10 bp at the start.
[first:|10| _] A read with at least 10 bp at the start. Extract these first 10 bp, and call the slice first.
[first:|10| rest:_] A read with at least 10 bp at the start. Extract these first 10 bp, and call the slice first. Extract the rest of the read, and call it rest.
[_ AAAAAAAAAA _] A read containing the sequence AAAAAAAAAA.
[fst:_ AAAAAAAAAA _] A read containing the sequence AAAAAAAAAA. Extract the bases before the sequence, and call them fst.
[_ AAAA _ TTTT _] A read containing the sequence AAAA, and later, TTTT.
[_ AAAA mid:_ TTTT _] A read containing the sequence AAAA, and later, TTTT. Take the bases in between and call the slice mid.
[|40:(_ AAAA _)| _] A read with at least 40 bp, which contains the sequence AAAA in the first 40 bp.
[|40:(_ AAAA x:_)| _] A read with at least 40 bp, which contains the sequence AAAA in the first 40 bp. Extract the remaining bases from after the sequence AAAA up to the 40th base, and call the slice x.
Info: A fixed-length region |n| cannot be placed between two wild regions. It must be anchored on at least one side by an end of the read or by a known sequence. Otherwise, a pattern like [_ |10| _] would denote "all possible selections of 10 bp from the read". matchbox currently rejects such patterns, as they would be computationally expensive and probably confusing.
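
To make the left-to-right reading of patterns concrete, the Python sketch below mimics two of the example patterns with ordinary string slicing. It is a conceptual model only: it assumes exact matching and takes the first occurrence of the sequence, and the function names are hypothetical.

```python
def first_and_rest(read, n=10):
    """Model of [first:|10| rest:_]: the read needs at least n bases."""
    if len(read) < n:
        return None  # pattern does not match
    return {"first": read[:n], "rest": read[n:]}


def after_motif_in_window(read, motif="AAAA", window=40):
    """Model of [|40:(_ AAAA x:_)| _].

    The sub-pattern is matched only against the first `window` bases,
    so x ends at base `window`, not at the end of the read.
    Assumption: exact match, first occurrence only.
    """
    if len(read) < window:
        return None
    head = read[:window]
    idx = head.find(motif)
    if idx == -1:
        return None
    return {"x": head[idx + len(motif):]}


print(first_and_rest("ACGTACGTACGGTT"))
print(after_motif_in_window("CC" + "AAAA" + "G" * 34 + "TTTT"))
```

In the second function, a sequence that only appears after base 40 does not count, which mirrors how the fixed-length region confines its inner sub-pattern.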

Parameters using for

Patterns can also be parameterised, which is useful when demultiplexing or matching against a list of known sequences.

primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')

if read is [_ primer bc.seq _] for bc in bcs =>
    read.tag('barcode={bc.id}')
        .out!('demultiplexed.fq')

Using the syntax for bc in bcs, a name bc is bound to a single value from the list of values bcs. The name bc can then be used in the body of the branch.

Info: When multiple values from the list could be used to satisfy the pattern (e.g. when multiple barcodes match in a read), the body of the branch is executed for all values that satisfy the pattern.
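
The behaviour of a parameterised pattern can be sketched as a loop over the candidate values. The Python model below tries every barcode against the read and collects a tag for each one that fits `[_ primer bc.seq _]`, that is, the primer immediately followed by the barcode. It assumes exact matching, represents the barcode list as a plain dictionary of id to sequence, and uses a hypothetical function name. None of this reflects matchbox's internals.

```python
def demultiplex(read, primer, barcodes):
    """Return one tag per barcode that satisfies [_ primer bc.seq _].

    barcodes: dict mapping barcode id -> barcode sequence
    (a stand-in for the records loaded from a FASTA file).
    Assumption: exact matching, no error tolerance.
    """
    tags = []
    for bc_id, bc_seq in barcodes.items():
        # The primer must be directly followed by the barcode.
        if primer + bc_seq in read:
            tags.append(f"barcode={bc_id}")
    return tags


primer = "AGCTAGTCGATGC"
barcodes = {"bc1": "ACGT", "bc2": "TTGG"}  # hypothetical barcodes

print(demultiplex("GG" + primer + "ACGT" + "CCC", primer, barcodes))
print(demultiplex("GG" + primer + "ACGT" + primer + "TTGG", primer, barcodes))
```

The second read carries both barcodes, so the sketch returns two tags, matching the rule that the branch body runs for every value that satisfies the pattern.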