View the project on GitHub. jakob-schuster/matchbox

Function list

Functions are used to manipulate and transform data. Most of the time, users will be applying built-in matchbox functions. This list includes every built-in function with examples.




len: Num

Calculates the length of a string.

Parameter | Description
s: Str | The string to calculate the length of.

# get the average read length
read.seq.len().average!()



slice: Str

Takes a slice of a string. Inclusive of start position, exclusive of end position.

Parameter | Description
s: Str | The string to slice.
start: Num | The start position.
end: Num | The end position.

# equal to 'el' (start is inclusive, end is exclusive, positions are 0-based)
sliced_string = slice('hello', 1, 3)
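
Python's slice notation uses the same half-open convention, so it gives a quick way to check which characters a given start and end keep. This is a sketch of the semantics only, not matchbox code. It assumes 0-based positions (as documented for find_first), and `slice_str` is a hypothetical stand-in name.

```python
def slice_str(s: str, start: int, end: int) -> str:
    """Hypothetical stand-in for slice: start is inclusive, end is exclusive."""
    return s[start:end]


word = "hello"
# Positions 1 and 2 are kept; position 3 is excluded.
print(slice_str(word, 1, 3))
# Extending the end by one also keeps position 3.
print(slice_str(word, 1, 4))
```

The length of the result is always end minus start, provided both positions fall inside the string.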



tag: Read

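Copies a read and appends a string to the end of its description line. By default, a space separates the existing description from the new tag; the prefix parameter changes this separator.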

Parameter | Description
read: Read | The read to tag.
tag: Str | The string to append to the read's description line.
prefix: Str = ' ' | A string inserted between the existing description line and the new tag.

# trim off the first 10 bases,
# and tag the read with their sequence
if read is [bc:|10| rest:_] =>
    rest.tag('barcode={bc.seq}')
        .out!('out.fa')
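
The copy-and-append behaviour described above can be sketched in Python. The `Read` class and its field layout are assumptions made for illustration (only the id, desc and seq field names come from this page), and the sketch does not model how matchbox handles an empty description line.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read with the fields used on this page."""
    id: str
    desc: str
    seq: str


def tag(read: Read, label: str, prefix: str = " ") -> Read:
    """Return a copy of the read with prefix + label appended to its description."""
    return replace(read, desc=f"{read.desc}{prefix}{label}")


original = Read(id="r1", desc="sample=A", seq="ACGTACGT")
tagged = tag(original, "barcode=ACGT")
print(tagged.desc)    # the new description line
print(original.desc)  # the original read is left unchanged
```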



translate: Str

Translates a string from nucleotide to protein sequence. Naively assumes that you've given it a string representing a valid sequence of nucleotides. When the input string contains an invalid codon (including if the input string contains characters aside from A, C, T and G), a '?' character is produced.

Parameter | Description
seq: Str | The sequence to translate.

seq = AGCCCTCCAGGACAGGCTGCATCAGAAGAG
# will translate to SPPGQAASEE
prot = translate(seq)
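
For readers who want to check a translation by hand, here is a minimal Python sketch of codon-by-codon translation with '?' for invalid codons. It assumes the standard genetic code, writes stop codons as '*', and silently drops a trailing incomplete codon; none of those three choices is confirmed by this page, so treat them as assumptions rather than a description of matchbox's internals.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, then third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str) -> str:
    """Translate nucleotides to protein; unknown codons become '?'."""
    usable = len(seq) - len(seq) % 3  # assumption: ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)


print(translate("AGCCCTCCAGGACAGGCTGCATCAGAAGAG"))
print(translate("AGCNNNCCT"))  # the middle codon is not valid
```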



str_concat: Str

Concatenates two strings.

Parameter | Description
s0: Str | The first string.
s1: Str | The second string.

s0 = 'hello'
s1 = 'world'

# all equivalent
a = str_concat(s0, s1)
b = '{s0}{s1}'
c = 'helloworld'



concat: Read

Concatenates two reads together.

Parameter | Description
r1: Read | The first read.
r2: Read | The second read.

# locate a known sequence, and cut it out -
# concatenating everything before the sequence
# onto everything after it
if read is [before:_ AGCTAGTCG after:_] => 
    before
        .concat(after)
        .out!('primer_excluded.fq')

# glue together the forward read1 and the reverse read2
read.r1.concat(-read.r2).out!('glued.fq')



csv: [Record]

Opens a CSV and produces a list of records. The field names of each record correspond to the header names of the CSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire CSV is loaded into memory.

Parameter | Description
filename: Str | The CSV file to open.

primer = AGCTAGTCGATGC
bcs = csv('my_barcodes.csv')

if read is [_ primer bc.sequence _] for bc in bcs =>
    read.tag('barcode={bc.barcode_name}')
        .out!('demultiplexed.fq')
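
To make the "header names become field names" rule concrete, the following Python sketch loads a small in-memory CSV into a list of records using the standard csv module. The column names barcode_name and sequence mirror the example above; the two data rows are invented for illustration, and this is an analogy for the behaviour described, not matchbox's loader.

```python
import csv
import io

# Invented sample data; a real file would be read from disk instead.
CSV_TEXT = "barcode_name,sequence\nbc01,ACGTACGT\nbc02,TTGGCCAA\n"


def load_records(text: str) -> list[dict[str, str]]:
    """Read the whole CSV into memory as one record (dict) per row."""
    return list(csv.DictReader(io.StringIO(text)))


records = load_records(CSV_TEXT)
for record in records:
    # Fields are looked up by the header name of their column.
    print(record["barcode_name"], record["sequence"])
```

Because the whole list is built up front, every row stays in memory for the length of the run, which is worth keeping in mind for very large barcode sheets.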



tsv: [Record]

Opens a TSV and produces a list of records. The field names of each record correspond to the header names of the TSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire TSV is loaded into memory.

Parameter | Description
filename: Str | The TSV file to open.

primer = AGCTAGTCGATGC
bcs = tsv('my_barcodes.tsv')

if read is [_ primer bc.sequence _] for bc in bcs =>
    read.tag('barcode={bc.barcode_name}')
        .out!('demultiplexed.fq')



fasta: [Read]

Opens a FASTA and produces a list of reads, each one containing seq, id and desc fields. This processing occurs once, at the start of execution. The entire FASTA is loaded into memory.

Parameter | Description
filename: Str | The FASTA file to open.

primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')

if read is [_ primer bc.seq _] for bc in bcs =>
    read.tag('barcode={bc.id}')
        .out!('demultiplexed.fq')



find_first: Num

Searches for a substring within a string. If the substring is present, returns the first 0-based position within the string where the substring could be found. If the substring is not present, returns -1.

Parameter | Description
s0: Str | The string to search through.
s1: Str | The substring to search for in s0.

# get the protein translation of the read
protein = read.seq.translate()
# find the first Cysteine amino acid in the sequence
location = protein.find_first('C')
# if a Cysteine could be found, trim the read,
# keeping everything from the first Cys codon onwards
if location != -1 =>
    if read is [|location*3| rest:_] =>
        rest.out!('trimmed.fq')



find_last: Num

Searches for a substring within a string. If the substring is present, returns the last 0-based position within the string where the substring could be found. If the substring is not present, returns -1.

Parameter | Description
s0: Str | The string to search through.
s1: Str | The substring to search for in s0.

# get the protein translation of the read
protein = read.seq.translate()
# find the last Cysteine amino acid in the sequence
location = protein.find_last('C')
# if a Cysteine could be found, trim the read,
# only keeping everything before the last Cys codon
# (the first location*3 bases)
if location != -1 =>
    if read is [rest:|location*3| _] =>
        rest.out!('trimmed.fq')
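
Both searches return a 0-based position in the protein string, and the examples multiply it by 3 to get a nucleotide offset. Python's str.find and str.rfind share the same "-1 when absent" convention, so the arithmetic can be sketched as follows. The `codon_offset` helper and the sample protein are hypothetical.

```python
def codon_offset(protein: str, residue: str, last: bool = False) -> int:
    """Nucleotide offset of the codon for the first (or last) matching residue.

    Returns -1 when the residue is absent, mirroring find_first / find_last.
    """
    position = protein.rfind(residue) if last else protein.find(residue)
    return -1 if position == -1 else position * 3


protein = "MKCAAC"
print(codon_offset(protein, "C"))             # offset of the first Cys codon
print(codon_offset(protein, "C", last=True))  # offset of the last Cys codon
print(codon_offset(protein, "W"))             # absent residue
```

Bases before the returned offset lie upstream of the codon, and the codon itself occupies the three bases starting at that offset.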



to_upper: Str

Converts a string to upper-case. Non-alphabetic characters are unaffected.

Parameter | Description
s: Str | The string to convert to upper-case.

loud_hello = 'hello'.to_upper()



to_lower: Str

Converts a string to lower-case. Non-alphabetic characters are unaffected.

Parameter | Description
s: Str | The string to convert to lower-case.

quiet_hello = 'HELLO'.to_lower()



describe: Str

Searches for a set of sequences within a read's seq field, and returns the pattern which most precisely describes the read, as a Str. Very useful for exploring the arrangement of known primers and static regions in your data!

Each search term is searched for within the read. If reverse_complement is true, the reverse complement of each search term is also searched for. The edit distance permitted for each search term scales with the length of its sequence and the error parameter. Any matches are then concatenated together to produce a pattern string which describes the read in terms of the searched sequences.

Parameter | Description
read: Read | The read to describe.
search_terms: Record | The set of terms to search for. Each field of the record must have a Str value, which represents the sequence to search for. Each field's name is used as a label for the sequence in the pattern.
reverse_complement: Bool = false | Whether to additionally search for the reverse complement of each search term.
error: Num = 0 | The error rate, as a proportion of each search term's length, to allow when searching for that term.

read.describe(
    { tso = AGCATGCTGATG, rtp = GATCGTACGTGTTG },
    reverse_complement = true
) |> count!()



contains: Bool

Checks whether a value is present in a list.

Parameter | Description
list: [Any] | The list to search through.
val: Any | The value to search for.

# we have a one-column CSV, 
# containing the names of a subset of 
# reads we're interested in
names = csv('names.csv')

# if the read is on the list,
# include it in the subset
if names.contains({name = read.id}) =>
    read |> out!('subset.fq')
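
The example wraps read.id in a record because csv returns a list of records, not a list of strings. A Python analogy with dicts standing in for records shows why the bare ID would not match. The sample names are invented, and the sketch assumes the CSV header is `name`, as the example implies.

```python
# Stand-in for csv('names.csv') with a single 'name' column (invented rows).
names = [{"name": "read_001"}, {"name": "read_007"}]


def contains(items: list, value) -> bool:
    """True when some element of the list equals the value."""
    return value in items


# A record compares equal to another record with the same fields.
print(contains(names, {"name": "read_007"}))
# A bare string is never equal to a record, so this lookup fails.
print(contains(names, "read_007"))
```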



distance: Num

Calculates the global edit distance between two strings.

Parameter | Description
s0: Str | The first string.
s1: Str | The second string.

# equal to 1
d = distance('cat', 'bat')
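
As a reference point for what a global edit distance measures, here is a Python sketch of the classic Levenshtein dynamic programme, where each substitution, insertion and deletion costs 1. Whether matchbox uses exactly these unit costs is an assumption; the page only states that the distance is global.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping one row of the table at a time."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("cat", "bat"))
print(edit_distance("kitten", "sitting"))
```

"Global" here means both strings are compared end to end, so unmatched leading or trailing characters count towards the distance.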



to_str: Str

Converts any value to a Str. Equivalent to formatting the value in a string literal (e.g. '{val}').

Parameter | Description
v: Any | The value to convert.

# convert the length of the sequence (a Num) into a Str
length_str = read.seq.len().to_str()
# calculate the length of the Str
number_of_digits = length_str.len()



to_num: Num

Parses a Str into a Num. When given a value that can't be parsed into a floating-point number, throws an error.

Parameter | Description
s: Str | The string to parse.

'100'.to_num()



stdout!: Effect

Prints any value directly to stdout.

Parameter | Description
val: Any | The value to print.

# print out each read's ID
stdout!(read.id)



out!: Effect

Prints any value to a file. Based on the filename, the format of the output will be inferred. When the filetype denotes FASTA, FASTQ or SAM format, the value provided must be a read containing the following fields:

Input format | Fields
FASTA (.fa, .fasta) | seq: Str; id: Str; desc: Str
FASTQ (.fq, .fastq) | seq: Str; id: Str; desc: Str; qual: Str
SAM (.sam, .bam) | id or qname: Str; flag: Num; rname: Str; pos: Num; mapq: Num; cigar: Str; rnext: Str; pnext: Num; tlen: Num; seq: Str; qual: Str; desc: Str

If the file is any other format, it is treated as a plain text file, and any values sent to it are simply printed directly and separated by newlines.

A file is only created when the first

Parameter | Description
val: Any | The value to print.
filename: Str | The name of the file to write to.

if read is [|10| rest:_] => rest.out!('trimmed.fq')
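
The extension-based format inference described above can be pictured as a lookup with a plain-text fallback. The Python sketch below uses only the extensions listed in the table. The function name is hypothetical, and the case-sensitive matching and lack of special handling for compressed files are assumptions, since this page does not cover either.

```python
from pathlib import Path

# Extensions taken from the format table; anything else is treated as plain text.
FORMATS = {
    ".fa": "FASTA",
    ".fasta": "FASTA",
    ".fq": "FASTQ",
    ".fastq": "FASTQ",
    ".sam": "SAM",
    ".bam": "SAM",
}


def infer_format(filename: str) -> str:
    """Pick an output format from the final extension of the filename."""
    return FORMATS.get(Path(filename).suffix, "text")


for name in ["trimmed.fq", "aligned.bam", "barcodes.txt"]:
    print(name, "->", infer_format(name))
```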



count!: Effect

Collects all of the values sent to count! across all of the reads. Tallies up the number of times each unique value is sent. At the end of execution, prints out a table of counts for each value.

Can be used for quantifying how many reads match certain criteria, or for counting occurrences of sequences such as barcodes. To generate multiple separate counts tables from a single pass, the name parameter can be used to identify different global counts tables. Each name corresponds to a fresh table.

Parameter | Description
val: Any | The value to be counted.
name: Str = 'default' | The name of the global counts table to store the value into. Multiple tables of counts can be generated, using different names.

# count each read towards the 'total'
count!('total')

# for each read with a length over 1000,
# count it towards the 'long' total
if read.seq.len() > 1000 => count!('long')

primer = AGCTAGCTGA

# all reads count to the 'total' in the 'read types' table
count!('total', name='read types')

# find the primer, and then count the 
# occurrences of unique 10-bp sequences following it
if read is [_ primer bc:|10| _] => {
    # reads with the primer are counted
    # in the 'read types' table
    count!('found primer', name='read types')

    # unique barcodes are tracked by the 'barcodes' table
    bc.seq.count!(name = 'barcodes')
}
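
The tallying behaviour, one table per name with a running count per unique value, maps naturally onto a dictionary of counters. The Python sketch below mimics the second example on three invented reads. It uses a plain substring search in place of matchbox's pattern matching, so it is an analogy for the bookkeeping only.

```python
from collections import Counter, defaultdict

# One Counter per table name, created the first time the name is used.
tables = defaultdict(Counter)


def count(value, name: str = "default") -> None:
    """Tally one occurrence of value in the named table."""
    tables[name][value] += 1


primer = "AGCTAGCTGA"
reads = [
    primer + "AAAAACCCCC" + "TT",
    primer + "AAAAACCCCC" + "GG",
    "TTTT",
]

for seq in reads:
    count("total", name="read types")
    start = seq.find(primer)
    barcode = seq[start + len(primer):start + len(primer) + 10]
    if start != -1 and len(barcode) == 10:
        count("found primer", name="read types")
        count(barcode, name="barcodes")

# At the end of the run, print each table of counts.
for name, table in tables.items():
    print(name)
    for value, n in table.most_common():
        print(f"  {value}\t{n}")
```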



average!: Effect

Collects all of the numeric values sent to average! across all of the reads. To simultaneously calculate multiple averages, the name parameter can be used. At the end of execution, prints out the mean and variance.

To avoid having to store all the numeric values, variance is calculated using Welford's online algorithm.

Parameter | Description
num: Num | The number to contribute to the average.
name: Str = 'default' | The name of the global variable to store the average into. Multiple averages can be calculated at once, using different names.

# calculate the average length of all sequences
read.seq.len().average!()

# calculate average sequence length, stored under its own name
read.seq.len().average!(name='sequence length')

primer = AGCTAGCTGA

# for reads that contain the primer,
# calculate the average position at which it occurs
if read is [before:_ primer _] =>
    before.seq.len().average!(name='primer position')
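
Welford's online algorithm, mentioned above, updates the mean and a sum of squared deviations one value at a time, so no values need to be kept. A minimal Python sketch follows. It reports the population variance (dividing by n), which is an assumption, since this page does not say whether matchbox prints the population or the sample variance.

```python
class RunningStats:
    """Running mean and variance using Welford's online algorithm."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Assumption: population variance; use (n - 1) for the sample variance.
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for length in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(length)
print(stats.mean, stats.variance())
```

Only three numbers are stored per named average, regardless of how many reads are processed.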