Function list

Functions are used to manipulate and transform data. Most of the time, users will be applying built-in matchbox functions. This list includes every built-in function with examples.

`len:` `Num`

Calculates the length of a string.

Parameter	Description
`s:` `Str`	The string to calculate the length of.

# get the average read length
read.seq.len().average!()

`slice:` `Str`

Takes a slice of a string. Inclusive of start position, exclusive of end position.

Parameter	Description
`s:` `Str`	The string to slice.
`start:` `Num`	The start position.
`end:` `Num`	The end position.

# equal to 'el' (positions 1 and 2; position 3 is excluded)
sliced_string = slice('hello', 1, 3)
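
The half-open convention described above (start included, end excluded) can be illustrated with a small Python sketch. This is not matchbox code; the helper name `slice_str` is hypothetical, and positions are assumed to be 0-based, matching `find_first`.

```python
def slice_str(s: str, start: int, end: int) -> str:
    """Half-open slice: keeps s[start] up to, but not including, s[end]."""
    return s[start:end]


print(slice_str("hello", 1, 3))  # el
print(slice_str("hello", 1, 4))  # ell
```

Under this convention the length of the result is simply `end - start`.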

`tag:` `Read`

Copies a read and appends a string to the end of its description line. By default, a single space is inserted between the existing description and the new tag.

Parameter	Description
`read:` `Read`	The read to tag.
`tag:` `Str`	The string to append to the read's description line.
`prefix:` `Str` `= ' '`	A string inserted between the existing description line and the new tag.

# trim off the first 10 bases,
# and tag the read with their sequence
if read is [bc:|10| rest:_] =>
    rest.tag('barcode={bc.seq}')
        .out!('out.fa')

`translate:` `Str`

Translates a string from nucleotide to protein sequence. Naively assumes that you've given it a string representing a valid sequence of nucleotides. Stop codons are represented as hyphen characters (-). When the input string contains an invalid codon (i.e. when the input string contains characters aside from A, C, T and G), a ? character is produced.

Parameter	Description
`seq:` `Str`	The sequence to translate.

seq = AGCCCTCCAGGACAGGCTGCATCAGAAGAG
# will translate to SPPGQAASEE
prot = translate(seq)
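
The translation rules above can be sketched in Python as follows. This is an illustrative sketch, not matchbox's implementation: it assumes the standard genetic code, and it assumes that trailing bases which do not form a full codon are ignored (the description above does not say how they are handled).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# Stop codons are written as '-' to match the description above.
BASES = "TCAG"
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq: str) -> str:
    """Translate codon by codon; unknown codons become '?'."""
    # Assumption: a trailing partial codon (1 or 2 bases) is dropped.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)


print(translate("AGCCCTCCAGGACAGGCTGCATCAGAAGAG"))  # SPPGQAASEE
print(translate("ATGTAAANG"))  # M-?
```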

`str_concat:` `Str`

Concatenates two strings.

Parameter	Description
`s0:` `Str`	The first string.
`s1:` `Str`	The second string.

s0 = 'hello'
s1 = 'world'

# all equivalent
a = str_concat(s0, s1)
b = '{s0}{s1}'
c = 'helloworld'

`concat:` `Read`

Concatenates two reads together.

Parameter	Description
`r1:` `Read`	The first read.
`r2:` `Read`	The second read.

# locate a known sequence, and cut it out -
# concatenating everything before the sequence
# onto everything after it
if read is [before:_ AGCTAGTCG after:_] => 
    before
        .concat(after)
        .out!('primer_excluded.fq')

# glue together the forward read1 and the reverse read2
read.r1.concat(-read.r2).out!('glued.fq')

`csv:` `[Record]`

Opens a CSV and produces a list of records. The field names of each record correspond to the header names of the CSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire CSV is loaded into memory.

Parameter	Description
`filename:` `Str`	The CSV file to open.

primer = AGCTAGTCGATGC
bcs = csv('my_barcodes.csv')

if read is [_ primer bc.sequence _] for bc in bcs =>
    read.tag('barcode={bc.barcode_name}')
        .out!('demultiplexed.fq')
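
To make the "list of records" idea concrete, here is a rough Python equivalent using the standard `csv` module. The file contents are made up for illustration; only the column names `barcode_name` and `sequence` come from the example above.

```python
import csv
import io


def load_records(text: str) -> list[dict[str, str]]:
    """Read the whole table into memory: one dict per row, keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical contents of 'my_barcodes.csv'.
table = "barcode_name,sequence\nbc01,ACGTACGT\nbc02,TTGGCCAA\n"
records = load_records(table)

for record in records:
    print(record["barcode_name"], record["sequence"])
```

Each record exposes its values by header name, which is what lets the matchbox example refer to `bc.sequence` and `bc.barcode_name`.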

`tsv:` `[Record]`

Opens a TSV and produces a list of records. The field names of each record correspond to the header names of the TSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire TSV is loaded into memory.

Parameter	Description
`filename:` `Str`	The TSV file to open.

primer = AGCTAGTCGATGC
bcs = tsv('my_barcodes.tsv')

if read is [_ primer bc.sequence _] for bc in bcs =>
    read.tag('barcode={bc.barcode_name}')
        .out!('demultiplexed.fq')

`fasta:` `[Read]`

Opens a FASTA and produces a list of reads, each one containing seq, id and desc fields. This processing occurs once, at the start of execution. The entire FASTA is loaded into memory.

Parameter	Description
`filename:` `Str`	The FASTA file to open.

primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')

if read is [_ primer bc.seq _] for bc in bcs =>
    read.tag('barcode={bc.id}')
        .out!('demultiplexed.fq')

`find_first:` `Num`

Searches for a substring within a string. If the substring is present, returns the first 0-based position within the string where the substring could be found. If the substring is not present, returns -1.

Parameter	Description
`s0:` `Str`	The string to search through.
`s1:` `Str`	The substring to search for in `s0`.

# get the protein translation of the read
protein = read.seq.translate()
# find the first Cysteine amino acid in the sequence
location = protein.find_first('C')
# if a Cysteine could be found, trim the read,
# keeping everything from the first Cys codon onwards
if location != -1 =>
    if read is [|location*3| rest:_] =>
        rest.out!('trimmed.fq')
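
The return convention (0-based position, or -1 when absent) matches Python's `str.find` and `str.rfind`, so the behaviour can be sketched as below. The wrapper names mirror the matchbox functions but this is only an illustration; the protein string is made up.

```python
def find_first(s0: str, s1: str) -> int:
    """0-based position of the first occurrence of s1 in s0, or -1."""
    return s0.find(s1)


def find_last(s0: str, s1: str) -> int:
    """0-based position of the last occurrence of s1 in s0, or -1."""
    return s0.rfind(s1)


protein = "SPCGQCAS"  # hypothetical translation of a read
first = find_first(protein, "C")
print(first)                     # 2
print(find_last(protein, "C"))   # 5
print(find_first(protein, "W"))  # -1

# Amino acid position p starts at nucleotide offset 3 * p,
# which is why the example above skips location*3 bases.
print(first * 3)  # 6
```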

`find_last:` `Num`

Searches for a substring within a string. If the substring is present, returns the last 0-based position within the string where the substring could be found. If the substring is not present, returns -1.

Parameter	Description
`s0:` `Str`	The string to search through.
`s1:` `Str`	The substring to search for in `s0`.

# get the protein translation of the read
protein = read.seq.translate()
# find the last Cysteine amino acid in the sequence
location = protein.find_last('C')
# if a Cysteine could be found, trim the read,
# only keeping everything before the last Cys codon
if location != -1 =>
    if read is [rest:|location*3| _] =>
        rest.out!('trimmed.fq')

`to_upper:` `Str`

Converts a string to upper-case. Non-alphabetic characters are unaffected.

Parameter	Description
`s:` `Str`	The string to convert to upper-case.

loud_hello = 'hello'.to_upper()

`to_lower:` `Str`

Converts a string to lower-case. Non-alphabetic characters are unaffected.

Parameter	Description
`s:` `Str`	The string to convert to lower-case.

quiet_hello = 'HELLO'.to_lower()

`describe:` `Str`

Searches for a set of sequences within a read's seq field, and returns the pattern which most precisely describes the read, as a Str. Very useful for exploring the arrangement of known primers and static regions in your data!

Each search term is searched within the read. If reverse_complement is true, the reverse-complement sequences are also searched. Edit distance is allowed in proportion to the length of each search term sequence and the error parameter. Any matches are then concatenated together to produce a pattern string which describes the read in terms of the searched sequences.

Parameter	Description
`read:` `Read`	The read to describe.
`search_terms:` `Record`	The set of terms to search for. Each field of the record must have a `Str` value, which represents the sequence to search for. Each field's name is used as a label for the sequence in the pattern.
`reverse_complement:` `Bool` `= false`	Whether to additionally search for the reverse complement of each search term.
`error:` `Num` `= 0`	Error rate proportion to allow when searching for each search term.

read.describe(
    { tso = AGCATGCTGATG, rtp = GATCGTACGTGTTG },
    reverse_complement = true
) |> count!()

`contains:` `Bool`

Checks whether a value is present in a list.

Parameter	Description
`list:` `[Any]`	The list to search through.
`val:` `Any`	The value to search for.

# we have a one-column CSV, 
# containing the names of a subset of 
# reads we're interested in
names = csv('names.csv')

# if the read is on the list,
# include it in the subset
if names.contains({name = read.id}) =>
    read |> out!('subset.fq')

`distance:` `Num`

Calculates the global edit distance between two strings.

Parameter	Description
`s0:` `Str`	The first string.
`s1:` `Str`	The second string.

# equal to 1
d = distance('cat', 'bat')
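
For readers unfamiliar with edit distance, the sketch below computes it in Python with the classic dynamic-programming recurrence. It assumes unit costs for insertions, deletions and substitutions (Levenshtein distance); whether matchbox uses exactly this scoring is an assumption.

```python
def distance(s0: str, s1: str) -> int:
    """Global edit distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of s0 and s1[:j].
    prev = list(range(len(s1) + 1))
    for i, c0 in enumerate(s0, start=1):
        curr = [i]
        for j, c1 in enumerate(s1, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c0
                curr[j - 1] + 1,           # insert c1
                prev[j - 1] + (c0 != c1),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(distance("cat", "bat"))         # 1
print(distance("kitten", "sitting"))  # 3
```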

`to_str:` `Str`

Converts any value to a Str. Equivalent to formatting the value in a string literal (e.g. '{val}').

Parameter	Description
`v:` `Any`	The value to convert.

# convert the length of the sequence (a Num) into a Str
length_str = read.seq.len().to_str()
# calculate the length of the Str
number_of_digits = length_str.len()

`to_num:` `Num`

Parses a Str into a Num. When given a value that can't be parsed into a floating-point number, throws an error.

Parameter	Description
`s:` `Str`	The string to parse.

'100'.to_num()

`stdout!:` `Effect`

Prints any value directly to stdout.

Parameter	Description
`val:` `Any`	The value to print.

# print out each read's ID
stdout!(read.id)

`out!:` `Effect`

Prints any value to a file. Based on the filename, the format of the output will be inferred. When the filetype denotes FASTA, FASTQ or SAM format, the value provided must be a read containing the following fields:

Input format	Fields
FASTA (.fa, .fasta)	seq: Str, id: Str, desc: Str
FASTQ (.fq, .fastq)	seq: Str, id: Str, desc: Str, qual: Str
SAM (.sam, .bam)	id | qname: Str, flag: Num, rname: Str, pos: Num, mapq: Num, cigar: Str, rnext: Str, pnext: Num, tlen: Num, seq: Str, qual: Str, desc: Str

If the file is any other format, it is treated as a plain text file, and any values sent to it are simply printed directly and separated by newlines.

A file is only created when the first

Parameter	Description
`val:` `Any`	The value to print.
`filename:` `Str`	The name of the file to write to.

if read is [|10| rest:_] => rest.out!('trimmed.fq')
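
The extension-based format inference described above can be pictured with this Python sketch. The extension list comes from the table above; the function name `infer_format` and the case-sensitive matching are assumptions made for illustration only.

```python
from pathlib import Path

# Extensions taken from the format table above.
FORMATS = {
    ".fa": "FASTA", ".fasta": "FASTA",
    ".fq": "FASTQ", ".fastq": "FASTQ",
    ".sam": "SAM", ".bam": "SAM",
}


def infer_format(filename: str) -> str:
    """Map a filename to an output format; anything else is plain text."""
    # Assumption: matching is case-sensitive and uses only the final suffix.
    return FORMATS.get(Path(filename).suffix, "text")


print(infer_format("trimmed.fq"))  # FASTQ
print(infer_format("counts.txt"))  # text
```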

`count!:` `Effect`

Collects all of the values sent to count! across all of the reads. Tallies up the number of times each unique value is sent. At the end of execution, prints out a table of counts for each value.

Can be used for quantifying how many reads match certain criteria, or for counting occurrences of sequences such as barcodes. To generate multiple separate counts tables from a single pass, the name parameter can be used to identify different global counts variables. Each name will correspond to a fresh table.

Parameter	Description
`val:` `Any`	The value to be counted.
`name:` `Str` `= 'default'`	The name of the global counts table to store the value into. Multiple tables of counts can be generated, using different names.

# count each read towards the 'total'
count!('total')

# for each read with a length over 1000,
# count it towards the 'long' total
if read.seq.len() > 1000 => count!('long')

primer = AGCTAGCTGA

# all reads count to the 'total' in the 'read types' table
count!('total', name='read types')

# find the primer, and then count the 
# occurrences of unique 10-bp sequences following it
if read is [_ primer bc:|10| _] => {
    # reads with the primer are counted
    # in the 'read types' table
    count!('found primer', name='read types')

    # unique barcodes are tracked by the 'barcodes' table
    bc.seq.count!(name = 'barcodes')
}
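
Conceptually, named counts tables behave like a dictionary of counters. The Python sketch below models that idea; the function names and the report layout are hypothetical and do not reflect matchbox's actual output format.

```python
from collections import Counter, defaultdict

# One Counter per table name; a table is created the first time it is used.
tables = defaultdict(Counter)


def count(val, name: str = "default") -> None:
    """Tally one occurrence of val in the named table."""
    tables[name][val] += 1


def report() -> str:
    """Render every table, most frequent values first."""
    lines = []
    for name, table in tables.items():
        lines.append(f"[{name}]")
        for val, n in table.most_common():
            lines.append(f"{val}\t{n}")
    return "\n".join(lines)


# Simulate three reads carrying made-up barcodes.
for barcode in ["ACGT", "ACGT", "TTTT"]:
    count("total", name="read types")
    count(barcode, name="barcodes")

print(report())
```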

`average!:` `Effect`

Collects all of the numeric values sent to average! across all of the reads. To simultaneously calculate multiple averages, the name parameter can be used. At the end of execution, prints out the mean and variance.

To avoid having to store all the numeric values, variance is calculated using Welford's online algorithm.

Parameter	Description
`num:` `Num`	The number to contribute to the average.
`name:` `Str` `= 'default'`	The name of the global variable to store the average into. Multiple averages can be calculated at once, using different names.

# calculate the average length of all sequences
read.seq.len().average!()

# the same average, stored under the name 'sequence length'
read.seq.len().average!(name='sequence length')

primer = AGCTAGCTGA

# for reads that contain the primer,
# calculate the average position at which it occurs
if read is [before:_ primer _] =>
    before.seq.len().average!(name='primer position')
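
Welford's online algorithm, mentioned above, updates the mean and a running sum of squared deviations one value at a time, so no values need to be stored. A minimal Python sketch follows; the class name is hypothetical, and reporting the sample variance (dividing by n - 1) rather than the population variance is an assumption.

```python
class RunningStats:
    """Streaming mean and variance via Welford's online algorithm."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Assumption: sample variance; defined as 0.0 for fewer than 2 values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for length in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(length)

print(stats.mean, stats.variance())
```

Only three numbers are kept per named average, regardless of how many reads are processed.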

`min:` `Num`

Takes the minimum of a list of numbers. When supplied with an empty list, throws an error.

Parameter	Description
`l:` `[Num]`	The list to compute the minimum of.

# calculate the minimum quality score in a read
read.qual.to_qscores().min() |> average!()

`max:` `Num`

Takes the maximum of a list of numbers. When supplied with an empty list, throws an error.

Parameter	Description
`l:` `[Num]`	The list to compute the maximum of.

# calculate the maximum quality score in a read
read.qual.to_qscores().max() |> average!()

`mean:` `Num`

Takes the mean of a list of numbers. When supplied with an empty list, throws an error.

Parameter	Description
`l:` `[Num]`	The list to compute the mean of.

# calculate the mean quality score in a read
read.qual.to_qscores().mean() |> average!()

`to_qscores:` `[Num]`

Takes a string of Phred quality scores and converts each character to its corresponding quality score (0-40). When supplied with a character outside the expected range of Phred characters, throws an error.

Parameter	Description
`qual:` `Str`	The quality string to convert.

# calculate the mean quality score in a read
read.qual.to_qscores().mean() |> average!()
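
The character-to-score conversion can be sketched in Python as below. This assumes the common Phred+33 encoding (score = ASCII code minus 33), which is consistent with the 0-40 range stated above but is not confirmed by this page; the `offset` and `max_score` parameters are hypothetical.

```python
def to_qscores(qual: str, offset: int = 33, max_score: int = 40) -> list[int]:
    """Decode a quality string into a list of Phred scores."""
    scores = []
    for ch in qual:
        q = ord(ch) - offset
        if not 0 <= q <= max_score:
            raise ValueError(f"character {ch!r} is outside the Phred range")
        scores.append(q)
    return scores


print(to_qscores("!5I"))  # [0, 20, 40]
```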
