View the project on GitHub. jakob-schuster/matchbox
Functions are used to manipulate and transform data. Most of the time, users will be applying built-in matchbox functions. This list includes every built-in function with examples.
len:
Num
Calculates the length of a string.
Parameter | Description |
---|---|
s: Str | The string to calculate the length of. |
# get the average read length
read.seq.len().average!()
slice:
Str
Takes a slice of a string. Inclusive of start position, exclusive of end position.
Parameter | Description |
---|---|
s: Str | The string to slice. |
start: Num | The start position. |
end: Num | The end position. |
# equal to 'ell' (positions 1, 2 and 3; position 4 is excluded)
sliced_string = slice('hello', 1, 4)
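For readers who know Python, the inclusive-start, exclusive-end convention behaves like Python's slice notation. The sketch below is only an analogy written in Python, not matchbox code. It assumes positions count from 0, as the find_first entry describes, and `slice_str` is a hypothetical helper name.

```python
def slice_str(s: str, start: int, end: int) -> str:
    """Hypothetical stand-in: keep positions start .. end-1 of s (0-based)."""
    return s[start:end]


# The end position is excluded, so the result has end - start characters.
print(slice_str("hello", 1, 4))  # ell
print(slice_str("hello", 1, 3))  # el
```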
tag:
Read
Copies a read and appends a string to the end of its description line. By default, a space separates the existing description line from the new tag.
Parameter | Description |
---|---|
read: Read | The read to tag. |
tag: Str | The string to append to the read's description line. |
prefix: Str = ' ' | A string inserted between the existing description line and the new tag. |
# trim off the first 10 bases,
# and tag the read with their sequence
if read is [bc:|10| rest:_] =>
rest.tag('barcode={bc.seq}')
.out!('out.fa')
translate:
Str
Translates a string from nucleotide to protein sequence. Naively assumes that you've given it a string representing a valid sequence of nucleotides. When the input string contains an invalid codon (including if the input string contains characters aside from A, C, T and G), a '?' character is produced.
Parameter | Description |
---|---|
seq: Str | The sequence to translate. |
seq = AGCCCTCCAGGACAGGCTGCATCAGAAGAG
# will translate to SPPGQAASEE
prot = translate(seq)
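The codon-by-codon behaviour described above can be sketched in Python. This is an illustration, not matchbox's implementation: it assumes the standard genetic code, upper-case input, stop codons written as '*', and a trailing partial codon being ignored. None of those details are stated in this reference.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons become '?' (sketch only)."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "?"))
    return "".join(protein)


print(translate("AGCCCTCCAGGACAGGCTGCATCAGAAGAG"))  # SPPGQAASEE
print(translate("ATGNNNTAA"))  # M?*
```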
str_concat:
Str
Concatenates two strings.
Parameter | Description |
---|---|
s0: Str | The first string. |
s1: Str | The second string. |
s0 = 'hello'
s1 = 'world'
# all equivalent
a = str_concat(s0, s1)
b = '{s0}{s1}'
c = 'helloworld'
concat:
Read
Concatenates two reads together.
Parameter | Description |
---|---|
r1: Read | The first read. |
r2: Read | The second read. |
# locate a known sequence, and cut it out -
# concatenating everything before the sequence
# onto everything after it
if read is [before:_ AGCTAGTCG after:_] =>
before
.concat(after)
.out!('primer_excluded.fq')
# glue together the forward read1 and the reverse read2
read.r1.concat(-read.r2).out!('glued.fq')
csv:
[Record]
Opens a CSV and produces a list of records. The field names of each record correspond to the header names of the CSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire CSV is loaded into memory.
Parameter | Description |
---|---|
filename: Str | The CSV file to open. |
primer = AGCTAGTCGATGC
bcs = csv('my_barcodes.csv')
if read is [_ primer bc.sequence _] for bc in bcs =>
read.tag('barcode={bc.barcode_name}')
.out!('demultiplexed.fq')
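The shape of the loaded data, one record per row keyed by the header names, resembles what Python's `csv.DictReader` produces. The sketch below is an analogy only. The column names `barcode_name` and `sequence` follow the example above, while the rows themselves and the helper name `load_records` are made up for illustration.

```python
import csv
import io


def load_records(handle) -> list:
    """Read every row into memory as a dict keyed by the CSV header names."""
    return list(csv.DictReader(handle))


# Hypothetical file contents, held in memory so the example is self-contained.
example = io.StringIO("barcode_name,sequence\nbc01,ACGTACGT\nbc02,TTGACCAA\n")
barcodes = load_records(example)

for bc in barcodes:
    print(bc["barcode_name"], bc["sequence"])
```

Because the whole file is read up front, very large barcode lists cost memory in proportion to their size.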
tsv:
[Record]
Opens a TSV and produces a list of records. The field names of each record correspond to the header names of the TSV, and the values correspond to the values found on each row. This processing occurs once, at the start of execution. The entire TSV is loaded into memory.
Parameter | Description |
---|---|
filename: Str | The TSV file to open. |
primer = AGCTAGTCGATGC
bcs = tsv('my_barcodes.tsv')
if read is [_ primer bc.sequence _] for bc in bcs =>
read.tag('barcode={bc.barcode_name}')
.out!('demultiplexed.fq')
fasta:
[Read]
Opens a FASTA and produces a list of reads, each one containing seq, id and desc fields. This processing occurs once, at the start of execution. The entire FASTA is loaded into memory.
Parameter | Description |
---|---|
filename: Str | The FASTA file to open. |
primer = AGCTAGTCGATGC
bcs = fasta('my_barcodes.fa')
if read is [_ primer bc.seq _] for bc in bcs =>
read.tag('barcode={bc.id}')
.out!('demultiplexed.fq')
find_first:
Num
Searches for a substring within a string. If the substring is present, returns the first 0-based position within the string where the substring could be found. If the substring is not present, returns -1.
Parameter | Description |
---|---|
s0: Str | The string to search through. |
s1: Str | The substring to search for in s0. |
# get the protein translation of the read
protein = read.seq.translate()
# find the first Cysteine amino acid in the sequence
location = protein.find_first('C')
# if a Cysteine could be found, trim the read,
# keeping everything from the first Cys codon onward
if location != -1 =>
if read is [|location*3| rest:_] =>
rest.out!('trimmed.fq')
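The return convention (a 0-based position, or -1 when absent) matches Python's `str.find`, which makes the codon arithmetic in the example easy to check. The snippet is a Python analogy with a made-up protein string and a hypothetical helper name, not matchbox code.

```python
def first_codon_offset(protein: str, residue: str) -> int:
    """Nucleotide offset of the first codon encoding `residue`, or -1 if absent."""
    position = protein.find(residue)
    return -1 if position == -1 else position * 3


protein = "SPCGQC"  # hypothetical translation
print(protein.find("C"))                 # 2
print(first_codon_offset(protein, "C"))  # 6
print(first_codon_offset(protein, "W"))  # -1
```

Multiplying by 3 converts an amino-acid position into the offset of the codon's first base, so trimming that many bases leaves a read that starts at the codon itself.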
find_last:
Num
Searches for a substring within a string. If the substring is present, returns the last 0-based position within the string where the substring could be found. If the substring is not present, returns -1.
Parameter | Description |
---|---|
s0: Str | The string to search through. |
s1: Str | The substring to search for in s0. |
# get the protein translation of the read
protein = read.seq.translate()
# find the last Cysteine amino acid in the sequence
location = protein.find_last('C')
# if a Cysteine could be found, trim the read,
# only keeping everything before the last Cys codon
if location != -1 =>
if read is [rest:|location*3| _] =>
rest.out!('trimmed.fq')
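Python's `str.rfind` follows the same convention for the last occurrence, so the intended trimming (keep only the bases before the last Cys codon) can be sketched as below. The sequences and the helper name are hypothetical, and this is an analogy rather than matchbox's own logic.

```python
def bases_before_last(nucleotides: str, protein: str, residue: str) -> str:
    """Keep the nucleotides that precede the last codon encoding `residue`."""
    position = protein.rfind(residue)
    if position == -1:
        return nucleotides  # residue absent: nothing to trim in this sketch
    return nucleotides[:position * 3]


# Hypothetical read: codons AGC CCT TGT GGA CAG TGC translate to S P C G Q C.
nucleotides = "AGCCCTTGTGGACAGTGC"
protein = "SPCGQC"
print(protein.rfind("C"))                              # 5
print(bases_before_last(nucleotides, protein, "C"))    # AGCCCTTGTGGACAG
```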
to_upper:
Str
Converts a string to upper-case. Non-alphabetic characters are unaffected.
Parameter | Description |
---|---|
s: Str | The string to convert to upper-case. |
loud_hello = 'hello'.to_upper()
to_lower:
Str
Converts a string to lower-case. Non-alphabetic characters are unaffected.
Parameter | Description |
---|---|
s: Str | The string to convert to lower-case. |
quiet_hello = 'HELLO'.to_lower()
describe:
Str
Searches for a set of sequences within a read's seq field, and returns the pattern which most precisely describes the read, as a Str. Very useful for exploring the arrangement of known primers and static regions in your data!
Each search term is searched for within the read. If reverse_complement is true, the reverse-complement sequences are also searched. Edit distance is allowed in proportion to the length of each search term sequence and the error parameter. Any matches are then concatenated together to produce a pattern string which describes the read in terms of the searched sequences.
Parameter | Description |
---|---|
read: Read | The read to describe. |
search_terms: Record | The set of terms to search for. Each field of the record must have a Str value, which represents the sequence to search for. Each field's name is used as a label for the sequence in the pattern. |
reverse_complement: Bool = false | Whether to additionally search for the reverse complement of each search term. |
error: Num = 0 | Error rate proportion to allow when searching for each search term. |
read.describe(
{ tso = AGCATGCTGATG, rtp = GATCGTACGTGTTG },
reverse_complement = true
) |> count!()
contains:
Bool
Checks whether a value is present in a list.
Parameter | Description |
---|---|
list: [Any] | The list to search through. |
val: Any | The value to search for. |
# we have a one-column CSV,
# containing the names of a subset of
# reads we're interested in
names = csv('names.csv')
# if the read is on the list,
# include it in the subset
if names.contains({name = read.id}) =>
read |> out!('subset.fq')
distance:
Num
Calculates the global edit distance between two strings.
Parameter | Description |
---|---|
s0: Str | The first string. |
s1: Str | The second string. |
# equal to 1
d = distance('cat', 'bat')
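As a point of reference, global edit distance is commonly computed with the Levenshtein dynamic-programming recurrence. The Python sketch below assumes unit costs for substitutions, insertions and deletions; whether matchbox uses exactly these costs is not stated here, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("cat", "bat"))         # 1
print(edit_distance("kitten", "sitting"))  # 3
```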
to_str:
Str
Converts any value to a Str. Equivalent to formatting the value in a string literal (e.g. '{val}').
Parameter | Description |
---|---|
v: Any | The value to convert. |
# convert the length of the sequence (a Num) into a Str
length_str = read.seq.len().to_str()
# calculate the length of the Str
number_of_digits = length_str.len()
to_num:
Num
Parses a Str into a Num. When given a value that can't be parsed into a floating-point number, throws an error.
Parameter | Description |
---|---|
s: Str | The string to parse. |
'100'.to_num()
stdout!:
Effect
Prints any value directly to stdout.
Parameter | Description |
---|---|
val: Any | The value to print. |
# print out each read's ID
stdout!(read.id)
out!:
Effect
Prints any value to a file. Based on the filename, the format of the output will be inferred. When the filetype denotes FASTA, FASTQ or SAM format, the value provided must be a read containing the following fields:
Input format | Fields |
---|---|
FASTA (.fa, .fasta) | |
FASTQ (.fq, .fastq) | |
SAM (.sam, .bam) | |
If the file is any other format, it is treated as a plain text file, and any values sent to it are simply printed directly and separated by newlines.
A file is only created when the first
Parameter | Description |
---|---|
val: Any | The value to print. |
filename: Str | The name of the file to write to. |
if read is [|10| rest:_] => rest.out!('trimmed.fq')
count!:
Effect
Collects all of the values sent to count! across all of the reads. Tallies up the number of times each unique value is sent. At the end of execution, prints out a table of counts for each value.
Can be used for quantifying how many reads match certain criteria, or for counting occurrences of sequences such as barcodes. To generate multiple separate counts tables from a single pass, the name parameter can be used to identify different global counts variables. Each name will correspond to a fresh table.
Parameter | Description |
---|---|
val: Any | The value to be counted. |
name: Str = 'default' | The name of the global counts matrix to store the value into. Multiple tables of counts can be generated, using different names. |
# count each read towards the 'total'
count!('total')
# for each read with a length over 1000,
# count it towards the 'long' total
if read.seq.len() > 1000 => count!('long')
primer = AGCTAGCTGA
# all reads count to the 'total' in the 'read types' table
count!('total', name='read types')
# find the primer, and then count the
# occurrences of unique 10-bp sequences following it
if read is [_ primer bc:|10| _] => {
# reads with the primer are counted
# in the 'read types' table
count!('found primer', name='read types')
# unique barcodes are tracked by the 'barcodes' table
bc.seq.count!(name = 'barcodes')
}
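The named-table behaviour can be pictured as one tally per name. The Python sketch below is an analogy, not matchbox's implementation. The reads are made up, `count` is a hypothetical stand-in for count!, and the primer is located by exact string search, which is an assumption.

```python
from collections import Counter, defaultdict

# One Counter per table name; a table appears the first time its name is used.
tables = defaultdict(Counter)


def count(value, name: str = "default") -> None:
    """Tally one occurrence of `value` in the table called `name`."""
    tables[name][value] += 1


PRIMER = "AGCTAGCTGA"
reads = [  # hypothetical sequences
    "TT" + PRIMER + "AAAAACCCCC" + "GG",
    PRIMER + "AAAAACCCCC",
    "ACGTACGTACGT",
]

for seq in reads:
    count("total", name="read types")
    start = seq.find(PRIMER)
    barcode_start = start + len(PRIMER)
    if start != -1 and len(seq) >= barcode_start + 10:
        count("found primer", name="read types")
        count(seq[barcode_start:barcode_start + 10], name="barcodes")

for name, table in tables.items():
    print(name, dict(table))
```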
average!:
Effect
Collects all of the numeric values sent to average! across all of the reads. To simultaneously calculate multiple averages, the name parameter can be used. At the end of execution, prints out the mean and variance.
To avoid having to store all the numeric values, variance is calculated using Welford's online algorithm.
Parameter | Description |
---|---|
num: Num | The number to contribute to the average. |
name: Str = 'default' | The name of the global variable to store the average into. Multiple averages can be calculated at once, using different names. |
# calculate the average length of all sequences
read.seq.len().average!()
# the same calculation, stored under the name 'sequence length'
read.seq.len().average!(name='sequence length')
primer = AGCTAGCTGA
# for reads that contain the primer,
# calculate the average position at which it occurs
if read is [before:_ primer _] =>
before.seq.len().average!(name='primer position')
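Welford's online algorithm, mentioned above, updates the mean and a running sum of squared deviations one value at a time, so no values need to be stored. A minimal Python sketch follows. It is not matchbox's code, the class name is hypothetical, and reporting the sample variance (dividing by n - 1) rather than the population variance is an assumption.

```python
class RunningStats:
    """Welford's online algorithm for mean and variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times the deviation from the new mean.
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Assumption: sample variance; returns 0.0 with fewer than two values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for length in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical read lengths
    stats.add(length)

print(stats.mean)        # 5.0
print(stats.variance())  # approximately 4.571
```

Each update costs constant time and memory, which is why this approach suits a single pass over many reads.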