Contents:
- Why I write and who you are
- Haskell - The language of terms
- Haskell - The language of types
- Haskell - Typeclasses
- Interlude: how typing simplifies development
- Parser combinators: Fast streaming of huge and complex files
- Big data: Algorithms and performance
under construction
Up to this point, I’ve barely considered performance. Now that will change. Up to know we’ve been dealing with “small” data and “fast” algorithms. In these cases, performance is something that we don’t even need to think about. Now we will explore “medium” and “big” data. By “medium” data, I mean data that is small enough to fit into memory on a cluster. A human genome, for instance, is just a few gigabytes and can be slurped safely into RAM if the memory representation is efficient. By “big” data, I mean files such as raw read data that may stretch to terabytes and will require streaming.
Medium data
The most common data format in all of bioinformatics is FASTA. This format is used to store DNA or protein sequence. It has the format:
>sequence 1
GATTACA
GATCAT
>sequence 2
GATTACA
GATCAT
The header describes the sequences. There is not accepted standard for the header format. In some cases, the first word in the header is the identifier and the remaining words are a description. For example, this Arabidopsis genome I retrieved from the TAIR database, begins as so:
>1 CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2009-02-02
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
Here the identifier “1” indicates we are looking at chromosome number 1.
Here we will load an Arabidopsis genome FASTA file into memory and benchmark the performance.
Let’s start by defining a Python program for comparison:
def read_fasta_fast(filename):
with open(filename, "r") as f:
= f.read()
txt = []
entries for entry in txt.split(">")[1:]:
= entry.split("\n")
lines 0], "".join(lines[1:])))
entries.append((lines[return entries
Here we slurp the entire file into memory, split it by record (‘>’ characters), split the records into lines (‘’ characters), and then merge the lists of lines into header/sequence pairs.
This function has no error handling and assumes UNIX style line ending syntax. Here is a more robust Python implementation:
def read_fasta_safe(filename):
= []
entries = None
header = []
sequences with open(filename, "r") as f:
for line in f.readlines():
= line.strip()
line if line[0] == ">":
if header:
"".join(sequences)))
entries.append((header, = line
header = []
sequences else:
if header:
sequences.append(line)else:
raise ValueError("Entry is missing a header")
if header:
"".join(sequences)))
entries.append((header, return entries
A similar parsing algorithm could be written in C++ as follows:
struct FastaEntry {
std::string header;
std::string sequence;
};
std::vector<FastaEntry> read_fasta(const std::string& filename) {
std::vector<FastaEntry> entries;
std::ifstream file(filename);
if (!file.is_open()) {
std::cerr << "Error opening file: " << filename << std::endl;
return entries;
}
std::string line;
;
FastaEntry current_entrybool in_sequence = false;
while (std::getline(file, line)) {
if (line.empty()) continue;
if (line[0] == '>') {
if (in_sequence) {
.push_back(std::move(current_entry));
entries= FastaEntry();
current_entry }
= true;
in_sequence
.header = line.substr(1, line.size() - 1);
current_entry} else {
.sequence.append(line);
current_entry}
}
if (in_sequence) {
.push_back(std::move(current_entry));
entries}
return entries;
}
A FASTA file may be parsed in haskell with a simple one-liner:
import Data.List.Split (split, startsWith)
parseFasta :: String -> [(String, String)]
= map ((\(h: xs) -> (h, concat xs)) . lines) . split (startsWith ">") parseFasta
We are using two general splitting functions from the Data.List.Split
library. split
is a general function for breaking a list into sublists. Here
the list is a String
, which is just a list of characters. The first argument
to the split
function is a splitting “strategy”. The strategy we will use here
is to split the list by prefix (“>”) and remove the prefix from the generated
records. We represent this strategy with startsWith ">"
. string.
Next we map a composition of two functions over these records. The first
function breaks the record into lines. The second function, represented as a
lambda expression, interprets the first line as a header and the remaining lines
as sequence entries. The sequence entries are concatenated together with
concat
.
When I compare these three implementations – Python, C++, and Haskell – I get
the following times: 0.474s, 0.164s, and 15s. Wow, why is Haskell so slow here?
The problem is that we are using the Haskell String
type. It is a linked list
of characters. So for every 1 byte character in the data, an 8 byte pointer is
required to reference the next element. That is quite dreadful. Also there is no
way to randomly access internal elements without traversing the list to their
location.
Fortunately, there are much faster alternatives to the Haskell String
type. These are ByteString
and Text
types. The former is good for data where
characters are all expected to be a single byte (i.e., ASCII format). The later
supports muti-byte characters.
We can implement the String algorithm in ByteString
type as follows:
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T
parseFastaFast :: B.ByteString -> [(Header, Sequence)]
= map (\(h: xs) -> (h, B.concat xs)) . map C.lines . tail . C.split '>'
parseFastaFast
parseFastaFast :: T.Text -> [(T.Text, T.Text)]
= map (\(h:xs) -> (h, T.concat xs)) . map T.lines . tail . T.split ((==) '>') parseFastaFast
There are some minor differences in library implementations here, but these
parsing algorithms are essentially the same as for the String
type. For
splitting records, we here simply split on ‘>’ and then through away the
(usually empty) first element.
The bytestring function runs in 0.355s. Faster than Python but still only half the speed of the C++ function. The text function is much slower, running in 1.78s. The majority of this time, 1.478s is spent reading the file. In contrast, the bytstring data is loaded in only 0.21s.
The Haskell parser may also be written more robustly using Attoparsec (as described in the last post):
isSpace :: Word8 -> Bool
isSpace c = c <= 0x20
isNewline :: Word8 -> Bool
= c == 0x0a || c == 0x13
isNewline c
isStart :: Word8 -> Bool
= c == 0x3e
isStart c
parseFasta :: Parser [(B.ByteString, B.ByteString)]
= many' parseEntry <* endOfInput
parseFasta where
parseEntry :: Parser (Header, Sequence)
= do
parseEntry <- skip isStart
_ <- takeWhile1 (not . isNewline)
header <- skipWhile isSpace
_ sequence <- B.concat <$> many' seqLine
return (header, sequence)
seqLine :: Parser B.ByteString
= do
seqLine <- takeWhile1 (\c -> not (isStart c) && not (isSpace c))
line <- skipWhile isSpace
_ return line
This implementation runs in 0.540 seconds, about the same as the Python parser. It could further be easily generalized to take a header and sequence parser as arguments:
parseFasta :: Monoid seq => Parser header -> Parser seq -> Parser [(header, seq)]
= many' parseEntry <* endOfInput
parseFasta parseHeader parseSeq where
parseEntry :: Parser (Header, Sequence)
= do
parseEntry <- skip isStart
_ <- parseHeader
header <- skipWhile isSpace
_ sequence <- mconcat <$> sepBy (takeWhile1 (not . isNewLine)) parseSeq
return (header, sequence)
With this, we could define specialized parsers for different header formats and provide specialized parsers for different molecule types (e.g., DNA versus protein).
So far we’ve been reading the file into memory, copying blocks of memory from this original block into small pieces, and finally copying these small pieces into final strings. This is a lot of memory re-allocation. But it is possible to parse a file with a single block of allocated memory. Here is the C code:
// Define a linked list of header/sequence pairs
typedef struct FastaEntry {
char* header;
char* sequence;
struct FastaEntry* next;
} FastaEntry;
// Make a linked list of FASTA entries given a buffer that contains
// the full FASTA file with a length of size `size`.
* parse_fasta(char* buffer, const size_t size){
FastaEntry
// initial position in the file
size_t file_idx = 0;
// initial position in the fasta sequence
size_t fasta_idx = 0;
// initial fasta entry
* root = NULL;
FastaEntry* tip = root;
FastaEntry
// fastfoward to the first fasta entry
// usually this will be the first character
if(buffer[file_idx] != '>'){
// skip any initial comments or whitespace
while(buffer[file_idx] != '>'){
++;
file_idx}
++;
file_idx}
// walk through the buffer character by character
while(file_idx < size){
// start a new entry
if(buffer[file_idx] == '>'){
// allocate memory for a new entry
* next_tip = (FastaEntry*)malloc(sizeof(FastaEntry));
FastaEntryif (!next_tip) {
("Failed to allocate memory");
perrorreturn NULL;
}
->header = buffer + file_idx + 1;
next_tip
// link prior tip to new tip
->next = next_tip;
tip
// become the next tip
= next_tip;
tip
// go to last char before newline
while(file_idx < size && buffer[file_idx] != '\n'){
++;
file_idx}
if(file_idx < size){
// set newline to null
[file_idx] = '\0';
buffer}
// go to first char of sequence (assume '\n' line ends)
++;
file_idx// set sequence
->sequence = buffer + file_idx;
tip}
// else if we are in sequence ...
else {
= file_idx;
fasta_idx // loop to the next entry
while(file_idx < size && buffer[file_idx] != '>'){
// if not a newline, then swap set as next sequence character
if(buffer[file_idx] != '\n'){
[fasta_idx] = buffer[file_idx];
buffer++;
fasta_idx}
++;
file_idx}
// set newline before '>' to null
[fasta_idx] = '\0';
buffer}
}
// return the first element in the linked list
return root;
}