Big data: Algorithms and performance

by Zebulun Arendsee

Contents:

under construction

Up to this point, I’ve barely considered performance. Now that will change. Up to know we’ve been dealing with “small” data and “fast” algorithms. In these cases, performance is something that we don’t even need to think about. Now we will explore “medium” and “big” data. By “medium” data, I mean data that is small enough to fit into memory on a cluster. A human genome, for instance, is just a few gigabytes and can be slurped safely into RAM if the memory representation is efficient. By “big” data, I mean files such as raw read data that may stretch to terabytes and will require streaming.

Medium data

The most common data format in all of bioinformatics is FASTA. This format is used to store DNA or protein sequence. It has the format:

>sequence 1
GATTACA
GATCAT
>sequence 2
GATTACA
GATCAT

The header describes the sequences. There is not accepted standard for the header format. In some cases, the first word in the header is the identifier and the remaining words are a description. For example, this Arabidopsis genome I retrieved from the TAIR database, begins as so:

>1 CHROMOSOME dumped from ADB: Feb/3/09 16:9; last updated: 2009-02-02
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT

Here the identifier “1” indicates we are looking at chromosome number 1.

Here we will load an Arabidopsis genome FASTA file into memory and benchmark the performance.

Let’s start by defining a Python program for comparison:

def read_fasta_fast(filename):
    with open(filename, "r") as f: 
        txt = f.read()
        entries = []
        for entry in txt.split(">")[1:]:
            lines = entry.split("\n")
            entries.append((lines[0], "".join(lines[1:])))
    return entries

Here we slurp the entire file into memory, split it by record (‘>’ characters), split the records into lines (‘’ characters), and then merge the lists of lines into header/sequence pairs.

This function has no error handling and assumes UNIX style line ending syntax. Here is a more robust Python implementation:

def read_fasta_safe(filename):
    entries = []
    header = None
    sequences = []
    with open(filename, "r") as f: 
        for line in f.readlines():
            line = line.strip()
            if line[0] == ">":
                if header:
                    entries.append((header, "".join(sequences)))
                header = line
                sequences = []
            else:
                if header:
                    sequences.append(line)
                else:
                    raise ValueError("Entry is missing a header")
    if header:
        entries.append((header, "".join(sequences)))
    return entries

A similar parsing algorithm could be written in C++ as follows:

struct FastaEntry {
    std::string header;
    std::string sequence;
};

std::vector<FastaEntry> read_fasta(const std::string& filename) {
    std::vector<FastaEntry> entries;
    std::ifstream file(filename);
    if (!file.is_open()) {
        std::cerr << "Error opening file: " << filename << std::endl;
        return entries;
    }

    std::string line;
    FastaEntry current_entry;
    bool in_sequence = false;

    while (std::getline(file, line)) {
        if (line.empty()) continue;

        if (line[0] == '>') {
            if (in_sequence) {
                entries.push_back(std::move(current_entry));
                current_entry = FastaEntry();
            }
            in_sequence = true;

            current_entry.header = line.substr(1, line.size() - 1);
        } else {
            current_entry.sequence.append(line);
        }
    }

    if (in_sequence) {
        entries.push_back(std::move(current_entry));
    }

    return entries;
}

A FASTA file may be parsed in haskell with a simple one-liner:

import Data.List.Split (split, startsWith)

parseFasta :: String -> [(String, String)]
parseFasta = map ((\(h: xs) -> (h, concat xs)) . lines) . split (startsWith ">")

We are using two general splitting functions from the Data.List.Split library. split is a general function for breaking a list into sublists. Here the list is a String, which is just a list of characters. The first argument to the split function is a splitting “strategy”. The strategy we will use here is to split the list by prefix (“>”) and remove the prefix from the generated records. We represent this strategy with startsWith ">". string.

Next we map a composition of two functions over these records. The first function breaks the record into lines. The second function, represented as a lambda expression, interprets the first line as a header and the remaining lines as sequence entries. The sequence entries are concatenated together with concat.

When I compare these three implementations – Python, C++, and Haskell – I get the following times: 0.474s, 0.164s, and 15s. Wow, why is Haskell so slow here? The problem is that we are using the Haskell String type. It is a linked list of characters. So for every 1 byte character in the data, an 8 byte pointer is required to reference the next element. That is quite dreadful. Also there is no way to randomly access internal elements without traversing the list to their location.

Fortunately, there are much faster alternatives to the Haskell String type. These are ByteString and Text types. The former is good for data where characters are all expected to be a single byte (i.e., ASCII format). The later supports muti-byte characters.

We can implement the String algorithm in ByteString type as follows:

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
import qualified Data.Text as T

parseFastaFast :: B.ByteString -> [(Header, Sequence)]
parseFastaFast = map (\(h: xs) -> (h, B.concat xs)) . map C.lines . tail . C.split '>'

parseFastaFast :: T.Text -> [(T.Text, T.Text)]
parseFastaFast = map (\(h:xs) -> (h, T.concat xs)) . map T.lines . tail . T.split ((==) '>')

There are some minor differences in library implementations here, but these parsing algorithms are essentially the same as for the String type. For splitting records, we here simply split on ‘>’ and then through away the (usually empty) first element.

The bytestring function runs in 0.355s. Faster than Python but still only half the speed of the C++ function. The text function is much slower, running in 1.78s. The majority of this time, 1.478s is spent reading the file. In contrast, the bytstring data is loaded in only 0.21s.

The Haskell parser may also be written more robustly using Attoparsec (as described in the last post):

isSpace :: Word8 -> Bool
isSpace c = c <= 0x20

isNewline :: Word8 -> Bool
isNewline c = c == 0x0a || c == 0x13

isStart :: Word8 -> Bool
isStart c = c == 0x3e

parseFasta :: Parser [(B.ByteString, B.ByteString)]
parseFasta = many' parseEntry <* endOfInput
  where
  parseEntry :: Parser (Header, Sequence)
  parseEntry = do
    _ <- skip isStart
    header <- takeWhile1 (not . isNewline)
    _ <- skipWhile isSpace
    sequence <- B.concat <$> many' seqLine
    return (header, sequence)

  seqLine :: Parser B.ByteString
  seqLine = do
    line <- takeWhile1 (\c -> not (isStart c) && not (isSpace c))
    _ <- skipWhile isSpace
    return line

This implementation runs in 0.540 seconds, about the same as the Python parser. It could further be easily generalized to take a header and sequence parser as arguments:

parseFasta :: Monoid seq => Parser header -> Parser seq -> Parser [(header, seq)]
parseFasta parseHeader parseSeq = many' parseEntry <* endOfInput
  where
  parseEntry :: Parser (Header, Sequence)
  parseEntry = do
    _ <- skip isStart
    header <- parseHeader
    _ <- skipWhile isSpace
    sequence <- mconcat <$> sepBy (takeWhile1 (not . isNewLine)) parseSeq
    return (header, sequence)

With this, we could define specialized parsers for different header formats and provide specialized parsers for different molecule types (e.g., DNA versus protein).

So far we’ve been reading the file into memory, copying blocks of memory from this original block into small pieces, and finally copying these small pieces into final strings. This is a lot of memory re-allocation. But it is possible to parse a file with a single block of allocated memory. Here is the C code:

// Define a linked list of header/sequence pairs
typedef struct FastaEntry {
    char* header;
    char* sequence;
    struct FastaEntry* next;
}  FastaEntry;

// Make a linked list of FASTA entries given a buffer that contains
// the full FASTA file with a length of size `size`.
FastaEntry* parse_fasta(char* buffer, const size_t size){

  // initial position in the file
  size_t file_idx = 0;

  // initial position in the fasta sequence
  size_t fasta_idx = 0;

  // initial fasta entry
  FastaEntry* root = NULL;
  FastaEntry* tip = root;

  // fastfoward to the first fasta entry
  // usually this will be the first character
  if(buffer[file_idx] != '>'){
    // skip any initial comments or whitespace
    while(buffer[file_idx] != '>'){
      file_idx++;
    }
    file_idx++;
  }

  // walk through the buffer character by character
  while(file_idx < size){

    // start a new entry
    if(buffer[file_idx] == '>'){
      // allocate memory for a new entry
      FastaEntry* next_tip = (FastaEntry*)malloc(sizeof(FastaEntry));
      if (!next_tip) {
          perror("Failed to allocate memory");
          return NULL;
      }
      next_tip->header = buffer + file_idx + 1;

      // link prior tip to new tip
      tip->next = next_tip;

      // become the next tip
      tip = next_tip;

      // go to last char before newline
      while(file_idx < size && buffer[file_idx] != '\n'){
        file_idx++;
      }
      if(file_idx < size){
        // set newline to null
        buffer[file_idx] = '\0';
      }
      // go to first char of sequence (assume '\n' line ends)
      file_idx++;
      // set sequence
      tip->sequence = buffer + file_idx;
    }
    // else if we are in sequence ...
    else {
      fasta_idx = file_idx;
      // loop to the next entry
      while(file_idx < size && buffer[file_idx] != '>'){
        // if not a newline, then swap set as next sequence character
        if(buffer[file_idx] != '\n'){
          buffer[fasta_idx] = buffer[file_idx];
          fasta_idx++; 
        }
        file_idx++;
      }
      // set newline before '>' to null
      buffer[fasta_idx] = '\0';
    }
  }

  // return the first element in the linked list
  return root;
}
built on 2024-06-03 01:05:44.888648215 UTC from file 2024-05-30-performance