Parser combinators: Fast streaming of huge and complex files

by Zebulun Arendsee


under construction

Bioinformaticians tend to be wizards at regular expressions (regex). We can spin out expressions like

sed 's/^>\([^ ]*\).*\[\([^]]*\)\]$/>\1|\2/' 

on the fly before our morning coffee.

But this magic has its limits. Let’s explore the parsing of tree files in newick format. Here is an example tree:

A diagram of the tree specified by the newick code on the left


The leaf names are always preceded by a left parenthesis or a comma. I can latch onto the pattern and

$ read -r tree << EOF
$ echo $tree | grep -Eo "[(,][A-Z]" | sed 's/^.//'

The read clause stores the tree file in the variable tree. Which I then echo into grep. The -E flag turns on POSIX regular expression support and the -o flag tells grep to print each match on a new row.

But what if the leaf names aren’t just a single capitol letter? They should be anything but a “,”, “:”, “;”, “(”, or “)”. To support this, I can adapt the grep regex as follows:


That works, but leaf names may be quoted. When quoted, they may contain any character other than an un-escaped end quote.


Lovely. This works for double quotes. But what if there are singly quotes also?


Note the little switch in bash quotation. This is becoming a little hard to read, even for my old bioinformatician eyes.

What if we wanted to grep out the parent node of a particular leaf?

$ read -r tree << EOF
$ echo $tree | ???

This is not easy. While it may be possible with sufficient Perl regex magic, parsing recursive data structures such as this is generally not possible with regular expressions.

What alternative is there? Lucky for us, parsing is a highly developed field. I will introduce you to a type of parser called a recursive descent parser.

Haskell has several powerful parser libraries. I will use Attoparsec. This library is blazing fast. It sacrifices flexibility, but for the relatively simple needs of parsing bioinformatics formats, it should be fine.

Our parser will transform a newick string into a Tree data structure. We can then write functions of the tree to do whatever we wish. For the tree data structure, we define a new Tree algebraic data type:

You can find my full code for this demo on github.

data Tree
  = Leaf Text  -- a leaf and its name
  | Node  -- an internal node
      [(Tree, Maybe Double)]  -- node children and optional edge lengths
      (Maybe Text) -- optional node name

This data structure can easily be explored through structural recursion. For example:

import Data.Maybe (fromMaybe)

-- collect the labels for all leaves
leafLabels :: Tree -> [Text]
leafLabels (Leaf label) = [label]
leafLabels (Node children _) = concatMap (leafLabels . fst) children

-- collect all leaf/parent pairs
leafParents :: Tree -> [(Text, Text)]
leafParents (Leaf _) = []
leafParents (Node children maybeName) = concatMap (f . fst) children where
    f :: Tree -> [(Text, Text)]
    f (Leaf leafName) = [(leafName, fromMaybe "-" maybeName)]
    f tree = leafParents tree

I am introducing two new functions here concatMap from Prelude and fromMaybe from the core module Data.Maybe. There signatures are as follows:

concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
fromMaybe :: a -> Maybe a -> a

You should be able to guess what each function does from its signature. You might need to review the Foldable typeclass from the typeclass post.

Here is the skeleton of our parser program:

import Data.Attoparsec.Text
import Control.Applicative ((<|>))

newickParser :: Parser Tree
newickParser = do
    (tree, _) <- treeParser
    _ <- char ';'
    return tree

treeParser :: Parser (Tree, Maybe Double)
treeParser = leafParser <|> nodeParser

We first import the Attoparsec library for parsing Text data types. Previously, we’ve worked with String types and I mentioned that they are inefficient since the are implemented as linked lists of characters. The Text type is much more efficient and is used commonly in production code when performance is important. Text supports unicode characters, so our parser will handle Chinese characters, accents, emoticons, and whatever else might show up in our trees.

After the imports, I’ve defined four parser functions. The Parser type is the core monad used in Attoparsec. It wraps the objects we are parsing out of the text. Every parser we write will have Parser a as the return type.

The top-level parser newickParser handles a complete newick file. It first calls the treeParser parser to get a tree, then it matches against the required terminal semicolon of a newick file, and finally it returns the tree. Since Parser is a monad, we can use the do notation to implement parsers in a step-by-step iterative style.

treeParser is composed of two parsers: leafParser and nodeParser. <|> is the “alternative” operator. It indicates that if leafParser fails to match the text, then nodeParser is called instead.

This should give you a feel for the structure of a recursive descent parser. They are defined top down. Each parser is defined in terms of sub-parsers. All the way down to primitive matchers.

The Parser monad stores an index into the text that is being parsed. As parsers match patterns to text, the index increments, “consuming” input. When a parser fails, any update to index is not automatically undone. That is, a “undo” trace is not maintained. For this reason, you need to mindful of patterns that may partially match. In these cases, you may use the try function to store the original index.

Moving on leafParser and nodeParser may be defined as follows:

leafParser :: Parser (Tree, Maybe Double)
leafParser = do
  -- match a leaf label (e.g., `unicorn` or `'a weird bug'`)
  leaf <- nameParser

  -- match a branch length (e.g., `:4.5`)
  mayLength <- optional lengthParser

  -- return the leaf and the optional branch length
  return (Leaf leaf, mayLength)

nodeParser :: Parser (Tree, Maybe Double)
nodeParser = do
  -- match the opening parenthesis of a subtree
  _ <- char '('

  -- match a comma separated list of child trees
  children <- sepBy1 treeParser (char ',')

  -- match the closing parenthesis of a subtree
  _ <- char ')'

  -- match the node label and length
  mayName <- optional nameParser
  mayLength <- optional lengthParser

  -- return the node and optional length
  return (Node children mayName, mayLength)

nameParser and lengthParser will parse names (optionally quoted) and branch lengths. I’ll define them shortly. The optional function transforms a parser to return a Maybe value (where Nothing indicates a failure).

The lengthParser is simple:

lengthParser :: Parser Double
lengthParser = char ':' >> double

The nameParser is by far the most complex piece of the parser since it has to handle quotation:

nameParser :: Parser Text
nameParser = quotedName '"' <|> quotedName '\'' <|> unquotedName
  unquotedName :: Parser Text
  unquotedName =
    -- take a string of legal characters
    takeWhile1 (notInClass ",():;'\"")

  quotedName :: Char -> Parser Text
  quotedName quoteChar = do

    -- match opening quote
    _ <- char quoteChar

    -- match the name
    -- the name is comprised of a concatenation of strings of unquoted
    -- characters and single escaped characters
    name <- foldl1 (<>) <$> many1 (unescapedChar quoteChar <|> escapedChar)

    -- match closing quote
    _ <- char quoteChar
    return name

  unescapedChar :: Char -> Parser Text
  unescapedChar quoteChar =
    -- match text with characters other than foward slashes or quote characters
    takeWhile1 (\x -> x /= '\\' && x /= quoteChar)

  escapedChar :: Parser Text
  escapedChar = do
    -- match any character that is preceded by a forward slash
    c <- char '\\' >> anyChar

    -- convert the Char to Text and return
    return $ singleton c

The >> operator throws away the value stored in a monad while preserving the monadic state. The operator has the type:

(>>) :: Monad m => m a -> m b -> m b

In contrast, the monadic bind operator has the type:

(>>=) :: Monad m => m a -> (a -> m b) -> m b

See the repo on github for the full code. I have simplified the programming a bit here. In the GitHub repo, I also account for whitespace.

Writing a parser that builds a well-defined data structure, as we have done here, requires more work than just spinning out a fast, specialized regular expression and munging the results with sed and awk. But the extra effort makes the code more robust, extensible, readable, and modular.

built on 2024-06-03 01:05:44.888648215 UTC from file 2024-04-15-parsing