under construction
Bioinformaticians tend to be wizards at regular expressions (regex). We can spin out expressions like
sed 's/^>\([^ ]*\).*\[\([^]]*\)\]$/>\1|\2/'
on the fly before our morning coffee.
But this magic has its limits. Let’s explore the parsing of tree files in newick format. Here is an example tree:
A diagram of the tree specified by the newick code on the left
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
The leaf names are always preceded by a left parenthesis or a comma. I can latch onto the pattern and
$ read -r tree << EOF
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
EOF
$ echo $tree | grep -Eo "[(,][A-Z]" | sed 's/^.//'
A
B
C
D
G
The read clause stores the tree file in the variable tree. Which I then echo
into grep. The -E flag turns on POSIX regular expression support and the
-o flag tells grep to print each match on a new row.
But what if the leaf names aren’t just a single capitol letter? They should be
anything but a “,”, “:”, “;”, “(”, or “)”. To support this, I can adapt the
grep regex as follows:
"[(,][^,:;()]+"
That works, but leaf names may be quoted. When quoted, they may contain any character other than an un-escaped end quote.
'[(,]([^,:;()]+|"(\\"|[^,:;()"])+")'
Lovely. This works for double quotes. But what if there are singly quotes also?
'[(,]([^,:;()]+|"(\\"|[^,:;()"])+"|'"'(\\'|[^,:;()'])+')"
Note the little switch in bash quotation. This is becoming a little hard to read, even for my old bioinformatician eyes.
What if we wanted to grep out the parent node of a particular leaf?
$ read -r tree << EOF
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
EOF
$ echo $tree | ???
A F
B F
C E
D E
G H
This is not easy. While it may be possible with sufficient Perl regex magic, parsing recursive data structures such as this is generally not possible with regular expressions.
What alternative is there? Lucky for us, parsing is a highly developed field. I will introduce you to a type of parser called a recursive descent parser.
Haskell has several powerful parser libraries. I will use Attoparsec. This library is blazing fast. It sacrifices flexibility, but for the relatively simple needs of parsing bioinformatics formats, it should be fine.
Our parser will transform a newick string into a Tree data structure. We can
then write functions of the tree to do whatever we wish. For the tree data
structure, we define a new Tree algebraic data type:
You can find my full code for this demo on github.
data Tree
= Leaf Text -- a leaf and its name
| Node -- an internal node
[(Tree, Maybe Double)] -- node children and optional edge lengths
(Maybe Text) -- optional node nameThis data structure can easily be explored through structural recursion. For example:
import Data.Maybe (fromMaybe)
-- collect the labels for all leaves
leafLabels :: Tree -> [Text]
leafLabels (Leaf label) = [label]
leafLabels (Node children _) = concatMap (leafLabels . fst) children
-- collect all leaf/parent pairs
leafParents :: Tree -> [(Text, Text)]
leafParents (Leaf _) = []
leafParents (Node children maybeName) = concatMap (f . fst) children where
f :: Tree -> [(Text, Text)]
f (Leaf leafName) = [(leafName, fromMaybe "-" maybeName)]
f tree = leafParents treeI am introducing two new functions here concatMap from Prelude and fromMaybe
from the core module Data.Maybe. There signatures are as follows:
concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
fromMaybe :: a -> Maybe a -> aYou should be able to guess what each function does from its signature. You
might need to review the Foldable typeclass from the typeclass
post.
Here is the skeleton of our parser program:
import Data.Attoparsec.Text
import Control.Applicative ((<|>))
newickParser :: Parser Tree
newickParser = do
(tree, _) <- treeParser
_ <- char ';'
return tree
treeParser :: Parser (Tree, Maybe Double)
treeParser = leafParser <|> nodeParserWe first import the Attoparsec library for parsing Text data
types. Previously, we’ve worked with String types and I mentioned that they
are inefficient since the are implemented as linked lists of characters. The
Text type is much more efficient and is used commonly in production
code when performance is important. Text supports unicode characters, so our
parser will handle Chinese characters, accents, emoticons, and whatever else
might show up in our trees.
After the imports, I’ve defined four parser functions. The Parser type is the
core monad used in Attoparsec. It wraps the objects we are parsing out of the
text. Every parser we write will have Parser a as the return type.
The top-level parser newickParser handles a complete newick file. It first
calls the treeParser parser to get a tree, then it matches against the
required terminal semicolon of a newick file, and finally it returns the
tree. Since Parser is a monad, we can use the do notation to implement
parsers in a step-by-step iterative style.
treeParser is composed of two parsers: leafParser and nodeParser. <|> is
the “alternative” operator. It indicates that if leafParser fails to match the
text, then nodeParser is called instead.
This should give you a feel for the structure of a recursive descent parser. They are defined top down. Each parser is defined in terms of sub-parsers. All the way down to primitive matchers.
The Parser monad stores an index into the text that is being parsed. As
parsers match patterns to text, the index increments, “consuming” input. When a
parser fails, any update to index is not automatically undone. That is, a “undo”
trace is not maintained. For this reason, you need to mindful of patterns that
may partially match. In these cases, you may use the try function to store the
original index.
Moving on leafParser and nodeParser may be defined as follows:
leafParser :: Parser (Tree, Maybe Double)
leafParser = do
-- match a leaf label (e.g., `unicorn` or `'a weird bug'`)
leaf <- nameParser
-- match a branch length (e.g., `:4.5`)
mayLength <- optional lengthParser
-- return the leaf and the optional branch length
return (Leaf leaf, mayLength)
nodeParser :: Parser (Tree, Maybe Double)
nodeParser = do
-- match the opening parenthesis of a subtree
_ <- char '('
-- match a comma separated list of child trees
children <- sepBy1 treeParser (char ',')
-- match the closing parenthesis of a subtree
_ <- char ')'
-- match the node label and length
mayName <- optional nameParser
mayLength <- optional lengthParser
-- return the node and optional length
return (Node children mayName, mayLength)nameParser and lengthParser will parse names (optionally quoted) and branch
lengths. I’ll define them shortly. The optional function transforms a parser
to return a Maybe value (where Nothing indicates a failure).
The lengthParser is simple:
lengthParser :: Parser Double
lengthParser = char ':' >> doubleThe nameParser is by far the most complex piece of the parser since it has to
handle quotation:
nameParser :: Parser Text
nameParser = quotedName '"' <|> quotedName '\'' <|> unquotedName
where
unquotedName :: Parser Text
unquotedName =
-- take a string of legal characters
takeWhile1 (notInClass ",():;'\"")
quotedName :: Char -> Parser Text
quotedName quoteChar = do
-- match opening quote
_ <- char quoteChar
-- match the name
-- the name is comprised of a concatenation of strings of unquoted
-- characters and single escaped characters
name <- foldl1 (<>) <$> many1 (unescapedChar quoteChar <|> escapedChar)
-- match closing quote
_ <- char quoteChar
return name
unescapedChar :: Char -> Parser Text
unescapedChar quoteChar =
-- match text with characters other than foward slashes or quote characters
takeWhile1 (\x -> x /= '\\' && x /= quoteChar)
escapedChar :: Parser Text
escapedChar = do
-- match any character that is preceded by a forward slash
c <- char '\\' >> anyChar
-- convert the Char to Text and return
return $ singleton cThe >> operator throws away the value stored in a monad while preserving the
monadic state. The operator has the type:
(>>) :: Monad m => m a -> m b -> m bIn contrast, the monadic bind operator has the type:
(>>=) :: Monad m => m a -> (a -> m b) -> m bSee the repo on github for the full code. I have simplified the programming a bit here. In the GitHub repo, I also account for whitespace.
Writing a parser that builds a well-defined data structure, as we have done here, requires more work than just spinning out a fast, specialized regular expression and munging the results with sed and awk. But the extra effort makes the code more robust, extensible, readable, and modular.