Before we continue …
A Word from the Sponsor!
They are just a versatile abstraction, like map or fold.
Parsers
A parser is a function that
- converts unstructured data (e.g.
String, array ofByte,…) - into structured data (e.g. JSON object, Markdown, Video…)
Every large software system contains a Parser
| System | Parses |
|---|---|
| Shell Scripts | Command-line options |
| Browsers | HTML |
| Games | Level descriptors |
| Routers | Packets |
| Netflix | Video |
| Spotify | Audio, Playlists… |
How to build Parsers?
Two standard methods
Regular Expressions
- Doesn’t really scale beyond simple things
- No nesting, recursion
Parser Generators
- Specify grammar via rules
- Tools like
yacc,bison,antlr,happy
- convert grammar into executable function
Grammars Don’t Compose!
If we have two kinds of structured objects Thingy and Whatsit.
To parse sequences of Thingy and Whatsit we must duplicate the rules
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;No nice way to reuse the sub-parsers for Whatsit and Thingy :-(
A New Hope: Parsers as Functions
Lets think of parsers directly as functions that
- Take as input a
String - Convert a part of the input into a
StructuredObject - Return the remainder unconsumed to be parsed later
A Parser a
- Converts a prefix of a
String - Into a structured object of type
aand - Returns the suffix
Stringunchanged
Parsers Can Produce Many Results
Sometimes we want to parse a String like
into a list of possible results
So we generalize the Parser type to
EXERCISE
Given the definition
Implement a function
QUIZ
Given the definition
Which of the following is a valid Parser Char
- that returns the first
Charfrom a string (if one exists)
-- A
oneChar = P (\cs -> head cs)
-- B
oneChar = P (\cs -> case cs of
[] -> [('', [])
c:cs -> (c, cs))
-- C
oneChar = P (\cs -> (head cs, tail cs))
-- D
oneChar = P (\cs -> [(head cs, tail cs)])
-- E
oneChar = P (\cs -> case cs of
[] -> []
cs -> [(head cs, tail cs)])
Lets Run Our First Parser!
>>> runParser oneChar "hey!"
[('h', "ey")]
>>> runParser oneChar "yippee"
[('y', "ippee")]
>>> runParser oneChar ""
[]Failure to parse means result is an empty list!
EXERCISE
Your turn: Write a parser to grab first two chars
When you are done, we should get
QUIZ
Ok, so recall
twoChar :: Parser (Char, Char)
twoChar = P (\cs -> case cs of
c1:c2:cs' -> [((c1, c2), cs')]
_ -> [])Suppose we had some foo such that twoChar' was equivalent to twoChar
What must the type of foo be?
A. Parser (Char, Char)
B. Parser Char -> Parser (Char, Char)
C. Parser a -> Parser a -> Parser (a, a)
D. Parser a -> Parser b -> Parser (a, b)
E. Parser a -> Parser (a, a)
EXERCISE: A forEach Loop
Lets write a function
such that we get the following behavior
>>> forEach [] (\i -> [i, i + 1])
[]
>>> forEach [10,20,30] (\i -> [show i, show (i+1)])
["10", "11", "20", "21", "30", "31"]
QUIZ
What does quiz evaluate to?
A. [10,20,30,0,1,2]
B. [10,0,20,1,30,2]
C. [[10,11,12], [20,21,22] [30,31,32]]
D. [10,11,12,20,21,22,30,31,32]
E. [32]
A pairP Combinator
Lets implement the above as pairP
forEach :: [a] -> (a -> [b]) -> [b]
forEach xs f = concatMap f xs
pairP :: Parser a -> Parser b -> Parser (a, b)
pairP aP bP = P (\s -> forEach (runParser aP s) (\(a, s') ->
forEach (runParser bP s') (\(b, s'') ->
((a, b), s'')
)
) Now we can write
QUIZ
What does quiz evaluate to?
A. [((h,h), "")]
B. [(h, "")]
C. [("", "")]
D. []
E. Run-time exception
Does the Parser a type remind you of something?
Lets implement the above as pairP
Parser is a Monad!
Like a state transformer, Parser is a monad!
We need to implement two functions
QUIZ
Which of the following is a valid implementation of returnP
data Parser a = P (String -> [(a, String)])
returnP :: a -> Parser a
returnP a = P (\s -> []) -- A
returnP a = P (\s -> [(a, s)]) -- B
returnP a = P (\s -> (a, s)) -- C
returnP a = P (\s -> [(a, "")]) -- D
returnP a = P (\s -> [(s, a)]) -- E
HINT: return a should just
- “produce” the parse result
aand - leave the string unconsumed.
Bind
Next, lets implement bindP
- we almost saw it as
pairP
bindP :: Parser a -> (a -> Parser b) -> Parser b
bindP aP fbP = P (\s ->
forEach (runParser aP s) (\(a, s') ->
forEach (runParser (fbP a) s') (\(b, s'') ->
[(b, s'')]
)
)
)The function
- Builds the
avalues out ofaP(usingrunParser) - Builds the
bvalues by callingfbP aon the remainder strings' - Returns
bvalues and the remainder strings''

The Parser Monad
We can now make Parser an instance of Monad
And now, let the wild rumpus start!
Parser Combinators
Lets write lots of high-level operators to combine parsers!
Here’s a cleaned up pairP
Failures are the Pillars of Success!
Surprisingly useful, always fails
- i.e. returns
[]no successful parses
QUIZ
Consider the parser
What is the value of
quiz1 = runParser (satP (\c -> c == 'h')) "hellow"
quiz2 = runParser (satP (\c -> c == 'h')) "yellow"quiz1 |
quiz2 |
|
|---|---|---|
| A | [] |
[] |
| B | [('h', "ellow")] |
[('y', "ellow")] |
| C | [('h', "ellow")] |
[] |
| D | [] |
[('y', "ellow")] |
Parsing Alphabets and Numerics
We can now use satP to write
-- parse ONLY the Char c
char :: Parser Char
char c = satP (\c' -> c == c')
-- parse ANY ALPHABET
alphaCharP :: Parser Char
alphaCharP = satP isAlpha
-- parse ANY NUMERIC DIGIT
digitChar :: Parser Char
digitChar = satP isDigit
QUIZ
We can parse a single Int digit
digitInt :: Parser Int
digitInt = do
c <- digitChar -- parse the Char c
return (read [c]) -- convert Char to IntWhat is the result of
quiz1 |
quiz2 |
|
|---|---|---|
| A | [] |
[] |
| B | [('9', "2")] |
[('c', "at")] |
| C | [(9, "2")] |
[] |
| D | [] |
[('c', "at")] |
EXERCISE
Write a function
when you are done, we should get the following behavior
>>> dogeP = strP "doge"
>>> runParser dogeP "dogerel"
[("doge", "rel")]
>>> runParser dogeP "doggoneit"
[]
A Choice Combinator
Lets write a combinator orElse p1 p2 such that
- returns the results of
p1
or, else if those are empty
- returns the results of
p2
:: Parser a -> Parser a -> Parser a
orElse p1 p2 = -- produce results of `p1` if non-empty
-- OR-ELSE results of `p2`e.g. orElseP lets us build a parser that produces an alphabet OR a numeric character
Which should produce
>>> runParser alphaNumChar "cat"
[('c', "at")]
>>> runParser alphaNumChar "2cat"
[('2', "cat")]
>>> runParser alphaNumChar "230"
[('2', "30")]
QUIZ
orElse :: Parser a -> Parser a -> Parser a
orElse p1 p2 = -- produce results of `p1` if non-empty
-- OR-ELSE results of `p2`Which of the following implements orElse?
-- a
orElse p1 p2 = do
r1s <- p1
r2s <- p2
return (r1s ++ r2s)
-- b
orElse p1 p2 = do
r1s <- p1
case r1s of
[] -> p2
_ -> return r1s
-- c
orElse p1 p2 = P (\cs ->
runParser p1 cs ++ runParser p2 cs
)
-- d
orElse p1 p2 = P (\cs ->
case runParser p1 cs of
[] -> runParser p2 cs
r1s -> r1s
)
An “Operator” for orElse
It will be convenient to have a short “operator” for orElse
A Simple Expression Parser
Now, lets write a tiny calculator!
-- 1. First, parse the operator
intOp :: Parser (Int -> Int -> Int)
intOp = plus <|> minus <|> times <|> divide
where
plus = do { _ <- char '+'; return (+) }
minus = do { _ <- char '-'; return (-) }
times = do { _ <- char '*'; return (*) }
divide = do { _ <- char '/'; return div }
-- 2. Now parse the expression!
calc :: Parser Int
calc = do x <- digitInt
op <- intOp
y <- digitInt
return (x `op` y)When calc is run, it will both parse and calculate
>>> runParser calc "8/2"
[(4,"")]
>>> runParser calc "8+2cat"
[(10,"cat")]
>>> runParser calc "8/2cat"
[(4,"cat")]
>>> runParser calc "8-2cat"
[(6,"cat")]
>>> runParser calc "8*2cat"
[(16,"cat")]
QUIZ
What will quiz evaluate to?
calc :: Parser Int
calc = do x <- digitInt
op <- intOp
y <- digitInt
return (x `op` y)
quiz = runParser calc "99bottles"A. Type error
B. []
C. [(9, "9bottles")]
D. [(99, "bottles")]
E. Run-time exception
Next: Recursive Parsing
Its cool to parse individual Char …
… but way more interesting to parse recursive structures!
EXERCISE: A “Recursive” String Parser
The parser string s parses exactly the string s - fails otherwise
>>> runParser (string "mic") "mickeyMouse"
[("mic","keyMouse")]
>>> runParser (string "mic") "donald duck"
[]Lets fill in an implementation
Which library function will eliminate the recursion from string?
QUIZ: Parsing Many Times
Often we want to repeat parsing some object
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
manyP p = m0 <|> m1
where
m0 = return []
m1 = do { x <- p; xs <- manyP p; return (x:xs) } Recall digitChar :: Parser Char returned a single numeric Char
What will quiz evaluate to?
A. [("" , "1234horse")]
B. [("1" , "234horse")]
C. [("1", "23horse"), ("12", "3horse"), ("123", "horse )]
D. [("123", "horse")]
E. []
Lets fix manyP!
Run p first and only return [] if it fails …
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
manyP p = m1 <|> m0
where
m0 = return []
m1 = do { x <- p; xs <- manyP p; return (x:xs) } now, we can write an Int parser as
which will produce
Parsing Arithmetic Expressions
Now we can build a proper calculator!
calc0 :: Parser Int
calc0 = binExp <|> int
int :: Parser Int
int = do
xs <- many digitChar
return (read xs)
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc0
return (x `o` y) Works pretty well!
QUIZ
calc0 :: Parser Int
calc0 = binExp <|> int
int :: Parser Int
int = do
xs <- many digitChar
return (read xs)
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc0
return (x `o` y) What does quiz evaluate to?
A. [(0, "")]
B. []
C. [(10, "")]
D. [(10, "-5-5")]
E. [(5, "-5")]
Problem: Right-Associativity
Recall
"10-5-5" gets parsed as 10 - (5 - 5) because

The calc0 parser implicitly forces each operator to be right associative
doesn’t matter for
+,*but is incorrect for
-
QUIZ
Recall
What does quiz get evaluated to?
A. [(1020,"")]
B. [(120,"")]
C. [(120,""), (1020, "")]
D. [(1020,""), (120, "")]
E. []
The calc0 parser implicitly forces all operators to be right associative
doesn’t matter for
+,*but is incorrect for
-
Does not respect precedence!
Simple Fix: Parentheses!
Lets write a combinator that parses something within (...)
now we can try
calc1 :: Parser Int
calc1 = parens binExp <|> int
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc1
return (x `o` y)now the original string wont even parse
but we can add parentheses to get the right result
>>> runParser calc1 "((10-5)-5)"
[(0 ,"")]
>>> runParser calc1 "(10-(5-5))"
[(10 ,"")]
>>> runParser calc1 "((10*2)+100)"
[(120, "")]
>>> runParser calc1 "(10*(2+100))"
[(1020, "")]
Left Associativity
But how to make the parser left associative
- i.e. parse “10-5-5” as
(10 - 5) - 5?
Lets flip the order!
calc1 :: Parser Int
calc1 = binExp <|> oneInt
binExp :: Parser Int
binExp = do
x <- calc1
o <- intOp
y <- int
return (x `o` y)But …
Infinite loop! calc1 --> binExp --> calc1 --> binExp --> ...
- without consuming any input :-(
Solution: Parsing with Multiple Levels
Any expression is a sum-of-products
10 * 20 * 30 + 40 * 50 + 60 * 70 * 80
=>
((((10 * 20) * 30) + (40 * 50)) + ((60 * 70) * 80))
=>
((((base * base) * base) + (base * base)) + ((base * base) * base))
=>
(((prod * base) + prod) + (prod * base))
=>
((prod + prod) + prod)
=>
(sum + prod)
=>
sum
=>
expr
Parsing with Multiple Levels
So lets layer our language as
expr :== sum
sum :== (((prod "+" prod) "+" prod) "+" ... "+" prod)
prod :== (((base "*" base) "*" base) "*" ... "*" base)
base :== "(" expr ")" ORELSE intthat is the recursion looks like
No infinite loop!
expr --> prod --> base -->* exprbut last step
-->*consumes a(
Parsing oneOrMore
Lets implement oneOrMore vP oP as a combinator - vP parses a single a value - oP parses an operator a -> a -> a - oneOrMore vP oP parses and returns the result ((v1 o v2) o v3) o v4) o ... o vn)
But how?
grab the first
v1usingvP- continue by
- either trying
oPthenv2… and recursively continue withv1 o v2 orElse(no moreo) just returnv1
- either trying
oneOrMore :: Parser a -> Parser (a -> a -> a) -> Parser a
oneOrMore vP oP = do {v1 <- vP; continue v1}
where
continue v1 = do { o <- oP; v2 <- vP; continue (v1 `o` v2) }
<|> return v1
Implementing Layered Parser
Now we can implement the grammar
simply as
where addOp is + or - and mulOp is * or /
addOp, mulOp :: Parser (Int -> Int -> Int)
addOp = constP "+" (+) <|> constP "-" (-)
mulOp = constP "*" (*) <|> constP "/" div
constP :: String -> a -> Parser a
constP s x = do { _ <- string s; return x }Lets make sure it works!
>>> doParse sumE2 "10-1-1"
[(8,"")]
>>> doParse sumE2 "10*2+1"
[(21,"")]
>>> doParse sumE2 "10+2*1"
[(12,"")]
Parser combinators
That was a taste of Parser Combinators
- Transferred from Haskell to many other languages.
Many libraries including Parsec used in your homework - oneOrMore is called chainl
Read more about the theory - in these recent papers
Read more about the practice - in this recent post that I like JSON parsing from scratch