Before we continue …
A Word from the Sponsor!
They are just a versatile abstraction, like map
or fold
.
Parsers
A parser is a function that
- converts unstructured data (e.g.
String
, array ofByte
,…) - into structured data (e.g. JSON object, Markdown, Video…)
Every large software system contains a Parser
System | Parses |
---|---|
Shell Scripts | Command-line options |
Browsers | HTML |
Games | Level descriptors |
Routers | Packets |
Netflix | Video |
Spotify | Audio, Playlists… |
How to build Parsers?
Two standard methods
Regular Expressions
- Doesn’t really scale beyond simple things
- No nesting, recursion
Parser Generators
- Specify grammar via rules
- Tools like
yacc
,bison
,antlr
,happy
- convert grammar into executable function
Grammars Don’t Compose!
If we have two kinds of structured objects Thingy
and Whatsit
.
To parse sequences of Thingy
and Whatsit
we must duplicate the rules
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;
No nice way to reuse the sub-parsers for Whatsit
and Thingy
:-(
A New Hope: Parsers as Functions
Lets think of parsers directly as functions that
- Take as input a
String
- Convert a part of the input into a
StructuredObject
- Return the remainder unconsumed to be parsed later
A Parser a
- Converts a prefix of a
String
- Into a structured object of type
a
and - Returns the suffix
String
unchanged
Parsers Can Produce Many Results
Sometimes we want to parse a String
like
into a list of possible results
So we generalize the Parser
type to
EXERCISE
Given the definition
Implement a function
QUIZ
Given the definition
Which of the following is a valid Parser Char
- that returns the first
Char
from a string (if one exists)
-- A
oneChar = P (\cs -> head cs)
-- B
oneChar = P (\cs -> case cs of
[] -> [('', [])
c:cs -> (c, cs))
-- C
oneChar = P (\cs -> (head cs, tail cs))
-- D
oneChar = P (\cs -> [(head cs, tail cs)])
-- E
oneChar = P (\cs -> case cs of
[] -> []
cs -> [(head cs, tail cs)])
Lets Run Our First Parser!
>>> runParser oneChar "hey!"
[('h', "ey")]
>>> runParser oneChar "yippee"
[('y', "ippee")]
>>> runParser oneChar ""
[]
Failure to parse means result is an empty list!
EXERCISE
Your turn: Write a parser to grab first two chars
When you are done, we should get
QUIZ
Ok, so recall
twoChar :: Parser (Char, Char)
twoChar = P (\cs -> case cs of
c1:c2:cs' -> [((c1, c2), cs')]
_ -> [])
Suppose we had some foo
such that twoChar'
was equivalent to twoChar
What must the type of foo
be?
A. Parser (Char, Char)
B. Parser Char -> Parser (Char, Char)
C. Parser a -> Parser a -> Parser (a, a)
D. Parser a -> Parser b -> Parser (a, b)
E. Parser a -> Parser (a, a)
EXERCISE: A forEach
Loop
Lets write a function
such that we get the following behavior
>>> forEach [] (\i -> [i, i + 1])
[]
>>> forEach [10,20,30] (\i -> [show i, show (i+1)])
["10", "11", "20", "21", "30", "31"]
QUIZ
What does quiz
evaluate to?
A. [10,20,30,0,1,2]
B. [10,0,20,1,30,2]
C. [[10,11,12], [20,21,22] [30,31,32]]
D. [10,11,12,20,21,22,30,31,32]
E. [32]
A pairP
Combinator
Lets implement the above as pairP
forEach :: [a] -> (a -> [b]) -> [b]
forEach xs f = concatMap f xs
pairP :: Parser a -> Parser b -> Parser (a, b)
pairP aP bP = P (\s -> forEach (runParser aP s) (\(a, s') ->
forEach (runParser bP s') (\(b, s'') ->
((a, b), s'')
)
)
Now we can write
QUIZ
What does quiz
evaluate to?
A. [((
h,
h), "")]
B. [(
h, "")]
C. [("", "")]
D. []
E. Run-time exception
Does the Parser a
type remind you of something?
Lets implement the above as pairP
Parser
is a Monad!
Like a state transformer, Parser
is a monad!
We need to implement two functions
QUIZ
Which of the following is a valid implementation of returnP
data Parser a = P (String -> [(a, String)])
returnP :: a -> Parser a
returnP a = P (\s -> []) -- A
returnP a = P (\s -> [(a, s)]) -- B
returnP a = P (\s -> (a, s)) -- C
returnP a = P (\s -> [(a, "")]) -- D
returnP a = P (\s -> [(s, a)]) -- E
HINT: return a
should just
- “produce” the parse result
a
and - leave the string unconsumed.
Bind
Next, lets implement bindP
- we almost saw it as
pairP
bindP :: Parser a -> (a -> Parser b) -> Parser b
bindP aP fbP = P (\s ->
forEach (runParser aP s) (\(a, s') ->
forEach (runParser (fbP a) s') (\(b, s'') ->
[(b, s'')]
)
)
)
The function
- Builds the
a
values out ofaP
(usingrunParser
) - Builds the
b
values by callingfbP a
on the remainder strings'
- Returns
b
values and the remainder strings''
The Parser
Monad
We can now make Parser
an instance of Monad
And now, let the wild rumpus start!
Parser Combinators
Lets write lots of high-level operators to combine parsers!
Here’s a cleaned up pairP
Failures are the Pillars of Success!
Surprisingly useful, always fails
- i.e. returns
[]
no successful parses
QUIZ
Consider the parser
What is the value of
quiz1 = runParser (satP (\c -> c == 'h')) "hellow"
quiz2 = runParser (satP (\c -> c == 'h')) "yellow"
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('h', "ellow")] |
[('y', "ellow")] |
C | [('h', "ellow")] |
[] |
D | [] |
[('y', "ellow")] |
Parsing Alphabets and Numerics
We can now use satP
to write
-- parse ONLY the Char c
char :: Parser Char
char c = satP (\c' -> c == c')
-- parse ANY ALPHABET
alphaCharP :: Parser Char
alphaCharP = satP isAlpha
-- parse ANY NUMERIC DIGIT
digitChar :: Parser Char
digitChar = satP isDigit
QUIZ
We can parse a single Int
digit
digitInt :: Parser Int
digitInt = do
c <- digitChar -- parse the Char c
return (read [c]) -- convert Char to Int
What is the result of
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('9', "2")] |
[('c', "at")] |
C | [(9, "2")] |
[] |
D | [] |
[('c', "at")] |
EXERCISE
Write a function
when you are done, we should get the following behavior
>>> dogeP = strP "doge"
>>> runParser dogeP "dogerel"
[("doge", "rel")]
>>> runParser dogeP "doggoneit"
[]
A Choice Combinator
Lets write a combinator orElse p1 p2
such that
- returns the results of
p1
or, else if those are empty
- returns the results of
p2
:: Parser a -> Parser a -> Parser a
orElse p1 p2 = -- produce results of `p1` if non-empty
-- OR-ELSE results of `p2`
e.g. orElseP
lets us build a parser that produces an alphabet OR a numeric character
Which should produce
>>> runParser alphaNumChar "cat"
[('c', "at")]
>>> runParser alphaNumChar "2cat"
[('2', "cat")]
>>> runParser alphaNumChar "230"
[('2', "30")]
QUIZ
orElse :: Parser a -> Parser a -> Parser a
orElse p1 p2 = -- produce results of `p1` if non-empty
-- OR-ELSE results of `p2`
Which of the following implements orElse
?
-- a
orElse p1 p2 = do
r1s <- p1
r2s <- p2
return (r1s ++ r2s)
-- b
orElse p1 p2 = do
r1s <- p1
case r1s of
[] -> p2
_ -> return r1s
-- c
orElse p1 p2 = P (\cs ->
runParser p1 cs ++ runParser p2 cs
)
-- d
orElse p1 p2 = P (\cs ->
case runParser p1 cs of
[] -> runParser p2 cs
r1s -> r1s
)
An “Operator” for orElse
It will be convenient to have a short “operator” for orElse
A Simple Expression Parser
Now, lets write a tiny calculator!
-- 1. First, parse the operator
intOp :: Parser (Int -> Int -> Int)
intOp = plus <|> minus <|> times <|> divide
where
plus = do { _ <- char '+'; return (+) }
minus = do { _ <- char '-'; return (-) }
times = do { _ <- char '*'; return (*) }
divide = do { _ <- char '/'; return div }
-- 2. Now parse the expression!
calc :: Parser Int
calc = do x <- digitInt
op <- intOp
y <- digitInt
return (x `op` y)
When calc
is run, it will both parse and calculate
>>> runParser calc "8/2"
[(4,"")]
>>> runParser calc "8+2cat"
[(10,"cat")]
>>> runParser calc "8/2cat"
[(4,"cat")]
>>> runParser calc "8-2cat"
[(6,"cat")]
>>> runParser calc "8*2cat"
[(16,"cat")]
QUIZ
What will quiz
evaluate to?
calc :: Parser Int
calc = do x <- digitInt
op <- intOp
y <- digitInt
return (x `op` y)
quiz = runParser calc "99bottles"
A. Type error
B. []
C. [(9, "9bottles")]
D. [(99, "bottles")]
E. Run-time exception
Next: Recursive Parsing
Its cool to parse individual Char
…
… but way more interesting to parse recursive structures!
EXERCISE: A “Recursive” String Parser
The parser string s
parses exactly the string s
- fails otherwise
>>> runParser (string "mic") "mickeyMouse"
[("mic","keyMouse")]
>>> runParser (string "mic") "donald duck"
[]
Lets fill in an implementation
Which library function will eliminate the recursion from string
?
QUIZ: Parsing Many Times
Often we want to repeat parsing some object
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
manyP p = m0 <|> m1
where
m0 = return []
m1 = do { x <- p; xs <- manyP p; return (x:xs) }
Recall digitChar :: Parser Char
returned a single numeric Char
What will quiz
evaluate to?
A. [("" , "1234horse")]
B. [("1" , "234horse")]
C. [("1", "23horse"), ("12", "3horse"), ("123", "horse )]
D. [("123", "horse")]
E. []
Lets fix manyP
!
Run p
first and only return []
if it fails …
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
manyP p = m1 <|> m0
where
m0 = return []
m1 = do { x <- p; xs <- manyP p; return (x:xs) }
now, we can write an Int
parser as
which will produce
Parsing Arithmetic Expressions
Now we can build a proper calculator!
calc0 :: Parser Int
calc0 = binExp <|> int
int :: Parser Int
int = do
xs <- many digitChar
return (read xs)
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc0
return (x `o` y)
Works pretty well!
QUIZ
calc0 :: Parser Int
calc0 = binExp <|> int
int :: Parser Int
int = do
xs <- many digitChar
return (read xs)
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc0
return (x `o` y)
What does quiz
evaluate to?
A. [(0, "")]
B. []
C. [(10, "")]
D. [(10, "-5-5")]
E. [(5, "-5")]
Problem: Right-Associativity
Recall
"10-5-5"
gets parsed as 10 - (5 - 5)
because
The calc0
parser implicitly forces each operator to be right associative
doesn’t matter for
+
,*
but is incorrect for
-
QUIZ
Recall
What does quiz
get evaluated to?
A. [(1020,"")]
B. [(120,"")]
C. [(120,""), (1020, "")]
D. [(1020,""), (120, "")]
E. []
The calc0
parser implicitly forces all operators to be right associative
doesn’t matter for
+
,*
but is incorrect for
-
Does not respect precedence!
Simple Fix: Parentheses!
Lets write a combinator that parses something within (...)
now we can try
calc1 :: Parser Int
calc1 = parens binExp <|> int
binExp :: Parser Int
binExp = do
x <- int
o <- intOp
y <- calc1
return (x `o` y)
now the original string wont even parse
but we can add parentheses to get the right result
>>> runParser calc1 "((10-5)-5)"
[(0 ,"")]
>>> runParser calc1 "(10-(5-5))"
[(10 ,"")]
>>> runParser calc1 "((10*2)+100)"
[(120, "")]
>>> runParser calc1 "(10*(2+100))"
[(1020, "")]
Left Associativity
But how to make the parser left associative
- i.e. parse “10-5-5” as
(10 - 5) - 5
?
Lets flip the order!
calc1 :: Parser Int
calc1 = binExp <|> oneInt
binExp :: Parser Int
binExp = do
x <- calc1
o <- intOp
y <- int
return (x `o` y)
But …
Infinite loop! calc1 --> binExp --> calc1 --> binExp --> ...
- without consuming any input :-(
Solution: Parsing with Multiple Levels
Any expression is a sum-of-products
10 * 20 * 30 + 40 * 50 + 60 * 70 * 80
=>
((((10 * 20) * 30) + (40 * 50)) + ((60 * 70) * 80))
=>
((((base * base) * base) + (base * base)) + ((base * base) * base))
=>
(((prod * base) + prod) + (prod * base))
=>
((prod + prod) + prod)
=>
(sum + prod)
=>
sum
=>
expr
Parsing with Multiple Levels
So lets layer our language as
expr :== sum
sum :== (((prod "+" prod) "+" prod) "+" ... "+" prod)
prod :== (((base "*" base) "*" base) "*" ... "*" base)
base :== "(" expr ")" ORELSE int
that is the recursion looks like
No infinite loop!
expr --> prod --> base -->* expr
but last step
-->*
consumes a(
Parsing oneOrMore
Lets implement oneOrMore vP oP
as a combinator - vP
parses a single a
value - oP
parses an operator a -> a -> a
- oneOrMore vP oP
parses and returns the result ((v1 o v2) o v3) o v4) o ... o vn)
But how?
grab the first
v1
usingvP
- continue by
- either trying
oP
thenv2
… and recursively continue withv1 o v2
orElse
(no moreo
) just returnv1
- either trying
oneOrMore :: Parser a -> Parser (a -> a -> a) -> Parser a
oneOrMore vP oP = do {v1 <- vP; continue v1}
where
continue v1 = do { o <- oP; v2 <- vP; continue (v1 `o` v2) }
<|> return v1
Implementing Layered Parser
Now we can implement the grammar
simply as
where addOp
is +
or -
and mulOp
is *
or /
addOp, mulOp :: Parser (Int -> Int -> Int)
addOp = constP "+" (+) <|> constP "-" (-)
mulOp = constP "*" (*) <|> constP "/" div
constP :: String -> a -> Parser a
constP s x = do { _ <- string s; return x }
Lets make sure it works!
>>> doParse sumE2 "10-1-1"
[(8,"")]
>>> doParse sumE2 "10*2+1"
[(21,"")]
>>> doParse sumE2 "10+2*1"
[(12,"")]
Parser combinators
That was a taste of Parser Combinators
- Transferred from Haskell to many other languages.
Many libraries including Parsec used in your homework - oneOrMore
is called chainl
Read more about the theory - in these recent papers
Read more about the practice - in this recent post that I like JSON parsing from scratch