Before we continue …
A Word from the Sponsor!
Don't Fear Monads
They are just a versatile abstraction, like map
or fold
.
Parsers
A parser is a function that
- converts unstructured data (e.g.
String
, array ofByte
,…) - into structured data (e.g. JSON object, Markdown, Video…)
type Parser = String -> StructuredObject
Every large software system contains a Parser
System | Parses |
---|---|
Shell Scripts | Command-line options |
Browsers | HTML |
Games | Level descriptors |
Routers | Packets |
Netflix | Video |
Spotify | Audio, Playlists… |
How to build Parsers?
Two standard methods
Regular Expressions
- Doesn’t really scale beyond simple things
- No nesting, recursion
Parser Generators
- Specify grammar via rules
Expr : Var { EVar $1 }
| Num { ENum $1 }
| Expr Op Expr { EBin $1 $2 $3 }
| '(' Expr ')' { $2 }
;
- Tools like
yacc
,bison
,antlr
,happy
- convert grammar into executable function
Grammars Don’t Compose!
If we have two kinds of structured objects Thingy
and Whatsit
.
Thingy : rule { action }
;
Whatsit : rule { action }
;
To parse sequences of Thingy
and Whatsit
we must duplicate the rules
Thingies : Thingy Thingies { ... }
EmptyThingy { ... }
;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... }
;
No nice way to reuse the sub-parsers for Whatsit
and Thingy
:-(
A New Hope: Parsers as Functions
Lets think of parsers directly as functions that
- Take as input a
String
- Convert a part of the input into a
StructuredObject
- Return the remainder unconsumed to be parsed later
data Parser a = P (String -> (a, String))
A Parser a
- Converts a prefix of a
String
- Into a structured object of type
a
and - Returns the suffix
String
unchanged
Parsers Can Produce Many Results
Sometimes we want to parse a String
like
"2 - 3 - 4"
into a list of possible results
Minus (Minus 2 3) 4), Minus 2 (Minus 3 4)] [(
So we generalize the Parser
type to
data Parser a = P (String -> [(a, String)])
EXERCISE
Given the definition
data Parser a = P (String -> [(a, String)])
Implement a function
runParser :: Parser a -> String -> [(a, String)]
= ??? runParser p s
QUIZ
Given the definition
data Parser a = P (String -> [(a, String)])
Which of the following is a valid oneChar :: Parser Char
that returns the first Char
from a string (if one exists)
-- A
= P (\cs -> head cs)
oneChar
-- B
= P (\cs -> case cs of
oneChar -> [('', [])]
[] :cs -> [c, cs])
c
-- C
= P (\cs -> (head cs, tail cs))
oneChar
-- D
= P (\cs -> [(head cs, tail cs)])
oneChar
-- E
= P (\cs -> case cs of
oneChar -> []
[] -> [(head cs, tail cs)]) cs
Lets Run Our First Parser!
>>> runParser oneChar "hey!"
'h', "ey")]
[(
>>> runParser oneChar "yippee"
'y', "ippee")]
[(
>>> runParser oneChar ""
[]
Failure to parse means result is an empty list!
EXERCISE
Your turn: Write a parser to grab first two chars
twoChar :: Parser (Char, Char)
= P (\cs -> ???) twoChar
When you are done, we should get
>>> runParser twoChar "hey!"
'h', 'e'), "y!")]
[((
>>> runParser twoChar "h"
[]
QUIZ
Ok, so recall
twoChar :: Parser (Char, Char)
= P (\cs -> case cs of
twoChar :c2:cs' -> [((c1, c2), cs')]
c1-> []) _
Suppose we had some foo
such that twoChar'
was equivalent to twoChar
twoChar' :: Parser (Char, Char)
= foo oneChar oneChar twoChar'
What must the type of foo
be?
A. Parser (Char, Char)
B. Parser Char -> Parser (Char, Char)
C. Parser a -> Parser a -> Parser (a, a)
D. Parser a -> Parser b -> Parser (a, b)
E. Parser a -> Parser (a, a)
EXERCISE: A forEach
Loop
Lets write a function
forEach :: [a] -> (a -> [b]) -> [b]
= ??? forEach xs f
such that we get the following behavior
>>> forEach [] (\i -> [i, i + 1])
[]
>>> forEach [10,20,30] (\i -> [show i, show (i+1)])
"10", "11", "20", "21", "30", "31"] [
QUIZ
What does quiz
evaluate to?
= forEach [10, 20, 30] (\i ->
quiz 0, 1, 2] (\j ->
forEach [+ j]
[i
) )
A. [10,20,30,0,1,2]
B. [10,0,20,1,30,2]
C. [[10,11,12], [20,21,22] [30,31,32]]
D. [10,11,12,20,21,22,30,31,32]
E. [32]
A pairP
Combinator
Lets implement the above as pairP
forEach :: [a] -> (a -> [b]) -> [b]
= concatMap f xs
forEach xs f
pairP :: Parser a -> Parser b -> Parser (a, b)
= P (\s -> forEach (runParser aP s) (\(a, s') ->
pairP aP bP ->
forEach (runParser bP s') (\(b, s'')
((a, b), s'')
) )
Now we can write
= pairP oneChar oneChar twoChar
QUIZ
What does quiz
evaluate to?
= pairP oneChar oneChar
twoChar
= runParser twoChar "h" quiz
A. [((
h,
h), "")]
B. [(
h, "")]
C. [("", "")]
D. []
E. Run-time exception
Does the Parser a
type remind you of something?
Lets implement the above as pairP
data Parser a = P (String -> [(a, String)])
data ST s a = S (s -> (a, s))
Parser
is a Monad!
Like a state transformer, Parser
is a monad!
We need to implement two functions
returnP :: a -> Parser a
bindP :: Parser a -> (a -> Parser b) -> Parser b
QUIZ
Which of the following is a valid implementation of returnP
data Parser a = P (String -> [(a, String)])
returnP :: a -> Parser a
= P (\s -> []) -- A
returnP a
= P (\s -> [(a, s)]) -- B
returnP a
= P (\s -> (a, s)) -- C
returnP a
= P (\s -> [(a, "")]) -- D
returnP a
= P (\s -> [(s, a)]) -- E returnP a
HINT: return a
should just
- “produce” the parse result
a
and - leave the string unconsumed.
Bind
Next, lets implement bindP
- we almost saw it as
pairP
bindP :: Parser a -> (a -> Parser b) -> Parser b
= P (\s ->
bindP aP fbP ->
forEach (runParser aP s) (\(a, s') ->
forEach (runParser (fbP a) s') (\(b, s'')
[(b, s'')]
)
) )
The function
- Builds the
a
values out ofaP
(usingrunParser
) - Builds the
b
values by callingfbP a
on the remainder strings'
- Returns
b
values and the remainder strings''
The Parser
Monad
We can now make Parser
an instance of Monad
instance Monad Parser where
>>=) = bindP
(return = returnP
And now, let the wild rumpus start!
Parser Combinators
Lets write lots of high-level operators to combine parsers!
Here’s a cleaned up pairP
pairP :: Parser a -> Parser b -> Parser (a, b)
= do
pairP aP bP <- aP
a <- bP
b return (a, b)
Failures are the Pillars of Success!
Surprisingly useful, always fails
- i.e. returns
[]
no successful parses
failP :: Parser a
= P (\_ -> []) failP
QUIZ
Consider the parser
satP :: (Char -> Bool) -> Parser Char
= do
satP p <- oneChar
c if p c then return c else failP
What is the value of
= runParser (satP (\c -> c == 'h')) "hellow"
quiz1 = runParser (satP (\c -> c == 'h')) "yellow" quiz2
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('h', "ellow")] |
[('y', "ellow")] |
C | [('h', "ellow")] |
[] |
D | [] |
[('y', "ellow")] |
Parsing Alphabets and Numerics
We can now use satP
to write
-- parse ONLY the Char c
char :: Parser Char
= satP (\c' -> c == c')
char c
-- parse ANY ALPHABET
alphaCharP :: Parser Char
= satP isAlpha
alphaCharP
-- parse ANY NUMERIC DIGIT
digitChar :: Parser Char
= satP isDigit digitChar
QUIZ
We can parse a single Int
digit
digitInt :: Parser Int
= do
digitInt <- digitChar -- parse the Char c
c return (read [c]) -- convert Char to Int
What is the result of
= runParser digitInt "92"
quiz1 = runParser digitInt "cat" quiz2
quiz1 |
quiz2 |
|
---|---|---|
A | [] |
[] |
B | [('9', "2")] |
[('c', "at")] |
C | [(9, "2")] |
[] |
D | [] |
[('c', "at")] |
EXERCISE
Write a function
strP :: String -> Parser String
= -- parses EXACTLY the String s and nothing else strP s
when you are done, we should get the following behavior
>>> dogeP = strP "doge"
>>> runParser dogeP "dogerel"
"doge", "rel")]
[(
>>> runParser dogeP "doggoneit"
[]
A Choice Combinator
Lets write a combinator orElse p1 p2
such that
- returns the results of
p1
or, else if those are empty
- returns the results of
p2
:: Parser a -> Parser a -> Parser a
= -- produce results of `p1` if non-empty
orElse p1 p2 -- OR-ELSE results of `p2`
e.g. orElseP
lets us build a parser that produces an alphabet OR a numeric character
alphaNumChar :: Parser Char
= alphaChar `orElse` digitChar alphaNumChar
Which should produce
>>> runParser alphaNumChar "cat"
'c', "at")]
[(
>>> runParser alphaNumChar "2cat"
'2', "cat")]
[(
>>> runParser alphaNumChar "230"
'2', "30")] [(
QUIZ
orElse :: Parser a -> Parser a -> Parser a
= -- produce results of `p1` if non-empty
orElse p1 p2 -- OR-ELSE results of `p2`
Which of the following implements orElse
?
-- a
= do
orElse p1 p2 <- p1
r1s <- p2
r2s return (r1s ++ r2s)
-- b
= do
orElse p1 p2 <- p1
r1s case r1s of
-> p2
[] -> return r1s
_
-- c
= P (\cs ->
orElse p1 p2 ++ runParser p2 cs
runParser p1 cs
)
-- d
= P (\cs ->
orElse p1 p2 case runParser p1 cs of
-> runParser p2 cs
[] -> r1s
r1s )
An “Operator” for orElse
It will be convenient to have a short “operator” for orElse
<|> p2 = orElse p1 p2 p1
A Simple Expression Parser
Now, lets write a tiny calculator!
-- 1. First, parse the operator
intOp :: Parser (Int -> Int -> Int)
= plus <|> minus <|> times <|> divide
intOp where
= do { _ <- char '+'; return (+) }
plus = do { _ <- char '-'; return (-) }
minus = do { _ <- char '*'; return (*) }
times = do { _ <- char '/'; return div }
divide
-- 2. Now parse the expression!
calc :: Parser Int
= do x <- digitInt
calc <- intOp
op <- digitInt
y return (x `op` y)
When calc
is run, it will both parse and calculate
>>> runParser calc "8/2"
4,"")]
[(
>>> runParser calc "8+2cat"
10,"cat")]
[(
>>> runParser calc "8/2cat"
4,"cat")]
[(
>>> runParser calc "8-2cat"
6,"cat")]
[(
>>> runParser calc "8*2cat"
16,"cat")] [(
QUIZ
What will quiz
evaluate to?
calc :: Parser Int
= do x <- digitInt
calc <- intOp
op <- digitInt
y return (x `op` y)
= runParser calc "99bottles" quiz
A. Type error
B. []
C. [(9, "9bottles")]
D. [(99, "bottles")]
E. Run-time exception
Next: Recursive Parsing
Its cool to parse individual Char
…
… but way more interesting to parse recursive structures!
"((2 + 10) * (7 - 4)) * (5 + 2)"
EXERCISE: A “Recursive” String Parser
The parser string s
parses exactly the string s
- fails otherwise
>>> runParser (string "mic") "mickeyMouse"
"mic","keyMouse")]
[(
>>> runParser (string "mic") "donald duck"
[]
Lets fill in an implementation
string :: String -> Parser String
= ??? string s
Which library function will eliminate the recursion from string
?
QUIZ: Parsing Many Times
Often we want to repeat parsing some object
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
= m0 <|> m1
manyP p where
= return []
m0 = do { x <- p; xs <- manyP p; return (x:xs) } m1
Recall digitChar :: Parser Char
returned a single numeric Char
What will quiz
evaluate to?
= runParser (manyP digitChar) "123horse" quiz
A. [("" , "1234horse")]
B. [("1" , "234horse")]
C. [("1", "23horse"), ("12", "3horse"), ("123", "horse )]
D. [("123", "horse")]
E. []
Lets fix manyP
!
Run p
first and only return []
if it fails …
-- | `manyP p` repeatedly runs `p` to return a list of [a]
manyP :: Parser a -> Parser [a]
= m1 <|> m0
manyP p where
= return []
m0 = do { x <- p; xs <- manyP p; return (x:xs) } m1
now, we can write an Int
parser as
int :: Parser Int
= do { xs <- manyP digitChar; return (read xs) } int
which will produce
>>> runParser oneChar "123horse"
"123", "horse")]
[(
>>> runParser int "123horse"
123, "horse")] [(
Parsing Arithmetic Expressions
Now we can build a proper calculator!
calc0 :: Parser Int
= binExp <|> int
calc0
int :: Parser Int
= do
int <- many digitChar
xs return (read xs)
binExp :: Parser Int
= do
binExp <- int
x <- intOp
o <- calc0
y return (x `o` y)
Works pretty well!
>>> runParser calc0 "11+22+33"
66,"")]
[(
> doParse calc0 "11+22-33"
ghci0,"")] [(
QUIZ
calc0 :: Parser Int
= binExp <|> int
calc0
int :: Parser Int
= do
int <- many digitChar
xs return (read xs)
binExp :: Parser Int
= do
binExp <- int
x <- intOp
o <- calc0
y return (x `o` y)
What does quiz
evaluate to?
= runParser calc0 "10-5-5" quiz
A. [(0, "")]
B. []
C. [(10, "")]
D. [(10, "-5-5")]
E. [(5, "-5")]
Problem: Right-Associativity
Recall
binExp :: Parser Int
= do
binExp <- int
x <- intOp
o <- calc0
y return (x `o` y)
"10-5-5"
gets parsed as 10 - (5 - 5)
because
The calc0
parser implicitly forces each operator to be right associative
doesn’t matter for
+
,*
but is incorrect for
-
QUIZ
Recall
binExp :: Parser Int
= do
binExp <- int
x <- intOp
o <- calc0
y return (x `o` y)
What does quiz
get evaluated to?
= runParser calc0 "10*2+100" quiz
A. [(1020,"")]
B. [(120,"")]
C. [(120,""), (1020, "")]
D. [(1020,""), (120, "")]
E. []
The calc0
parser implicitly forces all operators to be right associative
doesn’t matter for
+
,*
but is incorrect for
-
Does not respect precedence!
Simple Fix: Parentheses!
Lets write a combinator that parses something within (...)
parensP :: Parser a -> Parser a
= do
parensP p <- char '('
_ <- p
x <- char ')'
_ return x
now we can try
calc1 :: Parser Int
= parens binExp <|> int
calc1
binExp :: Parser Int
= do
binExp <- int
x <- intOp
o <- calc1
y return (x `o` y)
now the original string wont even parse
>>> runParser calc1 "10-5-5"
[]
but we can add parentheses to get the right result
>>> runParser calc1 "((10-5)-5)"
0 ,"")]
[(
>>> runParser calc1 "(10-(5-5))"
10 ,"")]
[(
>>> runParser calc1 "((10*2)+100)"
120, "")]
[(
>>> runParser calc1 "(10*(2+100))"
1020, "")] [(
Left Associativity
But how to make the parser left associative
- i.e. parse “10-5-5” as
(10 - 5) - 5
?
Lets flip the order!
calc1 :: Parser Int
= binExp <|> oneInt
calc1
binExp :: Parser Int
= do
binExp <- calc1
x <- intOp
o <- int
y return (x `o` y)
But …
>>> runParser calc1 "2+2"
...
Infinite loop! calc1 --> binExp --> calc1 --> binExp --> ...
- without consuming any input :-(
Solution: Parsing with Multiple Levels
Any expression is a sum-of-products
10 * 20 * 30 + 40 * 50 + 60 * 70 * 80
=>
10 * 20) * 30) + (40 * 50)) + ((60 * 70) * 80))
((((=>
* base) * base) + (base * base)) + ((base * base) * base))
((((base =>
* base) + prod) + (prod * base))
(((prod =>
+ prod) + prod)
((prod =>
sum + prod)
(=>
sum
=>
expr
Parsing with Multiple Levels
So lets layer our language as
:== sum
expr sum :== (((prod "+" prod) "+" prod) "+" ... "+" prod)
:== (((base "*" base) "*" base) "*" ... "*" base)
prod :== "(" expr ")" ORELSE int base
that is the recursion looks like
= sum
expr sum = oneOrMore prod "+"
= oneOrMore base "*"
prod = "(" expr ")" <|> int base
No infinite loop!
expr --> prod --> base -->* expr
but last step
-->*
consumes a(
Parsing oneOrMore
Lets implement oneOrMore vP oP
as a combinator
- vP
parses a single a
value
- oP
parses an operator a -> a -> a
- oneOrMore vP oP
parses and returns the result ((v1 o v2) o v3) o v4) o ... o vn)
But how?
grab the first
v1
usingvP
continue by
- either trying
oP
thenv2
… and recursively continue withv1 o v2
orElse
(no moreo
) just returnv1
- either trying
oneOrMore :: Parser a -> Parser (a -> a -> a) -> Parser a
= do {v1 <- vP; continue v1}
oneOrMore vP oP where
= do { o <- oP; v2 <- vP; continue (v1 `o` v2) }
continue v1 <|> return v1
Implementing Layered Parser
Now we can implement the grammar
= sum
expr sum = oneOrMore prod "+"
= oneOrMore base "*"
prod = "(" expr ")" <|> int base
simply as
= sum
expr sum = oneOrMore prod addOp
= oneOrMore base mulOp
prod = parens expr <|> int base
where addOp
is +
or -
and mulOp
is *
or /
mulOp :: Parser (Int -> Int -> Int)
addOp,= constP "+" (+) <|> constP "-" (-)
addOp = constP "*" (*) <|> constP "/" div
mulOp
constP :: String -> a -> Parser a
= do { _ <- string s; return x } constP s x
Lets make sure it works!
>>> doParse sumE2 "10-1-1"
8,"")]
[(
>>> doParse sumE2 "10*2+1"
21,"")]
[(
>>> doParse sumE2 "10+2*1"
12,"")] [(
Parser combinators
That was a taste of Parser Combinators
- Transferred from Haskell to many other languages.
Many libraries including Parsec used in your homework
- oneOrMore
is called chainl
Read more about the theory - in these recent papers
Read more about the practice - in this recent post that I like JSON parsing from scratch