Writing a simple parser
This article explains how to write a simple parser, using symbolic expressions (s-expression, known from Lisp and Scheme) as an example. At the same time, it's a brief introduction to Lean.
A simple s-expression is, for example, (1 2 3)
, which is a list with three elements. However, if the input is invalid, (1 2 3
say, we want to return an error. The following data type will represent parse errors:
inductive ParseError : Type where | mk : String -> Nat -> ParseError instance : ToString ParseError where toString : ParseError -> String | .mk message line => s!"{message} (line {line})"
The String
field contains the error message and the Nat
field (which is short for natural number, a non-negative integer) contains the line number where the error occurred. We also implement the ToString
type class for this data type so we can pretty-print its contents.
When the parsing succeeds, the result is a value (for example, the input's abstract syntax tree, AST). Otherwise we return an instance of ParseError
:
def ParseResult : Type -> Type := fun a => Result a ParseError instance [ToString a] [ToString b] : ToString (ParseResult a) where toString : ParseResult a -> String | .value val => toString val | .error err => toString err instance : Pure ParseResult where pure x := .value x instance : Bind ParseResult where bind r f := match r with | .value val => f val | .error err => .error err instance [Inhabited a] : Inhabited (ParseResult a) where default := .value default
Again, we implement the ToString
type class, and also Pure
, Bind
and Inhabited
, highly technical type classes we won't go into here.
The value of an s-expression can be an identifier, an integer, a float, a string, a list of s-expressions or an empty list, represented by the nil
value:
inductive Value : Type where | ident : String -> Nat -> Value | int : Nat -> Nat -> Value | float : Float -> Nat -> Value | string : String -> Nat -> Value | list : List Value -> Nat -> Value | nil : Nat -> Value mutual def listAsString : List Value -> String | [] => "" | x :: y :: xs => x.toString ++ " " ++ listAsString (y :: xs) | x :: _ => x.toString def Value.toString : Value -> String | .ident t _ => t | .int n _ => s!"{n}" | .float x _ => s!"{x}" | .string s _ => escape s | .list l _ => s!"({listAsString l})" | .nil _ => "()" end instance : ToString Value where toString := Value.toString instance : Inhabited Value where default := .nil 0
We finally get to the code that does the actual parsing. The input of the function is a list of tokens (the input code processed by a lexer). The output is either a parse error or a parsed value and a list of the remaining tokens.
mutual partial def parseValue (tokens : List Token) : ParseResult (Value × List Token) := match tokens with | .ident text line :: tail => .value (.ident text line, tail) | .str text line :: tail => .value (.string text line, tail) | .num text1 line :: .symbol "." _ :: .num text2 _ :: tail => .value (.float (parseFloat $ text1 ++ "." ++ text2) line, tail) | .num text line :: tail => .value (.int text.toNat! line, tail) | .symbol "(" line :: tail => parseList tail >>= fun (lst, tail) => .value (if lst.isEmpty then .nil line else .list lst line, tail) | .symbol text line :: tail => .value (.ident text line, tail) | .eof line :: _ => .error $ ParseError.mk "unexpected end of file" line | _ => .error $ ParseError.mk "no tokens" 0 partial def parseList (tokens : List Token) : ParseResult (List Value × List Token) := match tokens with | .symbol ")" _ :: tail => .value ([], tail) | _ => parseValue tokens >>= fun (val, tail) => parseList tail >>= fun (lst, tail) => .value (val :: lst, tail) end
Note that the parser 1) is recursive (there can be many nested lists) and 2) works by pattern matching. For example, if the parser sees .symbol "(" line
at the beginning of the token list, it recursively calls the parseList
function.
The main
function reads the input file into a string and calls the parseValue
function:
def main (args : List String): IO UInt32 := do if args.length == 0 then IO.eprintln "no input file specified" return 1 let filename := args[0]! if not (<- FilePath.pathExists filename) then IO.eprintln "input file doesn't exist" return 1 let input <- IO.FS.readFile filename let tokens := tokenise input -- IO.println s!"{tokens}" match parseValue tokens with | .value (val, _) => IO.println s!"parsed: {val}" | .error err => IO.eprintln s!"parse error: {err}" return 0
For (1 2 3)
, the parser succeeds. For an ill-formed input, (1 2 3
say, the output is parse error: unexpected end of file (line 2)
.
This post was rather technical but it illustrates how pattern matching makes writing parsers for formal languages easier.