Rust, Linux, programming languages, type systems

This is why you should never use parser combinators and PEG

Let me tell you why you should (nearly) never use PEG (parsing expression grammars). Nearly everything I will say applies to parser combinators (parsec in Haskell, nom in Rust), too.

So, don't use PEG. Use CFGs (context-free grammars) instead. They are more natural. I feel that CFGs more naturally represent how we think. Thus when you have some language in your head and you try to write it down as a grammar, there is more chances that you do this successfully if you try to write CFG grammar as opposed to PEG grammar.

(All this doesn't apply to situations where you are trying to parse some language, whose specification explicitly says that the language is based on PEG formalism as opposed to CFG formalism. One good example is Python. Python spec says Python is PEG, so you should use PEG parser for parsing Python.)

CFG and PEG are incompatible. I mean that same grammar can have different meaning depending on how you interpret it: as a CFG or as a PEG. Thus decision on whether to use CFG or PEG should happen early. I. e. when you have a grammar in you head and you try to write it down, then before you wrote even single letter to a paper (or to a text file), you should decide whether you are going to write CFG or PEG grammar. (Update 5 June 2024: this is exaggeration: of course, you can change grammar formalism later, I just meant that this is not always easy.)

Okay, you may say: “Please, give me proofs that CFG is more natural than PEG”. Okay, here we go: https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html . This article properly summarizes what are the problems with PEG I'm worrying about. Especially this part: “which means that every PEG rule could be hiding a 'conceptual' ambiguity”.

And that problem actually happened to me in practice!

I tried to parse a language described by this grammar:

t0 = t1
t0 = "!!" id "::" t0 "." t0
t1 = t2
t1 = t2 "==>" t1
t2 = t3
t3 = t4
t3 = t4 "::" t0
t3 = "%" id "::" t0 "." t3
t4 = t999
t999 = t1000
t999 = t999 t1000
t1000 = id
t1000 = "(" t0 ")"

(If you are curious, this is grammar for my pure type system checker)

To do so, I wrote a Haskell program using Parsec. I blindly converted grammar above to Haskell code. My code looked like this (you don't need to understand Haskell to see my point):

data Term = Pi String Term Term | ...

t0 = t1 <|> do {
  reservedOp "!!";
  a <- id;
  reservedOp "::";
  b <- t0;
  reservedOp ".";
  c <- t0;
  return (Pi a b c);
}

...

In other words, I simply converted grammar above to code line-by-line. The code worked, all was okay. I wrote many files in my language. But then I accidentally discovered my grammar was ambiguous! Namely, text A :: B ==> C can be parsed two ways: (A :: B) ==> C and A :: (B ==> C). And parsec-based parser somehow chooses one of these parse trees!

So, now I have a lot of files written in this language, but I have no way to ensure they are parsed correctly, i. e. I have no way to ensure my parser chooses parse tree I meant!

Why this happened? This happened, because (as turned out) when I wrote that grammar, I meant it is CFG. Simply because my brain thinks in terms of CFG. But parsec is based on different formalism, which is closer to PEG. Thus CFG grammar was interpreted as something PEG-like, and thus parsec didn't understand what I meant.

And if you agree with me that grammar above in intuitive sense is ambiguous, i. e. if you agree that A :: B ==> C allows two parse trees, then your brain thinks in terms of CFG, too! CFGs are simply more natural.

CFG grammar above is ambiguous, but when interpreted as PEG it is unambiguous, because PEG grammars are always unambiguous. Ambiguity is simply hidden. Josh Haberman writes about this in the article above.

All this gave me psychological trauma (figuratively speaking). I will never use Parsec (and PEG parsers) anymore.

Also, even if you are not convinced, you still should use CFG if you try to parse some language using its specification, and that specification says that the language is CFG. You cannot use CFG grammar as a PEG grammar, these formalisms are incompatible.

Moreover, if some specification simply says “grammar” or “BNF”, then in 99% cases this means it is CFG. (For example, I was unable to find in Rust reference whether presented grammar is CFG or PEG. They simply say “grammar”. But I'm sure they meant CFG, because it is simply a convention to say in specifications “grammar”, when you mean “CFG”.)

Also, CFG grammars themselves are split further into various types: LL grammars, LR grammars, etc. But all these are simply subsets of larger set of CFG grammars. And all they are compatible in the following sense: if some grammar is, say, LL grammar and LR grammar in the same time, then that grammar means the same thing when interpreted as LL and when interpreted as LR. I. e. if you enter that grammar to LL parser and to LR parser, then these two parsers will behave same way. Different parsing libraries, which target different CFG subsets (LL, LR, etc), differ in their algorithm, in scope of supported grammars, but never in semantics of particular grammar (assuming that the grammar itself is supported in both libraries).

So, if you program in Rust, don't use nom, combine and other parser combinator libraries. And don't use PEG parsers. Use CFG-based parsers, such as lalrpop. (Unfortunately, lalrpop is restricted to LR(1) grammars [this is subset of CFG]. And I was unable to find library, which combines lalrpop benefits with support for arbitrary CFGs. It is possible some day I will write such library. If I do, then I will announce it here, so, please, subscribe!)

If you program in Haskell, don't use parsec and its zillions of clones (gigaparsec, attoparsec, megaparsec, trifecta) and most of other combinator parser libraries. Use CFG-based libraries and parser generators, such as Earley and happy. But note that it is possible to write parser combinator library, which is based on unordered choice, i. e. on CFGs. For example, Earley can be considered such a library. Also, these two parser combinator libraries are CFG-based, too: https://hackage.haskell.org/package/base-4.15.0.0/docs/Text-ParserCombinators-ReadP.html#t:ReadS , https://hackage.haskell.org/package/transformers-0.5.5.0/docs/Control-Monad-Trans-Class.html#g:5 . So, parser combinators are not necessary evil.

Unfortunately, Earley crashes on some particular tricky grammar ( https://github.com/ollef/Earley/issues/54 ). Also, there are other things I don't like in Earley and happy. It is possible I will write some better library in the future. If I do, I will announce it here, so, please, subscribe! Also, see my Haskell work below.

That said, CFG is good for languages, which are “structured enough”, “well enough designed”. If you want to parse some ad hoc language, for example, some ad hoc logs, then PEG and parser combinators may be a better solution. CFG is good, when you know the language, which you want to parse, beforehand. Especially if it has specification. CFG parsing is good for “well designed languages”, such as Pascal, Javascript, Rust. It is bad for less structured languages, such as Bash, TeX, C, C++.

I'm biased, because I usually parse languages, which are either designed by me, either are “structured enough” with well written specifications.

Also, if this post gets popular, I want to take chance to advertise my own Haskell parsing libraries. Here we go:

Here are some articles about parsing and CFGs. Their authors are better writers than me.
– Already mentioned https://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html
https://tratt.net/laurie/blog/2020/which_parsing_approach.html

Subscribe to this blog using button in the bottom of https://safinaskar.writeas.com/ or at Mastodon ( https://types.pl/@safinaskar ). I plan to write about formal methods, Rust and Linux.

You can reach me via email safinaskar@gmail.com

Discuss at:
https://lobste.rs/s/nybhsl/this_is_why_you_should_never_use_parser
https://types.pl/@safinaskar/112553410605946965 (Mastodon)
https://www.reddit.com/r/rust/comments/1d77mpa/this_is_why_you_should_never_use_parser/