Pratt Parsing

12 Aug 2023, Ochawin A.

I was first introduced to the concept of Pratt parsing (top down operator precedence) while reading Bob Nystrom’s remarkable crafting interpreters, a book which I think everyone should read at least once in their career. Pratt parsing, named after its creator Vaughan Pratt, is a powerful technique used in programming language parsing.

In this post, I will be demonstrating the concept of Pratt parsing by showing you how we can implement it for a basic expression calculator.

Admittedly it has been quite some time since I’ve done any parsing, but I’m hoping to brush off the dust and refresh my memory and hopefully be able to cement the concept for good after this post.

Now with the introduction out of the way, let’s dive straight in.

1. What is Parsing

So before anything, I’d like to make sure we’re on the same page here. It’s important you understand what parsing is.

Parsing is the process of taking apart a sentence to understand its structure using certain rules. Imagine you have a bunch of words, and you want to arrange them in a way that makes sense. You follow some rules to do this, and once you’re done, you can see patterns in how the words fit together, making the sentence meaningful.

I think the concept can be best explained with an example. Consider the following expression: $$ V = 2 \cdot 3 + 3. $$ What is the resulting value of V here?

You see, in this example there is actually two different ways we can evaluate the expression which will give us two very different results.

Way A: $$ (2 \cdot 3) + 3 = 9, $$ Way B: $$ 2 \cdot (3 + 3) = 12. $$

So… what’s the right answer? Well, with respect to BIDMAS (or PEMDAS), we know that multiplication $(2 \cdot 3)$ takes precedence over the addition operation $(3 + 3)$, which results in expression A.

It might not be obvious, but internally, what we just did was parsing the expression using the grammatical rules of BIDMAS.

So, in short, parsing is kind of like the process of stitching a bunch of symbols together, guided by a set of rules to produce a result which is meaningful in the language of the grammar it conforms to. It is a translation.

                                                         (+)
                                          Parse          / \
                             2 * 3 + 3    ----->       (*)   3
                                                       / \
                                                      2   3
Note: The top of the expression/parse tree is evaluated last.
           As we traverse down the tree, the precedence level increases.

So before I get into the actual meat of this post, let me just say, parsing is very important, especially in programming language processing. It is responsible for translating our human-readable source code into machine-executable instructions. In fact, without effective parsing we would not be able to enjoy the convenience of high level programming languages.

2. The Goal

To demonstrate how the Pratt parser works, we’re going to be building a basic Pratt parser for evaluating math expressions from scratch. I could bore you with heaps of nitty-gritty details about the technique but that’s probably not what you’re here for, and quite frankly, neither of us would enjoy that. So instead, I will sprinkle relevant details I think you should know at each stage of the implementation.

Let’s get started.

3. Lexical Analysis

Since I’m building everything from complete scratch, I’m going to have to build a lexer first. The role of the lexer is to simply convert raw strings of the source code into tokens. The reason we need to tokenize everything is because by classifying each component, they become easier to work with.

Since this is only for demonstrative purposes I’m going to keep it as simple as possible.

Our token will only contain the token type and its string value (referred to as the lexeme).

Defining tokens
from enum import Enum
from dataclasses import dataclass

# All different token types we have:
# 1. Number
# 2. Operator
# 3. Parenthesis
# 4. Illegal & End Of File
TokenType = Enum(
    "Token",
    ["ILLEGAL", "NUMBER", "PLUS", "MINUS", "MULT", "DIV", "POW", "LPAREN", "RPAREN", "EOF"],
)

# A token is just a dataclass holding the
# type of the token and it's raw string value.
@dataclass
class Token:
    token_type: TokenType
    lexeme: str

# The End of File token. We have to place this manually.
EOF_TOKEN = Token(TokenType.EOF, "")

Now that we have defined our tokens, we can make a function to match and return an appropriate token given an input string.

The tokenizer
def isnumber(input: str) -> bool:
    """
    Returns true if the input represents an integer or float.
    """
    return input.replace(".", "").isdigit()

def tokenize(input: str) -> Token:
    """
    Given a string, return its relevant token.
    """
    token_type: TokenType

    if input == "+":
        token_type = TokenType.PLUS
    elif input == "-":
        token_type = TokenType.MINUS
    elif input == "/":
        token_type = TokenType.DIV
    elif input == "*":
        token_type = TokenType.MULT
    elif input == "(":
        token_type = TokenType.LPAREN
    elif input == ")":
        token_type = TokenType.RPAREN
    elif input == "^":
        token_type = TokenType.POW
    elif isnumber(input):
        token_type = TokenType.NUMBER
    else:
        token_type = TokenType.ILLEGAL

    return Token(token_type, input)

Now implementing our (crude) lexer becomes trivial.

The lexer
def lex(input: str) -> List[Token]:
    """
    Given an input, split all words by a single
    space, trim and tokenize each of the words.
    Note: The EOF token is manually appended.
    """
    return [tokenize(word.strip()) for word in input.split(" ")] + [EOF_TOKEN]
Note: With this approach, the usage of split means that every single symbol has to    be separated by a single space. It isn't ideal but this can be fixed later.

So now we can properly tokenize any given expression like this:

Tokenizing an expression
input = "1 + ( 2 * 3 ) - 4 / 2"
token_stream = lex(input) 
# [Token(token_type=<Token.NUMBER: 2>, lexeme='1'), Token(token_type=<Token.PLUS: 3>, ...

Superb, everything is almost ready, we just have to set up our parser class now. Here are the methods we need:

  • A method to read the current token
  • A method to increment the position of the current token
  • A method to return the next token
  • A method to check whether our current token matches what we expect
The parser class
class Parser:

    # Lex the input source and store
    def __init__(self, input: str):
        self.current = 0
        self.tokens = lex(input)

    # Returns the current token
    def read(self) -> Token:
        return self.tokens[self.current]

    # Returns the current token and increment the position
    def advance(self) -> Token:
        token = self.read()
        self.current += 1
        return token

    # Returns the next token
    def peakNext(self) -> Token:
        return self.tokens[self.current + 1]

    # Returns whether the current token matches the expected token
    def expect(self, token: TokenType) -> bool:
        return self.read().token_type == token 

Alright great, now our parser is sufficient.

4. Prefix Expressions

Alas, we’ve reached the first boss. Incase you’re unfamiliar, a prefix expression is either a number 2, 5, 10, or an unary expression which is an expression preceded by an operator -2, -5, -(-10). In Pratt parsing, this is the first thing you parse, so let’s setup to do exactly that.

The parser class
Expression = Any

class Parser:
    # Code omitted...

    def parseExpression(self) -> Expression:
        token = self.advance()

        # Parse the prefix expression based on current token.
        expr = self.parsePrefixExpression(token)

        # We always expect an expression, if there isn't one then it is an error.
        if expr is None:
            return None

        return expr
        

Constant Expressions

Great, now we can start with parsing the simplest form of an expression: constant expressions (2, 5, 10). We’ll begin by defining a new type.

Defining the ConstantExpression type
Expression = Any

@dataclass 
class ConstantExpression:
    expr: float

    # For debugging.
    def __str__(self):
        return self.expr

Parsing a constant value is trivial, we just have to convert the lexeme into its numerical value.

The parser class
def parseConstantExpression(self, token: Token) -> ConstantExpression:
    return ConstantExpression(float(token.lexeme))

Now we just have to wire it into our parsePrefixExpression method.

The parser class
def parsePrefixExpression(self, token: Token) -> Expression:
    match token.token_type:
        case TokenType.NUMBER: # Constant Expression
            return self.parseConstantExpression(token)
        case _: # Unexpected token found
            print(f"Error, no matching prefix rule found for: {token}")
            return None

And now we can actually parse numbers! Let’s give it a try.

Parsing constant expressions
Parser("2").parseExpression()   # ConstantExpression(expr=2.0)
Parser("42").parseExpression()  # ConstantExpression(expr=42.0)
Parser("900").parseExpression() # ConstantExpression(expr=900.0)

Unary Expressions

Now let’s move onto parsing unary expressions. Again, we have to add a new type.

Defining the UnaryExpression type
@dataclass
class UnaryExpression:
    operator_token: Token
    expr: Expression

    # For debugging.
    def __str__(self):
        return f"{self.operator_token.lexeme}({self.expr})"

Notice that the unary expression object consists of two members, one holding the prefix operator token (+, -, etc) and the other one holding the expression it applies to.

Getting the operator token is trivial, we can extract it straight from the current token we’re reading. But now here’s the real question, how on Earth will we work out the expression? If only we had a function which could figure it out for us…

Oh! There is nothing stopping us from calling parseExpression! We can just throw the work of figuring out the expression to it. Now parsing unary expressions becomes trivial.

The parser class
def parseUnaryExpression(self, token: Token) -> UnaryExpression:
    return UnaryExpression(token, self.parseExpression())
Now we wire it back.

Note: We can easily support more unary operators by just adding more cases.
The parser class
def parsePrefixExpression(self, token: Token) -> Expression:
    match token.token_type:
        # ...
        case TokenType.MINUS: # Unary Expression
            return self.parseUnaryExpression(token)
        # ...

Great, now we can try parsing some unary expressions.

Parsing unary expressions
a = Parser("- 2").parseExpression()
# a = UnaryExpression(
#       operator_token=<Token.MINUS: 4>,  
#       expr=ConstantExpression(expr=2.0)
# )

print(a) # -(2)

b = Parser("- - 42").parseExpression()
# b = UnaryExpression(
#       operator_token=<Token.MINUS: 4>, 
#       expr=UnaryExpression(
#               operator_token=<Token.MINUS: 4>, 
#               expr=ConstantExpression(expr=42.0)
#       )
# )

print(b) # -(-(42))

Now that the expression tree can be constructed properly, let’s take it a step further and actually be able to evaluate the expression.

Adding a method to evaluate prefix expressions
@dataclass
class ConstantExpression:
    # ...
    # Return the constant value
    def value(self):
        return self.expr

@dataclass
class UnaryExpression:
    # ...
    def value(self):
        match self.operator_token:
            # Negate the result of the expression
            case TokenType.MINUS:
                return -self.expr.value()
            case _:
                print(f"Error! Can't evaluate the value for {self.operator_token}")
                return None

As you can see the evaluation process is very simple. The constant expression is our base case, the call to value() will simply return the integer which it holds. As for the unary expression, we simply have to call value() on the expression and apply the operator.

Evaluating prefix expressions
Parser("- 2").parseExpression().value()       # -2.0
Parser("- - 42").parseExpression().value()    # 42.0
Parser("- - - 900").parseExpression().value() # -900.0

And now we’re done, yep that’s it for prefixes.

5. Infix Expressions

It’s time for the second boss, the infix expression. Again, incase you’re unfamiliar, an infix expressions is an expression with an operator in between two operands.

As a first step, we are aiming to parse the following expression: 2 + 2.

Let’s see how our parser fares so far.

Evaluating infix expressions
Parser("2 + 2").parseExpression() # ConstantExpression(expr=2.0)

To no surprise, it isn’t working properly, but thankfully we can clearly see why. The problem is that we stop parsing right after we have finished with the prefix expression.

Let’s fix that.

Once again, we begin by defining a new expression type.

Defining the BinaryExpression Type
@dataclass
class BinaryExpression:
    lhs: Expression
    operator_token: Token
    rhs: Expression

    # For debugging.
    def __str__(self):
        return f"({self.lhs} {self.operator_token.lexeme} {self.rhs})"
Now we can modify our parseExpression method to check whether the current token is EOF or not. If it isn’t then we parse the infix expression.

The parser class
def parseExpression(self) -> Expression:
    # Parse prefix...

    # Parse infix expression
    if !self.expect(TokenType.EOF):    
        token = self.advance()
        expr = self.parseInfixExpression(token, expr)

    return expr

The parseInfixExpression method is almost a mirror of the parsePrefixExpression, the only difference is that we need to also pass in the left hand side expression.

The parser class
def parseInfixExpression(self, token: Token, expr: Expression):
    match token.token_type:
        case: ... # all binary tokens... (+ - ^ * /)
            return BinaryExpression(
                expr, 
                token, 
                self.parseExpression()
            )
        case _:
            print(f"Error, no matching infix rule found for: {token}")
            return None

Ha, wasn’t that easy? Let’s try parsing again.

Evaluating infix expressions
print(Parser("2 + 2").parseExpression())             # (2 + 2)
print(Parser("2 + 2 + 3").parseExpression())         # (2 + (2 + 3))
print(Parser("1 + 2 + 3 + 4 + 5").parseExpression()) # (1 + (2 + (3 + (4 + 5))))

Great now that we can parse infixes, let’s make a method for evaluating them.

Adding a method to evaluate binary expressions
@dataclass
class BinaryExpression:
    # ...

    def value(self):
        match self.operator_token.token_type:
            case TokenType.PLUS:
                return self.lhs.value() + self.rhs.value()
            case TokenType.MINUS:
                return self.lhs.value() - self.rhs.value()
            case TokenType.MULT:
                return self.lhs.value() * self.rhs.value()
            case TokenType.DIV:
                return self.lhs.value() / self.rhs.value()
            case TokenType.POW:
                return self.lhs.value() ** self.rhs.value()
            case _:
                print(f"Error! Can't evaluate the value for {self.operator_token}")
                return None
Evaluating infix expressions
print(Parser("2 + 2").parseExpression().value())             # 4.0
print(Parser("2 + 2 + 3").parseExpression().value())         # 7.0
print(Parser("1 + 2 + 3 + 4 + 5").parseExpression().value()) # 15.0

Okay, but we have a new problem now, consider the following:

Not handling infix precedence
a = Parser("2 * 3 + 3").parseExpression()
# BinaryExpression(
#     lhs=ConstantExpression(expr=2.0), 
#     operator_token=<Token.MULT: 5>, 
#     rhs=BinaryExpression(
#         lhs=ConstantExpression(expr=3.0),
#         operator_token=<Token.PLUS: 3>, 
#         rhs=ConstantExpression(expr=3.0)
#     )
# )

print(a) # (2 * (3 + 3))  <-- Oh no!

As you can see the parser is grouping all operands to the right, it is treating all operators as right-associative. To fix this, we have to introduce the concept of operator precedence to the parser.

6. Handling Precedence

Allow me to introduce you to the precedence table, it should look familiar.

The precedence table
# Precedence: Low -> High
# This is just BIDMAS
class Precedence(IntEnum):
    NONE           = 0 # Reserved for parsing termination.
    SUBTRACTION    = 1
    ADDITION       = 2
    MULTIPLICATION = 3
    DIVISION       = 4
    INDICES        = 5

def getPrecedence(token: TokenType) -> Precedence:
    """
    Given an infix operator, return its precedence level.
    """

    match token:
        case TokenType.PLUS:
            return Precedence.ADDITION
        case TokenType.MINUS:
            return Precedence.SUBTRACTION
        case TokenType.MULT:
            return Precedence.MULTIPLICATION
        case TokenType.DIV:
            return Precedence.DIVISION
        case TokenType.LPAREN:
            return Precedence.BRACKET
        case TokenType.POW:
            return Precedence.INDICES
        case _:
            return Precedence.NONE
Note: The precedence table will differ depending on what you are parsing.

Now we refactor the parseExpression method to pass in the precedence as a parameter. We will use it to determine whether we parse the infix or not.

The parser class
# Note: I am setting a default value so parsePrefixExpression does not throw an error.
def parseExpression(self, precedence: Precedence = Precedence.SUBTRACTION) -> Expression:
    # Parse prefix expression...

    # We will parse the infix IF the precedence of the 
    # operator is higher or equal to the current context's precedence.
    if precedence <= getPrecedence(self.read().token_type):
        token = self.advance()
        expr = self.parseInfixExpression(token, expr)

    return expr

We have to refactor parseInfixExpression to use the infix operator’s precedence.

The parser class
def parseInfixExpression(self, token: Token, expr: Expression):
    match token.token_type:
        case TokenType.PLUS | TokenType.MINUS | ...:
            return BinaryExpression(
                expr,
                token,
                self.parseExpression(getPrecedence(token.token_type)),
            )
        case _:
            ...

While we’re at it, lets also create a new entry point for parsing.

The parser class
# We parse starting from the lowest valid precedence possible. 
# This means that it will only stop once we reach EOF, since Precedence.NONE is lower.
def parse(self) -> Expression:
    return self.parseExpression(Precedence.SUBTRACTION)

Now we can try to parse again.

Evaluating infix with proper priority
Parser("2 * 3 + 3").parse()
# BinaryExpression(
#     lhs=ConstantExpression(expr=2.0), 
#     operator_token=<Token.MULT: 5>,     ---->    (2 * 3)
#     rhs=ConstantExpression(expr=3.0)
# )

Great, now we can prioritize grouping 2 * 3 over 3 + 3. But where is the + 3?

Well, after the first infix is parsed (2 * 3) the context’s precedence level became MULTIPLICATION. The next infix 3 + 3 has the precedence level of ADDITION which is lower than MULTIPLICATION so it isn’t parsed.

The way we can fix this is quite simple, we just have to be able to fall back down to the last context which can parse it.

The parser class
def parseExpression(self, precedence: Precedence) -> Expression:
    # Parse prefix expression...

    # Parse infix expression
    while precedence <= getPrecedence(self.read().token_type): # if --> while
        ...

    return expr
By turning the process of parsing an infix into a while loop, we can now return to the previous context once the current one is terminated.

Note: We will always be able to parse since we can always fallback onto the initial          `parseExpression` context which has the lowest precedence level.

And now we can try to parse our expression again.

Evaluating infix with proper precedence handling
Parser("2 * 3 + 3").parse()
# BinaryExpression(
#     lhs=BinaryExpression(
#         lhs=ConstantExpression(expr=2.0),
#         operator_token=<Token.MULT: 5>, 
#         rhs=ConstantExpression(expr=3.0)    ---->    ((2 * 3) + 3)
#     ), 
#     operator_token=<Token.PLUS: 3>, 
#     rhs=ConstantExpression(expr=3.0)
# )

7. Prefix Precedence

At the moment, we are parsing prefix expressions with the precedence level of Precedence.SUBTRACTION which is the default value of parseExpression. This isn’t ideal because the parseExpression call of parseUnaryExpression will overextend and grab more than it should.

Consider the following example:

Not handling prefix precedence
a = Parser("- 2 + 2").parse()
# UnaryExpression(
#   operator_token=<Token.MINUS: 4>,
#   expr=BinaryExpression(
#           lhs=ConstantExpression(expr=2.0),
#           operator_token=<Token.PLUS: 3>, 
#           rhs=ConstantExpression(expr=2.0)
#       )
#   )

print(a) # -((2 + 2))  <-- Wrong

In order to support proper prefix precedence, we’ll add a new precedence level. Since we know that indices takes precedence over the prefix, we’ll insert it right before Precedence.INDICES.

The precedence table
class Precedence(IntEnum):
    # ...
    DIVISION       = 4
    PREFIX         = 5
    INDICES        = 6
    # ...

Now we refactor parseUnaryExpression to call parseExpression with precedence.PREFIX.

The parser class
def parseUnaryExpression(self, token: Token) -> UnaryExpression:
    return UnaryExpression(token, self.parseExpression(Precedence.PREFIX))

And that’s it! Yep, that’s all we had to do it fix it.

Evaluating with proper prefix precedence
print(Parser("- 2 + 2").parse()) # (-(2) + 2)

8. Grouped Expressions

Here is our final boss. Grouped expressions are exactly as the name implies, they are expressions which are grouped together via parentheses. E.g. -(2+2), (2+3)*3.

Thankfully with the experience we have so far, we are quite well equipped to deal with them.

The first observation we can make is that the ( operator appears before the operand, we know right away that this must be a type of prefix expression. So let’s wire it up.

The parser class
# We wire it here.
def parsePrefixExpression(self, token: Token) -> Expression:
    match token.token_type:
        # ...
        case TokenType.LPAREN:
            return self.parseGroupedExpression()
        # ...

And here comes the implementation.

The parser class
def parseGroupedExpression(self):
    # Parse from the lowest valid precedence. We will consume everything up 
    # until we reach either `)` or `EOF` is reached. (They both return `Precedence.NONE`)
    expr = self.parseExpression(Precedence.SUBTRACTION)

    # After we finish parsing the expression, we expect the current token to be `)`.
    if self.expect(TokenType.RPAREN):
        # We consume the `)` token so the context we jump back to can 
        # continue parsing properly, starting from the next token.
        self.advance()
        return expr
    else:
        print("Error: Missing closing parenthesis")
        return None

This blew my mind when I first saw it.

Here’s a little explanation. Since we know that the expression should be able to parse everything starting from the opening parenthesis up until the closing parenthesis, we have start our precedence at the very bottom (SUBTRACTION!). We know that the parsing process for the expression should only stop once we reach the matching ) token, this means that we just have to set the precedence of the closing parenthesis to NONE (Which is already true thanks to our default case in getPrecedence).

And with that we are officially done with our parser!

Remarks

Writing this was much more difficult than I thought it would be, but it was also quite enjoyable. More than anything, I hope that you’ve found it resourceful.

I apologize if you found the format to be a little annoying to follow. I tend to begin with an incomplete implementation which partially works then slowly correct it, rather than just showing you the correct implementation right away. Doing so requires quite a lot of jumping around and refactoring which might be annoying, but I think that slowly building up to the solution is much better for learning than just giving one.

On a side note, images and figures aren’t properly integrated yet so most of the content is just a wall of text, but you can expect that to change in newer posts.

Thankyou for reading!


Comments