When scanning a source file, the scanner (or lexer) needs to distinguish reserved keywords, such as var, for, or, and, and class, from variable names like foo and bar. While working my way through Crafting Interpreters, I stumbled upon a simple optimization that uses tries implemented with a switch statement instead of hash maps. I wanted to verify whether this really is faster.
I’ve included relevant code listings below. You can see the full repository here: https://github.com/MichalPitr/hashmap_vs_trie_c
Here’s the setup. We have a sliding window scanner that reads the source code and emits Tokens which are easier to work with. To illustrate, the source code
var name = "Michael";
would be converted to the following tokens:
TOKEN_VAR, TOKEN_IDENTIFIER, TOKEN_EQUAL, TOKEN_STRING, TOKEN_SEMICOLON
Each token would have associated metadata, like start and length so that we can retrieve the literal string from the source code.
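To make this concrete, a token in clox is roughly the following struct (the exact fields in my repository may differ slightly):
typedef struct {
    TokenType type;     // Kind of token, e.g. TOKEN_VAR or TOKEN_IDENTIFIER.
    const char* start;  // Pointer into the source buffer where the lexeme begins.
    int length;         // Length of the lexeme in characters.
    int line;           // Line number, for error reporting.
} Token;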
When the scanner sees a lexeme (a sequence of characters), it decides what Token it is: { maps to TOKEN_LEFT_BRACE, == maps to TOKEN_EQUAL_EQUAL, and so on.
How does it go about differentiating keywords from identifiers, such as variable names or function names? The rest of this write-up compares two approaches: hash tables and tries.
Hash Tables
Let’s start with hash tables (HashMap in Java, dictionary in Python, unordered_map in C++) since they are the intuitive go-to solution. I use a simple hash table with separate chaining. When the scanner starts, we can create a hash table that maps keywords to their TokenType. Then, whenever the scanner finds a lexeme that is either a keyword or an identifier, it looks the lexeme up in the hash table.
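Populating the table at startup might look roughly like this; hashMapPut is a hypothetical insertion helper, the actual name in the repository may differ:
void initKeywords(HashMap* map) {
    // hashMapPut is assumed to insert a key of the given length
    // mapping to the given TokenType.
    hashMapPut(map, "and", 3, TOKEN_AND);
    hashMapPut(map, "class", 5, TOKEN_CLASS);
    hashMapPut(map, "else", 4, TOKEN_ELSE);
    // ... and so on for the remaining keywords: false, for, fun, if,
    // nil, or, print, return, super, this, true, var, while.
}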
The listing below shows a function TokenType identifierTypeUsingHashMap() that returns the TokenType of the current lexeme pointed to by scanner.start. If the lexeme isn’t a reserved keyword, it returns TOKEN_IDENTIFIER.
TokenType identifierTypeUsingHashMap() {
    int length = scanner.current - scanner.start;
    TokenType type = hashMapGet(&scanner.keywords, scanner.start, length);
    return type;
}
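For context, the global scanner state referenced above looks roughly like this; I’ve reconstructed the layout from how the fields are used in the listings, so the keywords field is my assumption for the hash-map variant:
typedef struct {
    const char* start;    // Start of the lexeme currently being scanned.
    const char* current;  // Character the scanner is currently looking at.
    int line;             // Current line, for error reporting.
    HashMap keywords;     // Keyword table (hash-map variant only, assumed).
} Scanner;

Scanner scanner;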
Let’s have a look at TokenType hashMapGet() to understand how it works:
1. It hashes the key, which requires a full traversal of the string.
2. It looks up the corresponding bucket in the hash table.
3. Because of possible collisions, it checks whether the lexeme matches the node’s key. This again requires a full traversal of the string. If there’s a match, it returns the keyword’s TokenType.
4. If the keys don’t match, it checks the next node in the bucket’s linked list.
5. If it reaches the end of the linked list, we know that the key is an identifier, not a keyword.
typedef struct HashNode {
    char* key;
    int keyLength;
    TokenType value;
    struct HashNode* next;
} HashNode;

typedef struct {
    HashNode* table[HASHMAP_CAPACITY];
} HashMap;

TokenType hashMapGet(HashMap* map, const char* key, int length) {
    // hash() is assumed to return a bucket index already reduced
    // modulo HASHMAP_CAPACITY.
    uint32_t index = hash(key, length);
    HashNode* node = map->table[index];
    while (node) {
        // Compare lengths first to skip the expensive string comparison
        // when the keys can't possibly match.
        if (node->keyLength == length && strncmp(node->key, key, length) == 0) {
            return node->value;
        }
        // Collision: try the next node in this bucket's chain.
        node = node->next;
    }
    // Not in the table, so the lexeme is a user-defined identifier.
    return TOKEN_IDENTIFIER;
}
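The hash() function itself isn’t listed. A minimal sketch of what it could look like, assuming FNV-1a (the hash function clox uses for its runtime tables) with the result reduced modulo the table capacity:
static uint32_t hash(const char* key, int length) {
    // FNV-1a: XOR in each byte, then multiply by the FNV prime.
    uint32_t h = 2166136261u;
    for (int i = 0; i < length; i++) {
        h ^= (uint8_t)key[i];
        h *= 16777619;
    }
    // Reduce the 32-bit hash to a bucket index.
    return h % HASHMAP_CAPACITY;
}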
This seems reasonably fast, especially if the hash function distributes keys uniformly across the table, which keeps the chains short. We can control this through the load factor (entries/capacity): for example, 16 keywords in a table with 64 buckets give a load factor of 0.25.
We are traversing the string multiple times. But consider the lexeme “formula”: once the scanner has read “form”, it already knows that no keyword starts with that prefix. Can we leverage this? That’s exactly the approach Crafting Interpreters takes.
Tries
Tries are a type of n-ary tree that can be used to efficiently represent a set of words. Any path from the root to a leaf constitutes a valid word in the vocabulary. Here’s an illustration for the following keywords: “false”, “for”, and “fun”.
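In text form, the trie for these three keywords looks roughly like this:
Start
 └─ f
    ├─ a ─ l ─ s ─ e    ("false")
    ├─ o ─ r            ("for")
    └─ u ─ n            ("fun")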
Let’s represent all keywords with a trie and implement it with a switch statement. It’s pretty simple:
1. Check whether the first letter of the current lexeme is also the first letter of any keyword. This is equivalent to starting from “Start” in the tree diagram above. If not, the scanner immediately knows that the lexeme is an identifier, without any extra work.
2. If the first character belongs to some keyword, it further narrows down which keyword it could be, or compares the rest of the string if there’s only one candidate left.
static TokenType checkKeyword(int start, int length, const char* rest, TokenType type) {
    // The lexeme matches a keyword only if the lengths agree and the
    // remaining characters are identical.
    if (scanner.current - scanner.start == start + length &&
        memcmp(scanner.start + start, rest, length) == 0) {
        return type;
    }
    return TOKEN_IDENTIFIER;
}

static TokenType identifierType() {
    switch (scanner.start[0]) {
        case 'a': return checkKeyword(1, 2, "nd", TOKEN_AND);
        case 'c': return checkKeyword(1, 4, "lass", TOKEN_CLASS);
        case 'e': return checkKeyword(1, 3, "lse", TOKEN_ELSE);
        case 'f':
            // Several keywords start with 'f'; branch on the second character.
            if (scanner.current - scanner.start > 1) {
                switch (scanner.start[1]) {
                    case 'a': return checkKeyword(2, 3, "lse", TOKEN_FALSE);
                    case 'o': return checkKeyword(2, 1, "r", TOKEN_FOR);
                    case 'u': return checkKeyword(2, 1, "n", TOKEN_FUN);
                }
            }
            break;
        case 'i': return checkKeyword(1, 1, "f", TOKEN_IF);
        case 'n': return checkKeyword(1, 2, "il", TOKEN_NIL);
        case 'o': return checkKeyword(1, 1, "r", TOKEN_OR);
        case 'p': return checkKeyword(1, 4, "rint", TOKEN_PRINT);
        case 'r': return checkKeyword(1, 5, "eturn", TOKEN_RETURN);
        case 's': return checkKeyword(1, 4, "uper", TOKEN_SUPER);
        case 't':
            if (scanner.current - scanner.start > 1) {
                switch (scanner.start[1]) {
                    case 'h': return checkKeyword(2, 2, "is", TOKEN_THIS);
                    case 'r': return checkKeyword(2, 2, "ue", TOKEN_TRUE);
                }
            }
            break;
        case 'v': return checkKeyword(1, 2, "ar", TOKEN_VAR);
        case 'w': return checkKeyword(1, 4, "hile", TOKEN_WHILE);
    }
    return TOKEN_IDENTIFIER;
}
checkKeyword() compares the rest of the two strings with a single call to memcmp.
This approach has several nice properties. For many identifiers, the trie can tell that the lexeme is not a keyword from the first character alone. In comparison, the hash table approach has to hash the entire lexeme every time.
Glue
The driving code doesn’t do anything useful: it simply loads a source file and scans it. Normally the scanner would hand the Tokens off to a parser/compiler, but here I just wanted to focus on the scanner’s performance.
It reads the source file into memory and hands it off to the scanner. We time how long the scanner takes to finish scanning.
static void runFile(const char* path) {
    char* source = readFile(path);
    initScanner(source);
    // Prime the scanner with the first token so the loop condition
    // below never reads an uninitialized token.
    Token token;
    advance(&token);
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    // Scans the rest of the source file.
    while (token.type != TOKEN_EOF) {
        advance(&token);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     ((end.tv_nsec - start.tv_nsec) / 1e9);
    printf("%f\n", elapsed);
    free(source);
}
int main(int argc, const char* argv[]) {
    if (argc == 2) {
        runFile(argv[1]);
    } else {
        fprintf(stderr, "Usage: clox [path]\n");
        exit(64);
    }
    return 0;
}
For completeness, the advance() function just scans the next token. scanToken() is another switch statement that, given a character (with some lookahead), decides which token it is. identifier() is what decides whether a lexeme is a keyword or an identifier; a sketch of it follows the listings below.
Token scanToken() {
    skipWhitespace();
    scanner.start = scanner.current;
    if (isAtEnd()) return makeToken(TOKEN_EOF);
    char c = advance();
    // identifier() uses either the hash map or trie to determine the type
    if (isAlpha(c)) return identifier();
    if (isDigit(c)) return number();
    switch (c) {
        case '(': return makeToken(TOKEN_LEFT_PAREN);
        case ')': return makeToken(TOKEN_RIGHT_PAREN);
        case '{': return makeToken(TOKEN_LEFT_BRACE);
        case '}': return makeToken(TOKEN_RIGHT_BRACE);
        case ';': return makeToken(TOKEN_SEMICOLON);
        case ',': return makeToken(TOKEN_COMMA);
        case '.': return makeToken(TOKEN_DOT);
        case '-': return makeToken(TOKEN_MINUS);
        case '+': return makeToken(TOKEN_PLUS);
        case '*': return makeToken(TOKEN_STAR);
        case '/': return makeToken(TOKEN_SLASH);
        case '!':
            return makeToken(
                match('=') ? TOKEN_BANG_EQUAL : TOKEN_BANG);
        case '=':
            return makeToken(
                match('=') ? TOKEN_EQUAL_EQUAL : TOKEN_EQUAL);
        case '<':
            return makeToken(
                match('=') ? TOKEN_LESS_EQUAL : TOKEN_LESS);
        case '>':
            return makeToken(
                match('=') ? TOKEN_GREATER_EQUAL : TOKEN_GREATER);
        case '"': return string();
    }
    return errorToken("Unexpected character.");
}
static void advance(Token* token) {
    // Keep scanning until we get a non-error token.
    for (;;) {
        *token = scanToken();
        if (token->type != TOKEN_ERROR) break;
    }
}
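The identifier() function isn’t listed above. Following clox, it is roughly a loop that consumes the rest of the lexeme and then dispatches to whichever lookup variant was compiled in (the advance() inside it is the scanner’s character-level helper used in scanToken(), not the token-level advance() above):
static Token identifier() {
    // Consume the remaining alphanumeric characters of the lexeme.
    while (isAlpha(peek()) || isDigit(peek())) advance();
    // identifierType() is either the trie or the hash-map implementation.
    return makeToken(identifierType());
}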
Benchmark
I compile the scanner with gcc -O3 to enable optimizations. I then run the scanner against source files of varying sizes, 500 times each, to measure its performance. I recompile and re-run once with the hash map implementation and once with the trie implementation.
Since we are just scanning, I composed a ~1k-line source file from different unit tests. I then created several versions of this source file by copying it consecutively size times, i.e. size 128 corresponds to ~128k lines of code.
import pickle
import statistics
import subprocess
from collections import defaultdict

def execute_command(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
    return float(result.stdout.strip())

def main():
    results = defaultdict(list)
    for size in [1, 2, 4, 8, 16, 32, 64, 128]:
        for _ in range(500):
            command = ["./a.out", f"code{size}.clox"]
            output = execute_command(command)
            results[size].append(output)
        print(f"size {size}")
        mean = statistics.mean(results[size])
        std_dev = statistics.stdev(results[size])
        print(f"Mean: {mean}")
        print(f"Standard Deviation: {std_dev}")
        print("")
    with open("results_hashmap.pkl", "wb") as file:
        pickle.dump(results, file)

if __name__ == "__main__":
    main()
Results
Plotting the results gives the figure below. Vertical bars indicate standard deviation.
As expected, both approaches scale linearly with the size of the source file.
Tries are on average faster by about 30%. That’s a pretty sizeable difference, especially since modern IDEs re-run the scanner and parser whenever the user makes changes, to provide syntax highlighting and type hints.
In the test source file, I used identifier names of fairly standard length. If I had used very long ones, the trie implementation would show even better results, since it can avoid unnecessarily traversing the full string.
Since the trie implementation is so simple, there’s not much of a reason not to use it.
Thanks for reading! If you liked this, consider subscribing and/or following me on LinkedIn. I like to understand the inner workings of software systems. Whenever I learn something interesting, I compile it into one of these blog posts!
Want to read another one? Consider checking out one of my other posts!