Client-Side SQL Query Parsing with ANTLR

Multiple Queries in the Console Query Editor

Rockset Console’s query editor allows users to type and run queries over collections. Until now, however, whatever was typed in the editor was run and parsed as a single query. This means that, for a user, it wasn’t that easy to switch between multiple queries in our editor. They would have to comment out the queries they didn’t want to run, or keep all their queries in a separate text file and copy over the ones they wanted to run one at a time.

To make switching between multiple queries easier, we decided to allow multiple queries in our editor, split by semicolons, a SQL statement terminator. To achieve this, our editor would have to understand where queries began and ended. Our naive, initial approach was to split the string of the entire text of the editor by semicolons (or perhaps semicolon + new line), and understand each string resulting from this split as a separate query. Of course, this naive approach was not full-proof, because a semicolon could exist within a string or a comment inside a SQL query. We would not want to split on such a semicolon.

Naive splitting on semicolons was not good enough. Our editor would have to understand SQL comments and strings, to know when a semicolon was within one of them, instead of being used to signify the end of a statement. That’s when, in order to implement this more thorough understanding of the editor text, we decided to use ANTLR in the front-end.

ANTLR - A Basic Understanding of SQL Queries

ANTLR is a powerful tool for language parsing. From a specified grammar (a set of rules), ANTLR generates a lexer and parser, which together can build a tree from input (a SQL string in our case), and a listener, which can perform logic while visiting that tree. In fact, we already use ANTLR to parse SQL statements in the back-end, and the tree which we produce from that is used to understand and execute user SQL statements!

Grammar

Let’s start from the beginning. After downloading ANTLR, we need to make the grammar file. As an example, check out Presto’s SQL ANTLR grammar. Rockset’s grammar for query parsing in the back-end is actually a modified version of this Presto grammar. For query splitting in the front-end, though, we are mostly interested in the grammar rules for comments and strings:

STRING
: '\'' ( ~'\'' | '\'\'' )* '\''
;

COMMENT
: SIMPLE_COMMENT | BRACKETED_COMMENT
;

fragment SIMPLE_COMMENT
: '--' ~[\r\n]* '\r'? '\n'?
;

fragment BRACKETED_COMMENT
: '/*' .*? '*/'
;

The crux of what’s happening here is the specification of the sequence of which string characters make the rule, i.e. string or comment. These grammar rules are copied exactly from our back-end grammar file, so we can be sure that the query splitting grammar in the front-end understands comments and strings in the exact same way that our back-end system does. Looking at the rest of our grammar rules:

queriesText
: statement* EOF
;

statement
: ';'* (CHAR | STRING | COMMENT)+ ';'*
;

CHAR
: ~';'
;

We can see that the entire queriesText is just defined as a series of 0 or more statements, and that statements are defined as a mixture of chars (in our case, anything but semicolons), comments, and strings that sit between semicolons. What’s key here is that semicolons can sit inside comments and strings and be part of statements as they should be.

Here is the full grammar file with all the above rules.

Building the Tree

The power of ANTLR is to generate files from this grammar with which we can do our logic.

antlr4 -Dlanguage=JavaScript <GrammarFileName>.g4

is the command used to specify that our target language is JavaScript, and running it generates a host of files for us to use in JavaScript, including a lexer, a parser, and a listener. See this documentation for more details. Let’s look at the function we ultimately wrote using all these files, to understand what they are doing.

import * as antlr4 from 'antlr4';
import { QuerySeparationGrammarLexer } from './QuerySeparationGrammarLexer';
import { QuerySeparationGrammarParser } from './QuerySeparationGrammarParser';
import { QuerySeparationGrammarListener } from './QuerySeparationGrammarListener';
import { CustomListener } from'./CustomListener';
export const SplitQueries = (input) => {
const chars = new antlr4.InputStream(input);
const lexer = new QuerySeparationGrammarLexer(chars);
const tokens = new antlr4.CommonTokenStream(lexer);
const parser = new QuerySeparationGrammarParser(tokens);
parser.buildParseTrees = true;
const tree = parser.queriesText(); // the tree upon which I'll walk
const result = [];
const listener = new CustomListener(result); // custom listener builds on the empty results array
antlr4.tree.ParseTreeWalker.DEFAULT.walk(listener, tree);
return listener.result; // this function returns the start and stop indices.
};
view raw SplitQueries.ts hosted with ❤ by GitHub

Here, we can see that the lexer and parser help us build a tree from string input using our grammar. Take ‘[statement1][statement2][statement3]’ as an example string input, where each [statement] is a section of the string that matches the grammar rule for statement, such as 'abc/*comment;;*/def;'. An input string with three statements would be parsed into a tree that looks like this:

antlr-tree

Every node of the tree is based off of a grammar rule that occurs in the input string, and that node’s children are the components (based off of other grammar rules) that that node contains. The root node is based off of the queriesText grammar rule, which exists in the input string as the whole string. So the root node represents the whole input string. Its children are the components that make up this node. We can see from the queriesText grammar rule, that a queriesText comprises of a series of statements. So in the tree, the children of the queriesText root node are nodes for each of these statements. And likewise, the children of each statement node are that statement's components, i.e. the specific chars, strings, and comments within that statement.

The important thing to note here, is that the statements that exist in the input tree are successfully parsed, based on our grammar, into separate nodes of the tree. The separation of statements within the input string was our initial goal.

The Listener (Walking over the Tree)

Now that we have a workable tree, it’s time to do logic over it. The listener helps us do this. Going back to the SplitQueries function, we can see that we “use” the tree by walking over it with a CustomListener.

import * as antlr4 from 'antlr4';
import { QuerySeparationGrammarLexer } from './QuerySeparationGrammarLexer';
import { QuerySeparationGrammarParser } from './QuerySeparationGrammarParser';
import { QuerySeparationGrammarListener } from './QuerySeparationGrammarListener';
import { CustomListener } from'./CustomListener';
export const SplitQueries = (input) => {
const chars = new antlr4.InputStream(input);
const lexer = new QuerySeparationGrammarLexer(chars);
const tokens = new antlr4.CommonTokenStream(lexer);
const parser = new QuerySeparationGrammarParser(tokens);
parser.buildParseTrees = true;
const tree = parser.queriesText(); // the tree upon which I'll walk
const result = [];
const listener = new CustomListener(result); // custom listener builds on the empty results array
antlr4.tree.ParseTreeWalker.DEFAULT.walk(listener, tree);
return listener.result; // this function returns the start and stop indices.
};
view raw SplitQueries.ts hosted with ❤ by GitHub

The CustomListener is a file we write, inheriting from the generated listener file. The generated base listener provides an enter and exit function for each parsed node that sits in the tree. It is these functions that get executed on the walk of the tree. It works like this:

antlr-tree-walk

During the walk of the tree, entering/exiting each node calls that node’s corresponding enter/exit function in the listener, in our case, of CustomListener (the listener with which we walk over the tree with). The enter/exit functions in the generated listener base class are empty and don’t do anything, so if we want to execute logic upon entering or exiting a node during our walk, we need to override these functions in CustomListener, and our custom functions will be executed.

The only function we need to override in the CustomListener is the exitStatement function, the function called when exiting a statement node in the tree. Remember, a statement node in the tree exactly represents a statement within our input text, thanks to our grammar. Upon exiting a statement node, we want to get the location of that statement in the input text string. Luckily, the listener function takes as an argument the ctx of a node, which contains a lot of information, but specifically useful for us, the ctx contains the start and stop indices of that node in the input string.

And so, the implementation of the exitStatement function becomes really simple: record and store the start and stop indices of each statement upon exiting it, so that at the end of the walk, we have the set of all start and stop indices, telling us where each query statement begins and ends in the input string.

Here is our CustomListener, with the exitStatement function:

var QuerySeparationGrammarListener = require('./QuerySeparationGrammarListener').QuerySeparationGrammarListener;
var CustomListener = function(result) {
this.result = result;
QuerySeparationGrammarListener.call(this); // inherit default listener
return this;
};
// continue inheriting default listener
CustomListener.prototype = Object.create(QuerySeparationGrammarListener.prototype);
CustomListener.prototype.constructor = CustomListener;
// override default listener behavior
CustomListener.prototype.exitStatement = function(ctx) {
this.result.push([ctx.start.start, ctx.stop.stop]); // storing the start and stop indices of every statement
};
exports.CustomListener = CustomListener;
view raw CustomListener.js hosted with ❤ by GitHub

The SplitQueries Function

In conclusion of our use of ANTLR, let’s take one last look at the SplitQueries function. First, we build the grammar defining simple SQL statements and some basic components within them. Then, from this grammar, we generate a lexer, parser, and listener. In our SplitQueries function, we take a string input and use the lexer and parser to build a tree out of it, using the grammar rules. Still inside the SplitQueries function, we then perform a walk over the tree using our customListener, which records the start and stop indices of each statement. SplitQueries returns this set of indices.

From our input text, we get an output of index locations of our separate SQL statements, SQL statements that end with a semicolon, but can contain them within their comments and strings, as defined by the ANTLR grammar.

Selecting User Query Using SplitQueries and Cursor Position

With the ANTLR grammar and parsing capabilities packaged into the SplitQueries function, it is time to conclude this project by using this functionality to select a single user query out of a text of many queries in the console's query editor. What we do for this is:

  1. Use SplitQueries on the user’s input text to get the start and stop indices of each query in that text.
  2. Select the query whose start and stop indices contain the user’s cursor position, meaning we choose the query within which the user has clicked. (In some special cases, like selection of a blank line, we select a neighboring query instead.)

By doing this, we are able to select a user query, separated by semicolons, even if that query has semicolons within its strings or comments, all because SplitQueries is informed by our ANTLR grammar rules when it calculates the query boundaries.

Conclusion

We are excited about our incorporation of rigorous grammar parsing using ANTLR in the front-end. Try typing multiple queries in Rockset Console to see how the feature works! With ANTLR now baked into our front-end, we also look forward to doing different kinds of parsing in the console.

Resources for ANTLR in JavaScript:

Here is a copy of the ANTLR/js code used in Rockset Console. This includes the grammar, generated files, CustomListener, and SplitQueries function. Feel free to tinker with the grammar and regenerate files using:

antlr4 -Dlanguage=JavaScript QuerySeparationGrammar.g4

You can run the SplitQueries script on your own string inputs!

Real-time SQL on raw data

Related Posts

How We Reduced DynamoDB Costs by Using DynamoDB Streams and Scans More Efficiently

Get an inside look at the some of the techniques we used to reduce the cost of ingesting data from DynamoDB.

The Kafka Connect Plugin for Rockset and How It Works

Get an in-depth look at the Kafka Connect Plugin for Rockset and the process to get it listed in Confluent Hub.

Optimizing Bulk Load in RocksDB

What’s the fastest we can load data into RocksDB?