How we ended-up creating our own parser in PHP

In our team, we love experimenting and breaking out of our comfort zone. We know that is the only way to improve! When we realized that the new Tracker Query Langage could effectively respond to businesses’ needs, we knew it was a chance for us to push our own limits, too. We knew that to transform user input into a valid SQL statement, we had to help users write their queries and parse those queries.

So, how did we do it? By relying on two libraries: CodeMirror to deliver the user experience and PEG.js to parse the query.


Building a grammar to parse user input

In Tuleap Trackers, every time a user executes a normal query in a report, we aggregate the criteria with the AND operator and perform the query. The fact that all trackers and reports are customizable makes things pretty complicated under the hood, but the queries have always worked well for users.

But “normal mode” does not cover all use cases. There is no way to use the OR operator, for example, and you cannot use a rolling time window. Users cannot search open artifacts, such as those submitted over the last month and assigned to or submitted by them.

With TQL, this is now possible in “expert mode”:

status = "Open" AND submitted_on >= NOW() - 1m AND (submitted_by =
MYSELF() or assigned_to = MYSELF())

This example shows how powerful TQL is, and the underlying challenges it creates! First, strings need to be parsed to extract comparisons, logical operators, and parentheses. This is basically building a new language, which means that we need a grammar to define it, and a parser based on this grammar.

Let me quote Martin Fowler:

  • You didn't do the compiler class at university and therefore think parsers are scary.
  • You did do the compiler class at university and are therefore convinced that parsers are scary.

I will be the first to admit that parsers are scary. Fortunately, there are theories (which I promise we won’t go into here) and tools to help us generate parsers. As our main language is PHP, we focused our search on a generator able to compute a parser in PHP. No, we didn’t take the regular expressions path! We found old tools like lime-php, lemon-php, and PHP_ParserGenerator. The most recent tool was php-pegjs, which hasn’t see much activity lately, but it is based on PEG.js, which appears to be maintained \o/. After experimenting with the various tools, we chose PEG.js as our new toy.

PEG.js

PEG.js, as its name suggests, is based on PEG and generates parsers in JavaScript. Luckily, the plugin php-pegjs extends pegjs by providing PHP code target instead of JavaScript.

So, once we found the parser generator and the grammar format, this was our workflow:

  1. We built a grammar. Here is an excerpt:
    EqualComparison
        = field:Field _ "=" _ value_wrapper:SimpleExpr {
            return new EqualComparison($field, $value_wrapper);
        }
    …
    _ "whitespace"
        = [ \t\n\r]*
    

    or as a railroad diagram:


  2. We used this grammar to build a parser with php-pegjs. Building the grammar is part of our global yarn build. Of course, the parser generated is not included in the source repository, so as not to clutter the code reviews and to avoid increasing the size of the repository. This parser is generated during Tuleap packaging for production or on the developer environment for testing.
  3. The parser is then used to generate a syntax tree each time the user executes an expert query.
  4. The syntax tree is visited first to validate the query and check invalid fields, invalid values, invalid operators, etc. The tree is then visited to build the final SQL statement that will return the matching artifacts.
  5. So, the parser didn’t turn out to be as scary as we thought! By using a generator, we could focus on building the grammar and using the output. Building our grammar was quite simple once we had a correct operator precedence definition. Thankfully others had addressed this issue before us!

    If you are interested in the architecture used to convert the syntax tree into a valid SQL statement, stay tuned for our next article in this series on TQL!

    Helping users write their queries

    Parsing queries is one thing, but we also needed a way to help users write their queries. As developers, we know that writing code is cumbersome at best without the right tools. Syntax highlighting, which helps distinguish between keywords, values, variables, and other elements, is the bare minimum!

    Query without syntax highlighting:

    status = "Open" AND submitted_on >= NOW() - 1m AND (submitted_by =
    MYSELF() or assigned_to = MYSELF())
    

    Query with syntax highlighting:


    See the difference? Highlighting makes things like missing closing quotes easy to spot. When a closing quote is missing the rest of the query will be considered a string and highlighted as follows:


    Syntax highlighting lets users detect errors as early on as possible, as recommended in the Bastien & Scapin ergonomic criteria (Distinction by format, Immediate Feedback, Error Correction, etc.).

    It is also helpful to users to know which fields they can use in their query. However, the user must use the name of the field, not its label. For example, a field labeled “Original Submission” may have “details” as name; “I Want To” may have “i_want_to” as name. TQL uses the field name, not the label. This makes it harder to write the query as only the tracker administrator knows the name of the field; end users only know the label. Furthermore, not all types of fields are currently supported by TQL.

    To solve this problem, we added a select box next to the query editor to provide the list of supported fields. When the user selects one of the fields (by its label), then its name is added to the query editor. We also added an autocomplete feature (like in Vim, PHPStorm, or Eclipse) that suggests code to the user.


    So, this means that TQL needs a web-based code editor instead of a regular text area. The good news is that we had already included one in Tuleap pull request plugin: CodeMirror. Of course, there are alternatives like AceEditor.

    CodeMirror

    CodeMirror is a powerful code editor that is easy to customize and configure to suit your needs. Syntax highlighting is configured using CodeMirror’s simple mode:

    CodeMirror.defineSimpleMode("tql", {
      start: [
        {
          regex: /"(?:[^\\]|\\.)*?(?:"|$)/, // double quotes
          token: "string"
        },
        {
          regex: /'(?:[^\\]|\\.)*?(?:'|$)/, // single quotes
          token: "string"
        },
        {
          regex: /\d+[dwmy]/i, // Time period
          token: "variable-3"
        },
        {
          regex: /\d+(?:\.\d+)?/i, // Float & integers
          token: "number"
        },
        {
          regex: /(?:and|or)\b/i,
          token: "keyword"
        },
        {
          regex: /(?:now|between|in|not|myself)\b/i,
          token: "variable-2"
        },
        {
          regex: /[=<>!+-]+/,
          token: "operator"
        },
        {
          regex: /[(]/,
          token: "operator",
          indent: true
        },
        {
          regex: /[)]/,
          token: "operator",
          dedent: true
        },
        {
          regex: /[a-zA-Z0-9_]+/,
          token: "variable"
        }
      ]
    });
    

    With Simple Mode, we have just ten syntax highlighting rules, which is sufficient for the current needs of TQL. When TQL grows we might want to write our own mode.

    The autocomplete feature required a bit more work. Nothing difficult, though. You can take a look at the code if you are curious.

    Here is TQL in all its glory:

    Wrap-up

    The feedback from end users on TQL is very positive. This article explained how PEG.js and CodeMirror contributed to TQL’s success. PEG.js was a fairly easy way to get a decent parser; CodeMirror delivered syntax highlighting and autocomplete for a great user experience. These features were not in the initial estimation for the user story, but we rapidly concluded that they were essential. As Charles Eames said:

    “The details are not the details; they make the design.”

    We are glad we decided to get out of our comfort zone and focus on these “details.” And our users are pretty pleased, too!

    Install latest Tuleap Join Online meetings

Share this post

Comments (2)

  • anon

    Regarding Domain Specific Language approach, have you considered graphical DSL? Maybe that's one possibility to explore, and I would suggest having a look at Sirius (http://www.eclipse.org/sirius/).

    Apr 20, 2017
  • anon

    Hi Éric,

    Graphical DSL were on our radar during the first discussions about the subject, but we quickly conclude that we should focus on text-based entry for a first version of our "expert mode" because:

    1. It was the fatest solution to build
    2. It does not prevent to switch to a graphical UI later

    We already know Sirius since we worked with Obeo to develop the Mylyn connector for Tuleap. Our REST API allows to query artifacts with TQL based queries, therefore I think that some investigations can be made to enhance the connector with Sirius in order to build graphically a query...

    Thanks for the suggestion :)

    Apr 20, 2017

Leave a comment

To prevent automated submissions please leave this field empty.

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.