Skip to content

Latest commit

 

History

History
233 lines (133 loc) · 10.7 KB

syntax-reference.md

File metadata and controls

233 lines (133 loc) · 10.7 KB

Syntax Reference

This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:

  • When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
  • Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm -- we believe that it improves modularity and makes both grammars and semantics easier to understand.
  • Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
  • Ohm does not (yet) support semantic predicates.

Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks:

Terminology

<script type="text/markscript"> var ohm = require('ohm-js'); function checkGrammar(source) { assert(ohm.grammar(source)); return ''; } markscript.transformNextBlock(checkGrammar); </script>
Arithmetic {
  Expr = "1 + 1"
}

This is a grammar named "Arithmetic", which has a single rule named "Expr". The right hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.

Parsing Expressions

Here is a full list of the different kinds of parsing expressions supported by Ohm:

Terminals

These are the fundamental building blocks of Ohm grammars.

String literal

"hello there"

Matches exactly the characters contained inside the quotation marks.

Special characters (", \, and ') can be escaped with a backslash -- e.g., "\"" will match a literal quote character in the input stream. Other valid escape sequences are: \b (backspace), \f (form feed), \n (line feed), \r (carriage return), and \t (tab).

Number literal

-42

Matches a positive or negative integer value.

Keywords

true: Matches the boolean value true.

false: Matches the boolean value false.

null: Matches a null value (or the equivalent in the host language).

Rule Application

ruleName

Matches the body of the rule named ruleName. For example, the built-in rule letter will parse a string of length 1 that is a letter.

ruleName<expr>

Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. ListOf<field, ";">.

Repetition operators

expr *

Matches the expression expr repeated 0 or more times. E.g., "a"* will match '', 'a', 'aa', ...

Inside a syntactic rule -- any rule whose name begins with an upper-case letter -- spaces before a match are automatically skipped. E.g., "a"* will match " a a" as well as "aa". See the documentation on syntactic and lexical rules for more information.

expr +

Matches the expression expr repeated 1 or more times. E.g., letter+ will match 'x', 'xA', ...

As with the * operator, spaces are skipped when used in a syntactic rule.

expr ?

Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.

Sequence

expr1 expr2

Matches the expression expr1 followed by expr2. E.g., "grade" letter will match 'gradeA', 'gradeB', ...

As with the * and + operators, spaces are skipped when used in a syntactic rule. E.g., "grade" letter will match ' grade A' as well as 'gradeA'.

Alternation

expr1 | expr2

Matches the expression expr1, and if that does not succeed, matches the expression expr2. E.g., letter | number will match 'a', '9', ...

Lookahead

& expr

Succeeds if the expression expr can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. letter &number will match 'a9', but only consume 'a'. &"a" letter+ will match any string of letters that begins with 'a'.

Negative Lookahead

~ expr

Succeeds if the expression expr cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., ~"\n" any will consume any single character that is not a new line character.

Arrays

[ expr ]

Matches an Array object whose contents match expr. E.g., ["hey"] will match an Array having the string 'hey' as its only element.

If expr is a Sequence, it will match successive elements in the Array, beginning at index 0. E.g., ["a" "b" "c"] will match the array ['a', 'b', 'c'].

<script type="text/markscript"> assert(ohm.grammar('G { start = ["a" "b" "c"] }').match(['a', 'b', 'c']).succeeded()); assert(ohm.grammar('G { start = ["a" "b" "c"] }').match(['a', 'b', 'c', 'd']).failed()); assert(ohm.grammar('G { start = ["a" "bc"] }').match(['a', 'b', 'c', 'd']).failed()); </script>

Objects

{ key: expr }

Matches an object with an own property named key whose value matches the parsing expression expr, and no other properties. E.g., {"name": any} matches the object {name: 'Manuel'}, but not the object {name: 'Philip', age: 31}.

{ key: expr, ... }

Like above, but will still match if the object has other properties. E.g., {stars: 4, ...} will match the object {stars: 2, name: 'Noma'}.

<script type="text/markscript"> assert(ohm.grammar('G { start = {"name": any} }').match({name: 'Manuel'}).succeeded()); assert(ohm.grammar('G { start = {"name": any} }').match({name: 'Philip', age: 31}).failed()); assert(ohm.grammar('G { start = {stars: 2, ...} }').match({stars: 2, name: 'Noma'}).succeeded()); </script>

Any number of comma-separated key/expression pairs can be specified. Other valid patterns are {}, which matches an object with no properties, and {...}, which matches any object. NOTE: In Ohm/JS, object patterns will also match Array objects.

Built-in Rules

(See src/built-in-rules.ohm.)

any: Matches a single item from the input stream. For a string, it will match any one character.

letter: Matches a single character which is a letter (either uppercase or lowercase).

lower: Matches a single lowercase letter.

upper: Matches a single uppercase letter.

digit: Matches a single character which is a digit from 0 to 9.

hexDigit: Matches a single character which is a either digit or a letter from A-F.

alnum: Matches a single letter or digit; equivalent to letter | digit.

space: Matches a single whitespace character (e.g., space, tab, newline, etc.)

end: Matches the end of the input stream. Equivalent to ~any.

ListOf<elem, sep>: Matches the expression elem zero or more times, separated by something that matches the expression sep. E.g., ListOf<letter, ","> will match '', 'a', and 'a, b, c'. Additionally there is NonemptyListOf<elem, sep> that matches elem at least one time.

listOf<elem, sep>: Similar to ListOf<elem, sep> but interpreted as lexical rule.

Grammar Syntax

Grammar Inheritance

grammarName <: supergrammarName { ... }

Declares a grammar named grammarName which inherits from supergrammarName.

Defining, Extending, and Overriding Rules

ruleName = expr

Defines a new rule named ruleName in the grammar, with the parsing expression expr as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.

ruleName := expr

Defines a rule named ruleName, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.

ruleName += expr

Extends a supergrammar rule named ruleName, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be expr | oldBody, where oldBody is the rule body as defined in the supergrammar.

Rule Descriptions

Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:

ident (an identifier) = ~keyword name

Syntactic vs. Lexical Rules

A syntactic rule is a rule whose name begins with an uppercase letter, and lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.

For the purposes of a syntactic rule, a "whitespace character" is anything that matches its enclosing grammar's "space" rule. The default implementation of "space" matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.

Inline Rule Declarations

expr -- caseName

When a parsing expression is followed by the characters -- and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:

AddExp = AddExp "+" MulExp  -- plus
       | MulExp

is equivalent to:

AddExp = AddExp_plus
       | MulExp
AddExp_plus = AddExp "+" MulExp

Parameterized Rules

TODO