This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:
- When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
- Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm -- we believe that it improves modularity and makes both grammars and semantics easier to understand.
- Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
- Ohm does not (yet) support semantic predicates.
Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks:
- Rule applications can accept parameters. This makes it possible to write higher-order rules, such as the built-in
ListOf
rule. - Grammars can be extended in an object-oriented way -- see Defining, Extending, and Overriding Rules.
- Object and array patterns allow grammars to match structured data.
Arithmetic {
Expr = "1 + 1"
}
This is a grammar named "Arithmetic", which has a single rule named "Expr". The right hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.
Here is a full list of the different kinds of parsing expressions supported by Ohm:
These are the fundamental building blocks of Ohm grammars.
"hello there"
Matches exactly the characters contained inside the quotation marks.
Special characters ("
, \
, and '
) can be escaped with a backslash -- e.g., "\""
will match a literal quote character in the input stream. Other valid escape sequences are: \b
(backspace), \f
(form feed), \n
(line feed), \r
(carriage return), and \t
(tab).
-42
Matches a positive or negative integer value.
true
: Matches the boolean value true
.
false
: Matches the boolean value false
.
null
: Matches a null value (or the equivalent in the host language).
ruleName
Matches the body of the rule named ruleName. For example, the built-in rule letter
will parse a string of length 1 that is a letter.
ruleName<expr>
Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. ListOf<field, ";">
.
expr *
Matches the expression expr repeated 0 or more times. E.g., "a"*
will match ''
, 'a'
, 'aa'
, ...
Inside a syntactic rule -- any rule whose name begins with an upper-case letter -- spaces before a match are automatically skipped. E.g., "a"*
will match " a a"
as well as "aa"
. See the documentation on syntactic and lexical rules for more information.
expr +
Matches the expression expr repeated 1 or more times. E.g., letter+
will match 'x'
, 'xA'
, ...
As with the *
operator, spaces are skipped when used in a syntactic rule.
expr ?
Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.
expr1 expr2
Matches the expression expr1
followed by expr2
. E.g., "grade" letter
will match 'gradeA'
, 'gradeB'
, ...
As with the *
and +
operators, spaces are skipped when used in a syntactic rule. E.g., "grade" letter
will match ' grade A'
as well as 'gradeA'
.
expr1 | expr2
Matches the expression expr1
, and if that does not succeed, matches the expression expr2
. E.g., letter | number
will match 'a'
, '9'
, ...
& expr
Succeeds if the expression expr
can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. letter &number
will match 'a9'
, but only consume 'a'. &"a" letter+
will match any string of letters that begins with 'a'.
~ expr
Succeeds if the expression expr
cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., ~"\n" any
will consume any single character that is not a new line character.
[ expr ]
Matches an Array object whose contents match expr. E.g., ["hey"]
will match an Array having the string 'hey'
as its only element.
If expr is a Sequence, it will match successive elements in the Array, beginning at index 0. E.g., ["a" "b" "c"]
will match the array ['a', 'b', 'c']
.
{ key: expr }
Matches an object with an own property named key whose value matches the parsing expression expr, and no other properties. E.g., {"name": any}
matches the object {name: 'Manuel'}
, but not the object {name: 'Philip', age: 31}
.
{ key: expr, ... }
Like above, but will still match if the object has other properties. E.g., {stars: 4, ...}
will match the object {stars: 2, name: 'Noma'}
.
Any number of comma-separated key/expression pairs can be specified. Other valid patterns are {}
, which matches an object with no properties, and {...}
, which matches any object. NOTE: In Ohm/JS, object patterns will also match Array objects.
(See src/built-in-rules.ohm.)
any
: Matches a single item from the input stream. For a string, it will match any one character.
letter
: Matches a single character which is a letter (either uppercase or lowercase).
lower
: Matches a single lowercase letter.
upper
: Matches a single uppercase letter.
digit
: Matches a single character which is a digit from 0 to 9.
hexDigit
: Matches a single character which is a either digit or a letter from A-F.
alnum
: Matches a single letter or digit; equivalent to letter | digit
.
space
: Matches a single whitespace character (e.g., space, tab, newline, etc.)
end
: Matches the end of the input stream. Equivalent to ~any
.
ListOf<elem, sep>
: Matches the expression elem zero or more times, separated by something that matches the expression sep. E.g., ListOf<letter, ",">
will match ''
, 'a'
, and 'a, b, c'
. Additionally there is NonemptyListOf<elem, sep>
that matches elem at least one time.
listOf<elem, sep>
: Similar to ListOf<elem, sep>
but interpreted as lexical rule.
grammarName <: supergrammarName { ... }
Declares a grammar named grammarName
which inherits from supergrammarName
.
ruleName = expr
Defines a new rule named ruleName
in the grammar, with the parsing expression expr
as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.
ruleName := expr
Defines a rule named ruleName
, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.
ruleName += expr
Extends a supergrammar rule named ruleName
, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be expr | oldBody
, where oldBody
is the rule body as defined in the supergrammar.
Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:
ident (an identifier)
= ~keyword name
A syntactic rule is a rule whose name begins with an uppercase letter, and lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.
For the purposes of a syntactic rule, a "whitespace character" is anything that matches its enclosing grammar's "space" rule. The default implementation of "space" matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.
expr -- caseName
When a parsing expression is followed by the characters --
and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:
AddExp = AddExp "+" MulExp -- plus
| MulExp
is equivalent to:
AddExp = AddExp_plus
| MulExp
AddExp_plus = AddExp "+" MulExp
TODO