Documentation
Table of Contents
- Installation
- Supported Runtimes
- Generating a Parser
- Using the Parser
- Grammar Syntax and Semantics
- Peggy Identifiers
- Error Messages
- Locations
- Plugins API
- Compatibility
Installation
Note: When you pre-generate a parser from your grammar using the Peggy Command Line Interface, no runtime is required, and Peggy can be a development-only dependency.
Node.js
To use the peggy command:
$ npm install --save-dev peggy
$ npx peggy --help
In your package.json file, you can do something like:
{
"scripts": {
"parser": "peggy -o lib/parser.js --format es src/parser.peggy"
}
}
Browser
NOTE: For most uses of Peggy, use the command line version at build time, outputting the generated parser to a static JavaScript file that you can import later as needed. The API is most useful for tooling that needs to process user-edited grammar source, such as the online Peggy editor. Generating the parser at runtime can be much slower than executing pre-generated code.
The easiest way to use Peggy from the browser is to pull the latest version from a CDN. Either of these should work:
<script src="https://unpkg.com/peggy"></script>
<script src="https://cdn.jsdelivr.net/npm/peggy"></script>
Both of those CDNs support pinning a version number rather than always taking the latest. Not only is that good practice, it will save several redirects, improving performance. See their documentation for more information.
When your document is done loading, there will be a global peggy object.
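For example, once the script has loaded you can generate and run a parser directly in the page. This is just a minimal sketch; the grammar is illustrative:
<script src="https://unpkg.com/peggy"></script>
<script>
  // "peggy" is available as a global once the script above has loaded.
  const parser = peggy.generate("start = 'a'+");
  console.log(parser.parse("aaa")); // [ 'a', 'a', 'a' ]
</script>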
Supported Runtimes
Browsers and JS runtimes that pass the following Browserslist query at the time of release are supported:
defaults, maintained node versions, not op_mini all
Opera Mini can't be bothered to implement URL, of all things, so it's not worth our time to support it.
Deno issues will be fixed if possible, but we are not currently testing on Deno.
All versions of Internet Explorer are EXPLICITLY unsupported, for both generating and running generated parsers.
Generating a Parser
Peggy generates a parser from a grammar that describes the expected input and can specify what the parser returns (using semantic actions on matched parts of the input). The generated parser itself is a JavaScript object with a small API.
Command Line
To generate a parser from your grammar, use the peggy command:
$ npx peggy arithmetics.pegjs
This writes parser source code into a file with the same name as the grammar file but with a “.js” extension. You can also specify the output file explicitly:
$ npx peggy -o arithmetics-parser.js arithmetics.pegjs
If you omit both input and output file, standard input and standard output are used.
If you specify multiple input files, they will be folded together in the order specified before generating a parser. If generating the "es" format, import statements in the top-level initializers from each of the inputs will be moved to the top of the generated code in reverse order of the inputs, and all other top-level initializers will be inserted directly after those imports, also in reverse order of the inputs. This approach can be used to keep libraries of often-used grammar rules in separate files.
By default, the generated parser is in the commonjs module format. You can override this using the --format option.
You can tweak the generated parser with several options:
--allowed-start-rules <rules>
- Comma-separated list of rules the parser will be allowed to start parsing from. Use '*' if you want any rule to be allowed as a start rule. (default: only the first rule in the grammar)
--ast
- Output the internal AST representation of the grammar after all optimizations instead of the parser source code. Useful for plugin authors to see how their plugin changes the AST. This option cannot be mixed with the -t/--test, -T/--test-file, and -m/--source-map options.
--cache
- Makes the parser cache results, avoiding exponential parsing time in pathological cases but making the parser slower.
-d, --dependency <[name:]module>
- Makes the parser require a specified dependency (can be specified multiple times). A variable name for the import/require/etc. may be given, followed by a colon. If no name is given, the module name will also be used for the variable name.
-D, --dependencies <json>
- Dependencies, in JSON object format with variable:module pairs. (Can be specified multiple times.)
--dts
- Generate a .d.ts file next to the output .js file containing TypeScript types for the generated parser. See Generating TypeScript Types for more information.
-e, --export-var <variable>
- Name of a global variable into which the parser object is assigned when no module loader is detected.
--extra-options <options>
- Additional options (in JSON format, as an object) to pass to peggy.generate.
-c, --extra-options-file <file>
- File with additional options (in JSON format, as an object) to pass to peggy.generate.
--format <format>
- Format of the generated parser: amd, commonjs, globals, umd, or es (default: commonjs).
-o, --output <file>
- File to send output to. Defaults to the input file name with the extension changed to .js, or stdout if no input file is given.
--plugin
- Makes Peggy use a specified plugin (can be specified multiple times).
-m, --source-map <file>
- Generate a source map. If the name is not specified, the source map will be named "<input_file>.map" if the input is a file and "source.map" if the input is standard input. If the special filename inline is given, the source map will be embedded in the output file as a data URI. If the filename is prefixed with hidden:, no mapping URL will be included, so that the mapping can be specified with an HTTP SourceMap: header. This option conflicts with the -t/--test and -T/--test-file options unless -o/--output is also specified.
--return-types <JSON object>
- If --dts is specified, the typeInfo provided will be used to specify the return type of the given rules. typeInfo should be specified as a JSON object whose keys are rule names and whose values are strings containing the return type for that rule. See Generating TypeScript Types for more information.
-S, --start-rule <rule>
- When testing, use this rule as the start rule. Automatically added to the allowedStartRules.
-t, --test <text>
- Test the parser with the given text, outputting the result of running the parser against this input. If the input to be tested is not parsed, the CLI will exit with code 2.
-T, --test-file <file>
- Test the parser with the contents of the given file, outputting the result of running the parser against this input. If the input to be tested is not parsed, the CLI will exit with code 2.
--trace
- Makes the parser trace its progress.
-w, --watch
- Watch the input file for changes, generating the output once at the start, and again whenever the file changes.
-v, --version
- Output the version number.
-h, --help
- Display help for the command.
If you specify options using -c <file> or --extra-options-file <file>, you will need to ensure you are using the correct types. In particular, you may specify "plugin" as a string, or "plugins" as an array of objects that have a use method. Always use the long (two-dash) form of the option, without the dashes, as the key. Options that contain internal dashes should be specified in camel case. You may also specify an "input" field instead of using the command line. For example:
// config.js or config.cjs
module.exports = {
allowedStartRules: ["foo", "bar"],
format: "umd",
exportVar: "foo",
input: "fooGrammar.peggy",
plugins: [require("./plugin.js")],
testFile: "myTestInput.foo",
trace: true,
};
You can test the generated parser immediately if you specify the -t/--test or -T/--test-file option. This option conflicts with the --ast option, and also conflicts with the -m/--source-map option unless -o/--output is also specified.
The CLI will exit with the code:
- 0: if successful
- 1: if you supply incorrect or conflicting parameters
- 2: if you specified the -t/--test or -T/--test-file option and the specified input fails parsing with the specified grammar
Examples:
# - write test results to stdout (42)
# - exit with the code 0
echo "foo = '1' { return 42 }" | npx peggy --test 1
# - write a parser error to stdout (Expected "1" but "2" found)
# - exit with the code 2
echo "foo = '1' { return 42 }" | npx peggy --test 2
# - write an error to stdout (Generation of the source map is useless if you don't
# store a generated parser code, perhaps you forgot to add an `-o/--output` option?)
# - exit with the code 1
echo "foo = '1' { return 42 }" | npx peggy --source-map --test 1
# - write an error to stdout (Generation of the source map is useless if you don't
# store a generated parser code, perhaps you forgot to add an `-o/--output` option?)
# - exit with the code 1
echo "foo = '1' { return 42 }" | npx peggy --source-map --test 2
# - write an output to `parser.js`,
# - write a source map to `parser.js.map`
# - write test results to stdout (42)
# - exit with the code 0
echo "foo = '1' { return 42 }" | npx peggy --output parser.js --source-map --test 1
# - write an output to `parser.js`,
# - write a source map to `parser.js.map`
# - write a parser error to stdout (Expected "1" but "2" found)
# - exit with the code 2
echo "foo = '1' { return 42 }" | npx peggy --output parser.js --source-map --test 2
JavaScript API
Importing
Note again: this is an advanced usage of Peggy. Most of the core use cases of Peggy should prefer to generate a parser at build time using the CLI.
In Node.js, require the Peggy parser generator module:
const peggy = require("peggy");
or:
import peggy from "peggy";
With some configurations of TypeScript or other tools, you might need:
import * as peggy from "peggy";
For use in browsers, include the Peggy library in your web page or application using the <script> tag. If Peggy detects an AMD loader, it will define itself as a module, otherwise the API will be available in the peggy global object.
In Deno, you can import Peggy through a CDN like this:
import peggy from "https://esm.sh/peggy"; // Note: add @version in production
Generating a parser with the API
To generate a parser, call the peggy.generate method and pass your grammar as a parameter:
const parser = peggy.generate("start = ('a' / 'b')+");
The method will return a generated parser object or its source code as a string (depending on the value of the output option — see below). It will throw an exception if the grammar is invalid. The exception will contain a message property with more details about the error.
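For example, if you want the source code rather than a live parser object (say, to write it to disk yourself), a sketch like the following should work; the file names are illustrative:
const fs = require("fs");
const peggy = require("peggy");

// Ask for ES module source code instead of an evaluated parser object.
const source = peggy.generate("start = ('a' / 'b')+", {
  output: "source",
  format: "es",
  grammarSource: "example.peggy",
});
fs.writeFileSync("parser.js", source);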
If your grammar is split across multiple files, instead of passing a string as the first parameter of peggy.generate, pass an array containing objects with "source" and "text" keys:
const parser = peggy.generate([
{ source: "file1.peggy", text: "numbers = number|.., ','|" },
{ source: "lib.peggy", text: "number = n:$[0-9]+ { return parseInt(n, 10) }" },
]);
Note that this is the same format the GrammarError.format() function expects.
You can tweak the generated parser by passing a second parameter with an options object to peggy.generate. The following options are supported:
allowedStartRules
- Rules the parser will be allowed to start parsing from (default: the first rule in the grammar). If any of the rules specified is "*", any of the rules in the grammar can be used as start rules.
cache
- If true, makes the parser cache results, avoiding exponential parsing time in pathological cases but making the parser slower (default: false).
dependencies
- Parser dependencies. The value is an object which maps variables used to access the dependencies in the parser to module IDs used to load them; valid only when format is set to "amd", "commonjs", "es", or "umd". Dependency variables will be available in both the global initializer and the per-parse initializer. Unless the parser is to be generated in different formats, it is recommended to instead import dependencies from within the global initializer (default: {}).
error
- Callback for errors. See Error Reporting.
exportVar
- Name of a global variable into which the parser object is assigned when no module loader is detected; valid only when format is set to "globals" or "umd" (default: null).
format
- Format of the generated parser ("amd", "bare", "commonjs", "es", "globals", or "umd"); valid only when output is set to "source", "source-and-map", or "source-with-inline-map" (default: "bare").
grammarSource
- A string or object representing the "origin" of the input string being parsed. The location() API returns the supplied grammarSource in the source key. As an example, if you pass in grammarSource as "main.js", then errors with locations include { source: 'main.js', ... }. If you pass an object, the location API returns the entire object in the source key. If you format an error containing a location with format(), the formatter stringifies the object, so we recommend adding a toString() method to the object to improve error messages.
info
- Callback for informational messages. See Error Reporting.
output
- A string, one of:
  - "source" - return the parser source code as a string.
  - "parser" - return a generated parser object. This is just the "source" output that has had eval run on it. As such, some formats, such as "es", may not work.
  - "source-and-map" - return a SourceNode object; you can get the source code by calling its toString() method, or the source code and mapping by calling its toStringWithSourceMap() method; see the SourceNode documentation.
  - "source-with-inline-map" - return the parser source along with an embedded source map as a data: URI. This option leads to a larger output string, but is the easiest to integrate with developer tooling.
  - "ast" - return the internal AST of the grammar as a JSON string. Useful for plugin authors to explore the internals of Peggy and for automation.
  (default: "parser")
  Note: You should also set grammarSource to a non-empty string if you set this value to "source-and-map" or "source-with-inline-map". The path should be relative to the location where the generated parser code will be stored. For example, if you are generating lib/parser.js from src/parser.peggy, then your options should be: { grammarSource: "../src/parser.peggy" }
plugins
- Plugins to use. See the Plugins API section.
trace
- Makes the parser trace its progress (default: false).
warning
- Callback for warnings. See Error Reporting.
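Putting a few of these options together, a typical call might look like the sketch below; the file name is illustrative, and the extra start rule must exist in your grammar:
const fs = require("fs");
const peggy = require("peggy");

const grammarText = fs.readFileSync("arithmetics.peggy", "utf8");
const parser = peggy.generate(grammarText, {
  allowedStartRules: ["start", "additive"], // both rules can now be used as start rules
  cache: true,                              // trade speed for predictable worst-case time
  grammarSource: "arithmetics.peggy",       // shows up in error locations
});
console.log(parser.parse("2*(3+4)", { startRule: "additive" })); // 14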
Error Reporting
While generating the parser, the compiler may throw a GrammarError, which collects all of the issues that were seen.
There is also another way to collect problems as fast as they are reported — register one or more of these callbacks:
error(stage: Stage, message: string, location?: LocationRange, notes?: DiagnosticNote[]): void
warning(stage: Stage, message: string, location?: LocationRange, notes?: DiagnosticNote[]): void
info(stage: Stage, message: string, location?: LocationRange, notes?: DiagnosticNote[]): void
All parameters are the same as the parameters of the reporting API except the first.
The stage parameter identifies the compilation stage during which the diagnostic was generated. It is a string enumeration that currently has one of three values:
- check
- transform
- generate
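For example, you can collect diagnostics as they are reported while still letting the compiler throw at the end of a failing stage. A minimal sketch, assuming an intentionally broken grammar:
const peggy = require("peggy");

const problems = [];
const report = severity => (stage, message, location, notes) =>
  problems.push({ severity, stage, message, location, notes });

try {
  peggy.generate("start = missingRule", {
    error: report("error"),
    warning: report("warning"),
    info: report("info"),
  });
} catch (e) {
  // A GrammarError is still thrown once the failing stage completes;
  // `problems` now also contains each diagnostic as it was reported.
  console.log(e.message);
  console.log(problems.map(p => `${p.severity}(${p.stage}): ${p.message}`));
}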
Generating TypeScript Types
If you are consuming the generated parser from TypeScript, it is useful for there to be a .d.ts file next to the generated .js file that describes the types used in the parser. To enable this, use a configuration file such as:
// MJS
export default {
input: "foo.peggy",
output: "foo.js",
dts: true,
returnTypes: {
foo: "string",
},
};
If a rule name is in the allowedStartRules, but not in returnTypes, any will be used as the return type for that rule.
Note that --return-types <JSON object> can also be specified on the command line; the use of a config file just makes quoting easier to get correct.
Using the Parser
To use the generated parser, import it using your selected module approach if needed, then call its parse method and pass an input string as a parameter. The method will return a parse result (the exact value depends on the grammar used to generate the parser) or throw an exception if the input is invalid. The exception will contain location, expected, found, message, and diagnostic properties with more details about the error. The error will have a format(SourceText[]) function, to which you pass an array of objects that look like { source: grammarSource, text: string }; this will return a nicely-formatted error suitable for human consumption.
parser.parse("abba"); // returns ["a", "b", "b", "a"]
parser.parse("abcd"); // throws an exception
You can tweak parser behavior by passing a second parameter with an options object to the parse method. The following options are supported:
startRule
- Name of the rule to start parsing from.
tracer
- Tracer to use. A tracer is an object containing a trace() function. trace() takes a single parameter, which is an object containing "type" ("rule.enter", "rule.fail", or "rule.match"), "rule" (the rule name as a string), "location", and, if the type is "rule.match", "result" (what the rule returned).
... (any others)
- Made available in the options variable.
As you can see above, parsers can also support their own custom options. For example:
const parser = peggy.generate(`
{
// options are available in the per-parse initializer
console.log(options.validWords); // outputs "[ 'boo', 'baz', 'boop' ]"
}
validWord = @word:$[a-z]+ &{ return options.validWords.includes(word) }
`);
const result = parser.parse("boo", {
validWords: [ "boo", "baz", "boop" ]
});
console.log(result); // outputs "boo"
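A custom tracer is wired up the same way. This sketch logs every trace event for a parser that was generated with tracing enabled:
const peggy = require("peggy");

const parser = peggy.generate("start = 'a'+", { trace: true });

parser.parse("aaa", {
  tracer: {
    trace(event) {
      // event.type is "rule.enter", "rule.match", or "rule.fail";
      // event.rule and event.location are always present, and
      // event.result is present for "rule.match".
      console.log(event.type, event.rule, event.location.start.offset);
    },
  },
});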
Grammar Syntax and Semantics
The grammar syntax is similar to JavaScript in that it is not line-oriented and ignores whitespace between tokens. You can also use JavaScript-style comments (// ... and /* ... */).
Let's look at an example grammar that recognizes simple arithmetic expressions like 2*(3+4). A parser generated from this grammar computes their values.
start
= additive
additive
= left:multiplicative "+" right:additive { return left + right; }
/ multiplicative
multiplicative
= left:primary "*" right:multiplicative { return left * right; }
/ primary
primary
= integer
/ "(" additive:additive ")" { return additive; }
integer "simple number"
= digits:[0-9]+ { return parseInt(digits.join(""), 10); }
On the top level, the grammar consists of rules (in our example, there are five of them). Each rule has a name (e.g. integer) that identifies the rule, and a parsing expression (e.g. digits:[0-9]+ { return parseInt(digits.join(""), 10); }) that defines a pattern to match against the input text and possibly contains some JavaScript code that determines what happens when the pattern matches successfully. A rule can also have a human-readable name that is used in error messages (in our example, only the integer rule has a human-readable name). The parsing starts at the first rule, which is also called the start rule.
A rule name must be a Peggy identifier. It is followed by an equality sign (“=”) and a parsing expression. If the rule has a human-readable name, it is written as a JavaScript string between the rule name and the equality sign. Rules need to be separated only by whitespace (their beginning is easily recognizable), but a semicolon (“;”) after the parsing expression is allowed.
The first rule can be preceded by a global initializer and/or a per-parse initializer, in that order. Both are pieces of JavaScript code, in double curly braces (“{{” and “}}”) and single curly braces (“{” and “}”) respectively. All variables and functions defined in both initializers are accessible in rule actions and semantic predicates. Curly braces in both initializers' code must be balanced.
The global initializer is executed once and only once, when the generated parser is loaded (through a require or an import statement, for instance). It is the ideal location to require, to import, to declare constants, or to declare utility functions to be used in rule actions and semantic predicates.
The per-parse initializer is called before the generated parser starts parsing. The code inside the per-parse initializer can access the input string and the options passed to the parser using the input variable and the options variable respectively. It is the ideal location to create data structures that are unique to each parse or to modify the input before the parse.
Let's look at the example grammar from above using a global initializer and a per-parse initializer:
{{
function makeInteger(o) {
return parseInt(o.join(""), 10);
}
}}
{
if (options.multiplier) {
input = `(${input})*(${options.multiplier})`;
}
}
start
= additive
additive
= left:multiplicative "+" right:additive { return left + right; }
/ multiplicative
multiplicative
= left:primary "*" right:multiplicative { return left * right; }
/ primary
primary
= integer
/ "(" additive:additive ")" { return additive; }
integer "simple number"
= digits:[0-9]+ { return makeInteger(digits); }
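Assuming this grammar has been compiled into parser (for example with peggy.generate), the per-parse initializer rewrites the input before parsing, so the multiplier option changes the result:
parser.parse("2*(3+4)");                      // returns 14
parser.parse("2*(3+4)", { multiplier: "3" }); // parses "(2*(3+4))*(3)" and returns 42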
The parsing expressions of the rules are used to match the input text to the grammar. There are various types of expressions — matching characters or character classes, indicating optional parts and repetition, etc. Expressions can also contain references to other rules. See detailed description below.
If an expression successfully matches a part of the text when running the generated parser, it produces a match result, which is a JavaScript value. For example:
- An expression matching a literal string produces a JavaScript string containing matched text.
- An expression matching repeated occurrence of some subexpression produces a JavaScript array with all the matches.
The match results propagate through the rules when the rule names are used in expressions, up to the start rule. The generated parser returns the start rule's match result when parsing is successful.
One special case of parser expression is a parser action — a piece of JavaScript code inside curly braces (“{” and “}”) that takes match results of the preceding expression and returns a JavaScript value. This value is then considered the match result of the preceding expression (in other words, the parser action is a match result transformer).
In our arithmetics example, there are many parser actions. Consider the action in the expression digits:[0-9]+ { return parseInt(digits.join(""), 10); }. It takes the match result of the expression [0-9]+, which is an array of strings containing digits, as its parameter. It joins the digits together to form a number and converts it to a JavaScript number.
Importing External Rules
Sometimes, you want to split a large grammar into multiple files for ease of editing, reuse in multiple higher-level grammars, etc. There are two ways to accomplish this in Peggy:
- From the Command Line, include multiple source files. This will generate the least total amount of code, since the combined output will only have the runtime overhead included once. The resulting code will be slightly more performant, as there will be no overhead to call between the rules defined in different files at runtime. Finally, Peggy will be able to perform better checks and optimizations across the combined grammar with this approach, since the combination is applied before any other rules. For example:
  csv.peggy:
  a = number|1.., "," WS|
  WS = [ \t]*
  number.peggy:
  number = n:$[0-9]+ { return parseInt(n, 10); }
  Generate:
  $ npx peggy csv.peggy number.peggy
- The downside of the CLI approach is that editor tooling will not be able to detect that rules come from another file -- references to such rules will be shown with errors like Rule "number" is not defined. Furthermore, you must rely on getting the CLI or API call correct, which is not possible in all workflows.
  The second approach is to use ES6-style import statements at the top of your grammar to import rules into the local rule namespace. For example:
  csv_imp.peggy:
  import {number} from "./number.js"
  a = number|1.., "," WS|
  WS = [ \t]*
  Note that the file imported from is the compiled version of the grammar, NOT the source. Grammars MUST be compiled by a version of Peggy that supports imports in order to be imported. Only rules that are allowed start rules of the imported grammar can be imported. It can be useful to specify --allowed-start-rules * (with appropriate escaping for your shell!) in library grammars. Imports are only valid in the output formats "es" and "commonjs". If you use imports, you should use { output: "source" }; the default output of "parser" will call eval on the source, which fails immediately for some formats (e.g. "es") and will not find modules in the expected places for others (e.g. "commonjs"). The from-mem project is used by the Peggy CLI to resolve these issues, but note well its relatively severe limitations.
  All of the following are valid:
  import * as num from "number.js"          // Call with num.number
  import num from "number.js"               // Calls the default rule
  import {number, float} from "number.js"   // Import multiple rules by name
  import {number as NUM} from "number.js"   // Rename the local rule to NUM to avoid colliding
  import {"number" as NUM} from "number.js" // Valid in ES6
  import integer, {float} from "number.js"  // The default rule and some named rules
  import from "number.js"                   // Just the top-level initializer side-effects
  import {} from "number.js"                // Just the top-level initializer side-effects
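For example, the library grammar might be compiled ahead of time like this before it can be imported (a sketch; the file names are the ones used above):
# Compile the library so every rule can be imported, in ES module format
$ npx peggy --format es --allowed-start-rules '*' -o number.js number.peggy

# Compile the importing grammar; its import statement resolves against number.js
$ npx peggy --format es -o csv_imp.js csv_imp.peggy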
Parsing Expression Types
There are several types of parsing expressions, some of them containing subexpressions and thus forming a recursive structure. Each example below is part of a full grammar, which produces an object that contains match and rest. match is the part of the input that matched the example; rest is any remaining input after the match.
"literal"
'literal'-
Match exact literal string and return it. The string syntax is the same as in JavaScript. Appending
i
right after the literal makes the match case-insensitive.Example:literal = "foo"
Matches:"foo"
Does not match:"Foo"
,"fOo"
,"bar"
,"fo"
Try it:Example:literal_i = "foo"i
Matches:"foo"
,"Foo"
,"fOo"
Does not match:"bar"
,"fo"
Try it: .
(U+002E: FULL STOP, or "period")-
Match exactly one character and return it as a string.
Example:any = .
Matches:"f"
,"."
," "
Does not match:""
Try it: !.
(END OF INPUT)-
Match END OF INPUT. This Bang Dot sequence will specify that the end of input should be matched.
"f" !.
will test for end of input after the character "f".Example:no_input = !.
Matches:""
Does not match:"f"
Try it:Example:end_of_input = "f" !.
Matches:"f[EOI]"
Does not match:"f [EOI]"
,""
Try it: [characters]
-
Match one character from a set and return it as a string. The characters in the list can be escaped in exactly the same way as in JavaScript string. The list of characters can also contain ranges (e.g.
[a-z]
means “all lowercase letters”). Preceding the characters with^
inverts the matched set (e.g.[^a-z]
means “all character but lowercase letters”). Appendingi
right after the class makes the match case-insensitive.Example:class = [a-z]
Matches:"f"
Does not match:"A"
,"-"
,""
Try it:Example:not_class_i = [^a-z]i
Matches:"="
," "
Does not match:"F"
,"f"
,""
Try it: rule
-
Match a parsing expression of a rule (perhaps recursively) and return its match result.
Example:rule = child; child = "foo"
Matches:"foo"
Does not match:"Foo"
,"fOo"
,"bar"
,"fo"
Try it: ( expression )
-
Match a subexpression and return its match result. Parentheses create a new local context for the Action Execution Environment as well as plucks with the
@
operator. Note that the action block in the following example returns2
from the parenthesized expression, NOT from the rule -- the rule returns an array of2
's due to the+
operator.Example:paren = ("1" { return 2; })+
Matches:"11"
Does not match:"2"
,""
Try it:Similarly, in the next example, the pluck operator applies to the return value of the parentheses, not the rule:
Example:paren_pluck = (@[0-9] ",")+
Matches:"1,"
,"2,3,"
Does not match:"2"
,","
Try it: expression *
-
Match zero or more repetitions of the expression and return their match results in an array. The matching is greedy, i.e. the parser tries to match the expression as many times as possible. Unlike in regular expressions, there is no backtracking.
Example:star = "a"*
Matches:"a"
,"aaa"
Does not match: (always matches)Try it: expression +
-
Match one or more repetitions of the expression and return their match results in an array. The matching is greedy, i.e. the parser tries to match the expression as many times as possible. Unlike in regular expressions, there is no backtracking.
Example:plus = "a"+
Matches:"a"
,"aaa"
Does not match:"b"
,""
Try it: expression |count|
expression |min..max|
expression |count, delimiter|
expression |min..max, delimiter|-
Match exact
count
repetitions ofexpression
. If the match succeeds, return their match results in an array.-or-
Match expression at least
min
but not more thenmax
times. If the match succeeds, return their match results in an array. Bothmin
andmax
may be omitted. Ifmin
is omitted, then it is assumed to be0
. Ifmax
is omitted, then it is assumed to be infinity. Henceexpression |..|
is equivalent toexpression |0..|
andexpression *
expression |1..|
is equivalent toexpression +
expression |..1|
is equivalent toexpression ?
Optionally,
delimiter
expression can be specified. The delimiter is a separate parser expression, its match results are ignored, and it must appear between matched expressions exactly once.count
,min
andmax
can be represented as:- positive integer:
start = "a"|2|;
- name of the preceding label:
start = count:n1 "a"|count|; n1 = n:$[0-9] { return parseInt(n); };
- code block:
start = "a"|{ return options.count; }|;
Any non-number values, returned by the code block, will be interpreted as
0
.Example:repetition = "a"|2..3, ","|
Matches:"a,a"
,"a,a,a"
Does not match:"a"
,"b,b"
,"a,a,a,"
,"a,a,a,a"
Try it: expression ?
-
Try to match the expression. If the match succeeds, return its match result, otherwise return
null
. Unlike in regular expressions, there is no backtracking.Example:maybe = "a"?
Matches:"a"
,""
Does not match: (always matches)Try it: & expression
-
This is a positive assertion. No input is consumed.
Try to match the expression. If the match succeeds, just return
undefined
and do not consume any input, otherwise consider the match failed.Example:posAssertion = "a" &"b"
Matches:"ab"
Does not match:"ac"
,"a"
,""
Try it: ! expression
-
This is a negative assertion. No input is consumed.
Try to match the expression. If the match does not succeed, just return
undefined
and do not consume any input, otherwise consider the match failed.Example:negAssertion = "a" !"b"
Matches:"a"
,"ac"
Does not match:"ab"
,""
Try it: & { predicate }
-
This is a positive assertion. No input is consumed.
The predicate should be JavaScript code, and it's executed as a function. Curly braces in the predicate must be balanced.
The predicate should
return
a boolean value. If the result is truthy, it's match result isundefined
, otherwise the match is considered failed. Failure to include thereturn
keyword is a common mistake.The predicate has access to all variables and functions in the Action Execution Environment.
Example:posPredicate = @num:$[0-9]+ &{return parseInt(num, 10) < 100}
Matches:"0"
,"99"
Does not match:"100"
,"-1"
,""
Try it: ! { predicate }
-
This is a negative assertion. No input is consumed.
The predicate should be JavaScript code, and it's executed as a function. Curly braces in the predicate must be balanced.
The predicate should
return
a boolean value. If the result is falsy, it's match result isundefined
, otherwise the match is considered failed.The predicate has access to all variables and functions in the Action Execution Environment.
Example:negPredicate = @num:$[0-9]+ !{ return parseInt(num, 10) < 100 }
Matches:"100"
,"156"
Does not match:"56"
,"-1"
,""
Try it: $ expression
-
Try to match the expression. If the match succeeds, return the matched text instead of the match result.
If you need to return the matched text in an action, you can use the
text()
function, but returning a labeled$
expression is sometimes more clear..Example:dollar = $"a"+
Matches:"a"
,"aa"
Does not match:"b"
,""
Try it: label : expression
-
Match the expression and remember its match result under given label. The label must be a Peggy identifier.
Labeled expressions are useful together with actions, where saved match results can be accessed by action's JavaScript code.
Example:label = foo:"bar"i { return {foo}; }
Matches:"bar"
,"BAR"
Does not match:"b"
,""
Try it: @ ( label : )? expression
-
Match the expression and if the label exists, remember its match result under given label. The label must be a Peggy identifier, and must be valid as a function parameter in the language that is being generated (by default, JavaScript). Labels are only useful for later reference in a semantic predicate at this time.
Return the value of this expression from the rule, or "pluck" it. You may not have an action for this rule. The expression must not be a semantic predicate (
&{ predicate }
or!{ predicate }
). There may be multiple pluck expressions in a given rule, in which case an array of the plucked expressions is returned from the rule.Pluck expressions are useful for writing terse grammars, or returning parts of an expression that is wrapped in parentheses.
Example:pluck_1 = @$"a"+ " "+ @$"b"+
Matches:"aaa bb"
,"a "
Does not match:"b"
," "
Try it:Example:pluck_2 = @$"a"+ " "+ @two:$"b"+ &{ return two.length < 3 }
Matches:"aaa b"
,"a bb"
Does not match:"a bbbb"
,"b"
," "
Try it: expression1 expression2 ... expressionn
-
Match a sequence of expressions and return their match results in an array.
Example:sequence = "a" "b" "c"
Matches:"abc"
Does not match:"b"
," "
Try it: expression { action }
-
If the expression matches successfully, run the action, otherwise consider the match failed.
The action should be JavaScript code, and it's executed as a function. Curly braces in the action must be balanced.
The action should
return
some value, which will be used as the match result of the expression.The action has access to all variables and functions in the Action Execution Environment.
Example:action = " "+ "a" { return location(); }
Matches:" a"
Does not match:"a"
," "
Try it: -
expression1 / expression2 / ... / expressionn
-
Try to match the first expression, if it does not succeed, try the second one, etc. Return the match result of the first successfully matched expression. If no expression matches, consider the match failed.
Example:alt = "a" / "b" / "c"
Matches:"a"
,"b"
,"c"
Does not match:"d"
,""
Try it:
Action Execution Environment
Actions and predicates have these variables and functions available to them.
- All variables and functions defined in the initializer or the top-level initializer at the beginning of the grammar are available.
- Note that all functions and variables described below are unavailable in the global initializer.
- Labels from preceding expressions are available as local variables, which will have the match result of the labelled expressions. A label is only available after its labelled expression is matched:
  rule = A:('a' B:'b' { /* B is available, A is not */ } )
  A label in a sub-expression is only valid within the sub-expression:
  rule = A:'a' (B:'b') (C:'b' { /* A and C are available, B is not */ })
- input is the string that was passed to the parse() method.
- options is a variable that contains the parser options. That is the same object that was passed to the parse() method.
- error(message, where) will report an error and throw an exception. where is optional; the default is the value of location().
- expected(message, where) is similar to error, but reports Expected message but "other" found., where other is, by default, the character at the location().start.offset position.
- location() returns an object with information about the current parse position. Refer to the "Locations" section for the details.
- range() is similar to location(), but returns an object with offsets only. Refer to the "Locations" section for the details.
- offset() returns only the start offset, i.e. location().start.offset. Refer to the "Locations" section for the details.
- text() returns the source text between start and end (which will be "" for predicates). Instead of using this function as the return value for the rule, consider using the $ operator.
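For example, a rule can combine several of these helpers to reject otherwise well-formed input with a custom message (a sketch):
byte "byte value"
  = digits:$[0-9]+ {
      const n = parseInt(digits, 10);
      if (n > 255) {
        // error() throws, so parsing stops here with this message.
        error(`byte out of range: ${text()}`, location());
      }
      return n;
    }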
Parsing Lists
One of the most frequent questions about Peggy grammars is how to parse a delimited list of items. The cleanest current approach is:
list
= word|.., _ "," _|
word
= $[a-z]i+
_
= [ \t]*
If you want to allow a trailing delimiter, append it to the end of the rule:
list
= word|.., delimiter| delimiter?
delimiter
= _ "," _
word
= $[a-z]i+
_
= [ \t]*
In grammars created before the repetition operator was added to Peggy (in 3.0.0), you may see the following approach, which is equivalent to the new approach with the repetition operator, but less efficient on long lists:
list
= head:word tail:(_ "," _ @word)* { return [head, ...tail]; }
word
= $[a-z]i+
_
= [ \t]*
Note that the @
in the tail section plucks the word out of the
parentheses, NOT out of the rule itself.
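Assuming the repetition-based grammar at the top of this section has been compiled into parser, it returns the matched words as an array:
parser.parse("a, b,c"); // returns ["a", "b", "c"]
parser.parse("");       // returns [] (the |..| bound allows zero items)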
Peggy Identifiers
Peggy Identifiers are used as rule names, rule references, and label names. They are used as identifiers in the code that Peggy generates (by default, JavaScript), and as such, must conform to the limitations of the Peggy grammar as well as those of the target language.
Like all Peggy grammar constructs, identifiers MUST contain only codepoints in the Basic Multilingual Plane. They must begin with a codepoint whose Unicode General Category property is Lu, Ll, Lt, Lm, Lo, or Nl (letters), "_" (underscore), or a Unicode escape in the form \uXXXX. Subsequent codepoints can be any of those that are valid as an initial codepoint, "$", codepoints whose General Category property is Mn or Mc (combining characters), Nd (numbers), or Pc (connector punctuation), "\u200C" (zero width non-joiner), or "\u200D" (zero width joiner).
Labels have a further restriction, which is that they must be valid as a function parameter in the language being generated. For JavaScript, this means that they cannot be on the limited set of JavaScript reserved words. Plugins can modify the list of reserved words at compile time.
Valid identifiers:
- Foo
- Bär
- _foo
- foo$bar
Invalid identifiers:
- const (reserved word)
- 𐓁𐒰͘𐓐𐓎𐓊𐒷 (valid in JavaScript, but not in the Basic Multilingual Plane)
- $Bar (starts with "$")
- foo bar (invalid JavaScript identifier containing a space)
Error Messages
As described above, you can annotate your grammar rules with human-readable names that will be used in error messages. For example, this production:
integer "simple number"
= digits:[0-9]+
will produce an error message like:
Expected simple number but "a" found.
when parsing a non-number, referencing the human-readable name "simple number." Without the human-readable name, Peggy instead uses a description of the character class that failed to match:
Expected [0-9] but "a" found.
Aside from the text content of messages, human-readable names also have a subtler effect on where errors are reported. Peggy prefers to match named rules completely or not at all, but not partially. Unnamed rules, on the other hand, can produce an error in the middle of their subexpressions.
For example, for this rule matching a comma-separated list of integers:
seq
= integer ("," integer)*
an input like 1,2,a produces this error message:
Expected integer but "a" found.
But if we add a human-readable name to the seq production:
seq "list of numbers"
= integer ("," integer)*
then Peggy prefers an error message that implies a smaller attempted parse tree:
Expected end of input but "," found.
There are two classes of errors in Peggy:
- SyntaxError: Syntax errors, found while parsing the input. This kind of error can be thrown both during grammar parsing and during input parsing. Although the name is the same, the errors of each generated parser (including the Peggy parser itself) have their own unique class.
- GrammarError: Grammar errors, found during construction of the parser. These errors can be thrown only in the parser generation phase. This error signals a logical mistake in the grammar, such as having two rules with the same name in one grammar, etc.
By default, stringifying these errors produces an error string without location information. These errors also have a format() method that produces an error string with location information. If you provide an array of mappings from the grammarSource to the input string being processed, then the formatted error string includes ASCII arrows and underlines highlighting the error(s) in the source.
let source = ...;
try {
peggy.generate(source, { grammarSource: 'recursion.pegjs', ... }); // throws SyntaxError or GrammarError
parser.parse(input, { grammarSource: 'input.js', ... }); // throws SyntaxError
} catch (e) {
if (typeof e.format === "function") {
console.log(e.format([
{ source: 'recursion.pegjs', text: source },
{ source: 'input.js', text: input },
...
]));
} else {
throw e;
}
}
Messages generated by format() look like this:
Error: Possible infinite loop when parsing (left recursion: start -> proxy -> end -> start)
--> .\recursion.pegjs:1:1
|
1 | start = proxy;
| ^^^^^
note: Step 1: call of the rule "proxy" without input consumption
--> .\recursion.pegjs:1:9
|
1 | start = proxy;
| ^^^^^
note: Step 2: call of the rule "end" without input consumption
--> .\recursion.pegjs:2:11
|
2 | proxy = a:end { return a; };
| ^^^
note: Step 3: call itself without input consumption - left recursion
--> .\recursion.pegjs:3:8
|
3 | end = !start
| ^^^^^
Error: Expected ";" or "{" but "x" found.
--> input.js:1:16
|
1 | function main()x {}
| ^
A plugin may register additional passes that can generate GrammarErrors to report problems, but they shouldn't do that by throwing an instance of GrammarError. They should use the session API instead.
Locations
During parsing you can access information about the current parse location, such as the offset into the parsed string and the line and column. You can get this information by calling the location() function, which returns the following object:
{
source: options.grammarSource,
start: { offset: 23, line: 5, column: 6 },
end: { offset: 25, line: 5, column: 8 }
}
source is the string or object that was supplied in the grammarSource parser option.
For certain special cases, you can use an instance of the GrammarLocation class as the grammarSource. GrammarLocation allows you to specify the offset of the grammar source in another file, such as when that grammar is embedded in a larger document.
If source is null or undefined, it doesn't appear in the formatted messages. The default value for source is undefined.
For actions, start refers to the position at the beginning of the preceding expression, and end refers to the position after the end of the preceding expression. For semantic predicates, start and end are equal, denoting the location where the predicate is evaluated.
For the per-parse initializer, the location is the start of the input, i.e.
{
source: options.grammarSource,
start: { offset: 0, line: 1, column: 1 },
end: { offset: 0, line: 1, column: 1 }
}
offset is a 0-based character index within the source text. line and column are 1-based indices. The line number is incremented each time the parser finds an end-of-line sequence in the input.
Line and column are somewhat expensive to compute, so if you just need the offset, there is also a function offset() that returns just the start offset, and a function range() that returns the object:
{
source: options.grammarSource,
start: 23,
end: 25
}
(i.e. it differs from the location() result only in the type of the start and end properties, which contain just an offset instead of a Location object.)
All of the notes about the values of the location() object also apply to the range() and offset() calls.
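The difference is easiest to see from inside an action. A small sketch (the values shown assume the rule matches at the very start of the input):
positions = "hello" {
  return {
    loc: location(), // { source: ..., start: { offset: 0, line: 1, column: 1 }, end: { offset: 5, line: 1, column: 6 } }
    rng: range(),    // { source: ..., start: 0, end: 5 }
    off: offset(),   // 0
  };
}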
Currently, Peggy grammars may only contain codepoints from the Basic Multilingual Plane (BMP) of Unicode. This means that all offsets are measured in UTF-16 code units. If you include characters outside this Plane (for example, emoji, or any surrogate pairs), you may get an offset inside a code point.
Changing this behavior might be a breaking change, so it will likely cause a major version number increase if it happens. You can join the discussion of this topic on the GitHub Discussions page.
Plugins API
A plugin is an object with a use(config, options) method. That method will be called for all plugins in the options.plugins array supplied to the generate() method.
Plugins suitable for use on the command line can be written either as CJS or MJS modules that export a "use" function. The CLI loads plugins with await import(plugin_name), which should correctly load from node_modules, a local file starting with "/" or "./", etc. For example:
// CJS
exports.use = (config, options) => {
}
// MJS
export function use(config, options) {
}
use accepts these parameters:
config
Object with the following properties:
parser
- A Parser object, by default the peggy.parser instance. That object will be used to parse the grammar. A plugin can replace this object.
passes
- A mapping { [stage: string]: Pass[] } that represents the compilation stages that will be applied to the AST returned by the parser object. That mapping will contain at least the following keys:
  - check — passes that check the AST for correctness. They shouldn't change the AST.
  - transform — passes that perform various optimizations. They can change the AST, and add or remove nodes or their properties.
  - generate — passes used for actual code generation.
  A plugin that implements a pass should usually push it to the end of the correct array. Each pass is a function with the signature pass(ast, options, session):
  - ast — the AST created by the config.parser.parse() method
  - options — compilation options passed to the peggy.compiler.compile() method. If parser generation was started by calling the generate() function, this is also the options object that was passed to generate()
  - session — a Session object that allows raising errors, warnings and informational messages
reservedWords
- A string array with a list of words that shouldn't be used as label names. This list can be modified by plugins. The property is not required to be sorted or to be free of duplicates, but it is recommended to remove duplicates.
  The default list contains the JavaScript reserved words and can be found in the peggy.RESERVED_WORDS property.
options
Build options passed to the generate() method. A best practice for a plugin is to look for its own options under a <plugin_name> key:
// File: foo.mjs
export function use(config, options) {
  const mine = options['foo_mine'] ?? 'my default';
}
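A complete, minimal plugin might look something like the following sketch; the pass and file names are illustrative:
// File: report-rule-count.mjs
export function use(config, options) {
  // Add a pass at the end of the "check" stage that reports an
  // informational message with the number of rules in the grammar.
  config.passes.check.push((ast, opts, session) => {
    session.info(`Grammar contains ${ast.rules.length} rule(s)`);
  });
}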
Session API
Each compilation request is represented by a Session instance. An object of this class is created by the compiler and given to each pass as its third parameter. The session object gives access to various compiler services. At present there is only one such service: reporting of diagnostics.
All diagnostics are divided into three groups: errors, warnings, and informational messages. For each of them the Session object has a method, described below.
All reporting methods have an identical signature:
(message: string, location?: LocationRange, notes?: DiagnosticNote[]) => void;
- message: the main diagnostic message
- location: optional location information, if the diagnostic is related to the grammar source code
- notes: an array with additional details about the diagnostic, pointing to different places in the grammar. For example, each note could be the location of a duplicated rule definition
error(...)
- Reports an error. The compilation process is subdivided into pieces called stages, and each stage consists of one or more passes. Within one stage, all errors reported by the different passes are collected without interrupting compilation.
  When all passes in the stage are completed, the stage is checked for errors. If any were registered, a GrammarError with all found problems in its problems property is thrown. If there are no errors, the next stage is processed. After processing all three stages (check, transform and generate) the compilation process is finished.
  The process described above means that passes should be careful about what they assume. For example, if you place your pass in the check stage, there is no guarantee that all rules exist, because checking for existing rules is also performed during the check stage. On the contrary, passes in the transform and generate stages can be sure that all rules exist, because that precondition was checked during the check stage.
warning(...)
- Reports a warning. Warnings are similar to errors, but they do not interrupt compilation.
info(...)
- Reports an informational message. This method can be used to inform the user about significant changes in the grammar, for example the replacement of proxy rules.
Compatibility
Both the parser generator and generated parsers should run well in the following environments:
- Node.js 14+
- Edge
- Firefox
- Chrome
- Safari
- Opera
The generated parser is intended to run in older environments when the format chosen is "globals" or "umd". Extensive testing is NOT performed in these environments, but issues filed regarding the generated code will be fixed.