Understanding the ECMAScript spec, part 3

Published · Tags: ECMAScript, Understanding ECMAScript


In this episode, we’ll go deeper into the definition of the ECMAScript language and its syntax. If you’re not familiar with context-free grammars, now is a good time to check out the basics, since the spec uses context-free grammars to define the language. See the chapter about context-free grammars in "Crafting Interpreters" for an approachable introduction or the Wikipedia page for a more mathematical definition.

ECMAScript grammars #

The ECMAScript spec defines four grammars:

The lexical grammar describes how Unicode code points are translated into a sequence of input elements (tokens, line terminators, comments, white space).

The syntactic grammar defines how syntactically correct programs are composed of tokens.

The RegExp grammar describes how Unicode code points are translated into regular expressions.

The numeric string grammar describes how Strings are translated into numeric values.

Each grammar is defined as a context-free grammar, consisting of a set of productions.

The grammars use slightly different notation to separate the left-hand side of a production from its right-hand side: the syntactic grammar uses one colon (LeftHandSideSymbol :), the lexical grammar and the RegExp grammar use two (LeftHandSideSymbol ::), and the numeric string grammar uses three (LeftHandSideSymbol :::).
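To get a feeling for why the numeric string grammar is a grammar of its own, we can compare how Number() converts a String with how similar characters behave as a numeric literal in source code. This is only a small sketch; the inputs and variable names are picked for illustration.

```javascript
// Strings are converted according to the numeric string grammar, which
// differs from the grammar for numeric literals in source code.
const fromPaddedString = Number('  42  ');   // surrounding white space is allowed
const fromEmptyString = Number('');          // the empty String converts to 0
const fromHexString = Number('0x1F');        // hex digits are allowed
const fromSeparatedString = Number('1_000'); // numeric separators are not: NaN

// As a literal in source code, the separator is fine.
const fromLiteral = 1_000;
```

The last two lines show the same characters giving different results, depending on which grammar is applied to them.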

Next we’ll look into the lexical grammar and the syntactic grammar in more detail.

Lexical grammar #

The spec defines ECMAScript source text as a sequence of Unicode code points. For example, variable names are not limited to ASCII characters but can also include other Unicode characters. The spec doesn’t talk about the actual encoding (for example, UTF-8 or UTF-16). It assumes that the source code has already been converted into a sequence of Unicode code points according to the encoding it was in.
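As a rough illustration, the snippet below declares a variable with a non-ASCII name and then looks at a piece of source text as a sequence of code points. The sample source string is made up for this example, and TextEncoder is assumed to be available as a global (as it is in browsers and recent Node.js versions).

```javascript
// A non-ASCII identifier is perfectly valid.
const π = Math.PI;

// The same declaration as a piece of source text.
const source = 'const π = Math.PI;';

// The spec works with code points...
const codePoints = Array.from(source, (ch) => ch.codePointAt(0));

// ...while the number of bytes depends on the encoding (here: UTF-8,
// where π takes two bytes).
const utf8ByteLength = new TextEncoder().encode(source).length;
```

Here codePoints has 18 entries while the UTF-8 encoding takes 19 bytes; the grammar only ever sees the former.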

It’s not possible to tokenize ECMAScript source code in advance, that is, without knowing the syntactic context each token occurs in. This makes defining the lexical grammar slightly more complicated.

For example, we cannot determine whether / is the division operator or the start of a RegExp without looking at the larger context it occurs in:

const x = 10 / 5;

Here / is a DivPunctuator.

const r = /foo/;

Here the first / is the start of a RegularExpressionLiteral.

Templates introduce a similar ambiguity — the interpretation of }` depends on the context it occurs in:

const what1 = 'temp';
const what2 = 'late';
const t = `I am a ${ what1 + what2 }`;

Here `I am a ${ is a TemplateHead and }` is a TemplateTail.

if (0 == 1) {
}`not very useful`;

Here } is a RightBracePunctuator and ` is the start of a NoSubstitutionTemplate.

Even though the interpretation of / and }` depends on their “context” — their position in the syntactic structure of the code — the grammars we’ll describe next are still context-free.
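To see how much the position matters, here is a small sketch where the characters / g / i appear twice with completely different meanings. The variable names are arbitrary and chosen only to make the two lines look alike.

```javascript
const a = 6, g = 2, i = 3;

// After an identifier, "/" is a DivPunctuator: this is (a / g) / i.
const quotient = a /g/ i;

// After "=", "/" starts a RegularExpressionLiteral with body "g" and flag "i".
const pattern = /g/i;
```

The first line evaluates to the number 1, the second to a RegExp object, even though the tokenizer sees nearly the same characters in both.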

The lexical grammar uses several goal symbols to distinguish between the contexts where some input elements are permitted and some are not. For example, the goal symbol InputElementDiv is used in contexts where / is a division and /= is a division-assignment. The InputElementDiv productions list the possible tokens which can be produced in this context:

InputElementDiv ::
WhiteSpace
LineTerminator
Comment
CommonToken
DivPunctuator
RightBracePunctuator

In this context, encountering / produces the DivPunctuator input element. Producing a RegularExpressionLiteral is not an option here.

On the other hand, InputElementRegExp is the goal symbol for the contexts where / is the beginning of a RegExp:

InputElementRegExp ::
WhiteSpace
LineTerminator
Comment
CommonToken
RightBracePunctuator
RegularExpressionLiteral

As we see from the productions, it’s possible that this produces the RegularExpressionLiteral input element, but producing DivPunctuator is not possible.

Similarly, there is another goal symbol, InputElementRegExpOrTemplateTail, for contexts where TemplateMiddle and TemplateTail are permitted, in addition to RegularExpressionLiteral. And finally, InputElementTemplateTail is the goal symbol for contexts where only TemplateMiddle and TemplateTail are permitted but RegularExpressionLiteral is not permitted.

In implementations, the syntactic grammar analyzer (“parser”) may call the lexical grammar analyzer (“tokenizer” or “lexer”), passing the goal symbol as a parameter and asking for the next input element suitable for that goal symbol.
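A toy version of this interaction might look like the sketch below. It is not how any real engine is implemented: nextInputElement is a hypothetical function, it only distinguishes InputElementRegExp from "everything else", and its notion of a regular expression literal is heavily simplified.

```javascript
// Toy sketch, not a real ECMAScript lexer: the caller (the "parser") passes
// the goal symbol, and the lexer interprets "/" accordingly.
function nextInputElement(source, pos, goal) {
  // Skip spaces (a real lexer would produce WhiteSpace input elements).
  while (source[pos] === ' ') pos++;
  if (pos >= source.length) return null;

  if (source[pos] === '/') {
    if (goal === 'InputElementRegExp') {
      // Heavily simplified RegularExpressionLiteral: no escapes or classes.
      const close = source.indexOf('/', pos + 1);
      if (close === -1) throw new SyntaxError('Unterminated regular expression');
      let end = close + 1;
      while (/[a-z]/.test(source[end] ?? '')) end++; // consume flags
      return { type: 'RegularExpressionLiteral', text: source.slice(pos, end), end };
    }
    // Any other goal is treated like InputElementDiv in this sketch.
    const text = source[pos + 1] === '=' ? '/=' : '/';
    return { type: 'DivPunctuator', text, end: pos + text.length };
  }

  // Everything else: an identifier, a run of digits, or a single character.
  const match = /^(?:[A-Za-z_$][\w$]*|\d+)/.exec(source.slice(pos));
  const text = match ? match[0] : source[pos];
  return { type: 'CommonToken', text, end: pos + text.length };
}

// The same characters, lexed under two different goals.
const asRegExp = nextInputElement('/foo/g', 0, 'InputElementRegExp');
const asDivision = nextInputElement('/foo/g', 0, 'InputElementDiv');
```

With the InputElementRegExp goal, the whole of /foo/g becomes one input element; with the InputElementDiv goal, only the first / is consumed and the parser would ask for the next input element starting at foo.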

Syntactic grammar #

We looked into the lexical grammar, which defines how we construct tokens from Unicode code points. The syntactic grammar builds on it: it defines how syntactically correct programs are composed of tokens.

Example: Allowing legacy identifiers #

Introducing a new keyword to the grammar is a possibly breaking change — what if existing code already uses the keyword as an identifier?

For example, before await was a keyword, someone might have written the following code:

function old() {
  var await;
}

The ECMAScript grammar carefully added the await keyword in such a way that this code continues to work. Inside async functions, await is a keyword, so this doesn’t work:

async function modern() {
  var await; // Syntax error
}

Allowing yield as an identifier in non-generators and disallowing it in generators works similarly.
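If you’d like to try this out without crashing your whole script on a syntax error, one option is to hand the source to the Function constructor and catch the SyntaxError. The parses helper below is a hypothetical name, not a built-in, and the sketch assumes an environment where dynamic code evaluation is allowed (for example, no restrictive Content Security Policy).

```javascript
// Returns true if `source` parses as the body of a sloppy-mode function.
// The Function constructor only parses and compiles here; the code is never run.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const awaitInPlainFunction = parses('function old() { var await; }');          // true
const awaitInAsyncFunction = parses('async function modern() { var await; }'); // false
const yieldInPlainFunction = parses('function old() { var yield; }');          // true
const yieldInGenerator = parses('function* modern() { var yield; }');          // false
```

Note that this only checks sloppy-mode, non-module code; in strict mode yield is reserved everywhere, and in modules await is reserved everywhere.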

Understanding how await is allowed as an identifier requires understanding ECMAScript-specific syntactic grammar notation. Let’s dive right in!

Productions and shorthands #

Let’s look at how the productions for VariableStatement are defined. At first glance, the grammar can look a bit scary:

VariableStatement[Yield, Await] :
var VariableDeclarationList[+In, ?Yield, ?Await] ;

What do the subscripts ([Yield, Await]) and prefixes (+ in +In and ? in ?Await) mean?

The notation is explained in the section Grammar Notation.

The subscripts are a shorthand for expressing a set of productions, for a set of left-hand side symbols, all at once. The left-hand side symbol has two parameters, which expand into four "real" left-hand side symbols: VariableStatement, VariableStatement_Yield, VariableStatement_Await, and VariableStatement_Yield_Await.

Note that here the plain VariableStatement means “VariableStatement without _Await and _Yield”. It should not be confused with VariableStatement[Yield, Await].

On the right-hand side of the production, we see the shorthand +In, meaning "use the version with _In", and ?Await, meaning “use the version with _Await if and only if the left-hand side symbol has _Await” (similarly with ?Yield).

The third shorthand, ~Foo, meaning “use the version without _Foo”, is not used in this production.

With this information, we can expand the productions like this:

VariableStatement :
var VariableDeclarationList_In ;

VariableStatement_Yield :
var VariableDeclarationList_In_Yield ;

VariableStatement_Await :
var VariableDeclarationList_In_Await ;

VariableStatement_Yield_Await :
var VariableDeclarationList_In_Yield_Await ;
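The expansion is mechanical enough that we can sketch it in a few lines of code. The helpers below (parameterCombinations, resolveArguments, expandProduction) are hypothetical and exist only for this illustration; the spec describes the expansion in prose, and this sketch handles just the +, ? and ~ prefixes on a single right-hand side.

```javascript
// ['Yield', 'Await'] -> [[], ['Yield'], ['Await'], ['Yield', 'Await']]
function parameterCombinations(params) {
  return params.reduce(
    (combos, param) => combos.concat(combos.map((combo) => [...combo, param])),
    [[]]
  );
}

// Resolves arguments such as ['+In', '?Yield', '~Await'] against the
// parameters that are present on the left-hand side symbol.
// '+X' is always kept, '?X' is kept only if X is active, '~X' is dropped.
function resolveArguments(args, active) {
  return args
    .filter((arg) => arg[0] === '+' || (arg[0] === '?' && active.includes(arg.slice(1))))
    .map((arg) => arg.slice(1));
}

const suffix = (names) => names.map((name) => '_' + name).join('');

// rhs is a list of terminals (strings) and nonterminals ({ symbol, args }).
function expandProduction(lhs, params, rhs) {
  return parameterCombinations(params).map((active) => {
    const right = rhs.map((part) =>
      typeof part === 'string' ? part : part.symbol + suffix(resolveArguments(part.args, active))
    );
    return `${lhs}${suffix(active)} : ${right.join(' ')}`;
  });
}

const expanded = expandProduction('VariableStatement', ['Yield', 'Await'], [
  'var',
  { symbol: 'VariableDeclarationList', args: ['+In', '?Yield', '?Await'] },
  ';',
]);
```

The expanded array contains four strings, one per expanded production, in the same order as listed above.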

Ultimately, we need to find out two things:

  1. Where is it decided whether we’re in the case with _Await or without _Await?
  2. Where does it make a difference — where do the productions for Something_Await and Something (without _Await) diverge?

_Await or no _Await? #

Let’s tackle question 1 first. It’s somewhat easy to guess that non-async functions and async functions differ in whether we pick the parameter _Await for the function body or not. Reading the productions for async function declarations, we find this:

AsyncFunctionBody :
FunctionBody[~Yield, +Await]

Note that AsyncFunctionBody has no parameters — they get added to the FunctionBody on the right-hand side.

If we expand this production, we get:

AsyncFunctionBody :
FunctionBody_Await

In other words, async functions have FunctionBody_Await, meaning a function body where await is treated as a keyword.

On the other hand, if we’re inside a non-async function, the relevant production is:

FunctionDeclaration[Yield, Await, Default] :
function BindingIdentifier[?Yield, ?Await] ( FormalParameters[~Yield, ~Await] ) { FunctionBody[~Yield, ~Await] }

(FunctionDeclaration has another production, but it’s not relevant for our code example.)

To avoid combinatorial expansion, let’s ignore the Default parameter, which is not used in this particular production.

The expanded form of the production is:

FunctionDeclaration :
function BindingIdentifier ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield :
function BindingIdentifier_Yield ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Await :
function BindingIdentifier_Await ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield_Await :
function BindingIdentifier_Yield_Await ( FormalParameters ) { FunctionBody }

In this production we always get FunctionBody and FormalParameters (without _Yield and without _Await), since they are parameterized with [~Yield, ~Await] in the non-expanded production.

The function name is treated differently: it gets the parameters _Await and _Yield if the left-hand side symbol has them.

To summarize: Async functions have a FunctionBody_Await and non-async functions have a FunctionBody (without _Await). Since we’re talking about non-generator functions, both our async example function and our non-async example function are parameterized without _Yield.
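One observable consequence of these productions shows up when a non-async function is nested inside an async function. The sketch below checks this by parsing source strings with the Function constructor; parses is a hypothetical helper, and the check assumes sloppy-mode, non-module code in an environment that allows dynamic code evaluation.

```javascript
// Returns true if `source` parses as the body of a sloppy-mode function.
function parses(source) {
  try {
    new Function(source); // parse only; the code is never run
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The body and the parameters of a non-async function are parsed with
// [~Yield, ~Await], even when the function is nested in an async function.
const awaitInNestedBody = parses('async function outer() { function inner() { var await; } }'); // true
const awaitAsNestedParameter = parses('async function outer() { function inner(await) {} }');   // true

// The name of the nested function gets [?Yield, ?Await], so it is parsed
// with the parameters of the surrounding async function body.
const awaitAsNestedName = parses('async function outer() { function await() {} }'); // false
```

In other words, the nested function’s name belongs to the outer function’s world, while its parameters and body start a new one.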

Maybe it’s hard to remember which one is FunctionBody and which FunctionBody_Await. Is FunctionBody_Await for a function where await is an identifier, or for a function where await is a keyword?

You can think of the _Await parameter as meaning "await is a keyword". This approach is also future-proof. Imagine a new keyword, blob, being added, but only inside "blobby" functions. Non-blobby non-async non-generators would still have FunctionBody (without _Await, _Yield or _Blob), exactly like they have now. Blobby functions would have a FunctionBody_Blob, async blobby functions would have FunctionBody_Await_Blob and so on. We’d still need to add the Blob subscript to the productions, but the expanded forms of FunctionBody for already existing functions stay the same.

Disallowing await as an identifier #

Next, we need to find out how await is disallowed as an identifier if we're inside a FunctionBody_Await.

We can follow the productions further to see that the _Await parameter gets carried unchanged from FunctionBody all the way to the VariableStatement production we were previously looking at.

Thus, inside an async function, we’ll have a VariableStatement_Await and inside a non-async function, we’ll have a VariableStatement.

We can follow the productions further and keep track of the parameters. We already saw the productions for VariableStatement:

VariableStatement[Yield, Await] :
var VariableDeclarationList[+In, ?Yield, ?Await] ;

All productions for VariableDeclarationList just carry the parameters on as is:

VariableDeclarationList[In, Yield, Await] :
VariableDeclaration[?In, ?Yield, ?Await]

(Here we show only the production relevant to our example.)

VariableDeclaration[In, Yield, Await] :
BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await] opt

The opt shorthand means that the right-hand side symbol is optional; there are in fact two productions, one with the optional symbol, and one without.

In the simple case relevant to our example, VariableStatement consists of the keyword var, followed by a single BindingIdentifier without an initializer, and ending with a semicolon.

To disallow or allow await as a BindingIdentifier, we hope to end up with something like this:

BindingIdentifier_Await :
Identifier
yield

BindingIdentifier :
Identifier
yield
await

This would disallow await as an identifier inside async functions and allow it as an identifier inside non-async functions.

But the spec doesn’t define it like this. Instead, we find this production:

BindingIdentifier[Yield, Await] :
Identifier
yield
await

Expanded, this means the following productions:

BindingIdentifier_Await :
Identifier
yield
await

BindingIdentifier :
Identifier
yield
await

(We’re omitting the productions for BindingIdentifier_Yield and BindingIdentifier_Yield_Await which are not needed in our example.)

This looks like await and yield would always be allowed as identifiers. What’s up with that? Is the whole blog post for nothing?

Static semantics to the rescue #

It turns out that static semantics are needed for forbidding await as an identifier inside async functions.

Static semantics describe static rules — that is, rules that are checked before the program runs.

In this case, the static semantics for BindingIdentifier define the following syntax-directed rule:

BindingIdentifier[Yield, Await] : await

It is a Syntax Error if this production has an [Await] parameter.

Effectively, this forbids the BindingIdentifier_Await : await production.

The spec explains that the reason for having this production, but defining it as a Syntax Error in the static semantics, is interference with automatic semicolon insertion (ASI).

Remember that ASI kicks in when we’re unable to parse a line of code according to the grammar productions. ASI tries to add semicolons to satisfy the requirement that statements and declarations must end with a semicolon. (We’ll describe ASI in more detail in a later episode.)

Consider the following code (example from the spec):

async function too_few_semicolons() {
  let
  await 0;
}

If the grammar disallowed await as an identifier, ASI would kick in and transform the code into the following grammatically correct code, which also uses let as an identifier:

async function too_few_semicolons() {
  let;
  await 0;
}

This kind of interference with ASI was deemed too confusing, so static semantics were used for disallowing await as an identifier.
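We can check both variants by parsing them with the Function constructor. As before, parses is a hypothetical helper, and the check assumes sloppy-mode code (where let can still be used as an identifier) in an environment that allows dynamic code evaluation.

```javascript
// Returns true if `source` parses as the body of a sloppy-mode function.
function parses(source) {
  try {
    new Function(source); // parse only; the code is never run
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The example from the spec: ASI does not rescue it, so it is rejected.
const withoutSemicolon = parses('async function f() {\n  let\n  await 0;\n}'); // false

// With an explicit semicolon, "let" is an identifier and "await 0" is an
// await expression, so the code parses.
const withSemicolon = parses('async function f() {\n  let;\n  await 0;\n}'); // true
```

So the code with the missing semicolon is rejected rather than being silently reinterpreted as the second variant.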

Disallowed StringValues of identifiers #

There’s also another related rule:

BindingIdentifier[Yield, Await] : Identifier

It is a Syntax Error if this production has an [Await] parameter and StringValue of Identifier is "await".

This might be confusing at first. Identifier is defined like this:

Identifier :
IdentifierName but not ReservedWord

await is a ReservedWord, so how can an Identifier ever be await?

As it turns out, Identifier cannot be await, but it can be something else whose StringValue is "await" — a different representation of the character sequence await.

Static semantics for identifier names define how the StringValue of an identifier name is computed. For example, the Unicode escape sequence for a is \u0061, so \u0061wait has the StringValue "await". \u0061wait won’t be recognized as a keyword by the lexical grammar; instead, it will be an Identifier. The static semantics forbid using it as a variable name inside async functions.

So this works:

function old() {
  var \u0061wait;
}

And this doesn’t:

async function modern() {
  var \u0061wait; // Syntax error
}
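The sketch below illustrates that the escaped spelling really does declare a variable whose name is "await". The names readEscaped and asyncVersionParses are made up for this example, and it assumes an environment that allows the Function constructor.

```javascript
// String.raw keeps the backslash, so the parser sees the escape sequence
// itself rather than an already unescaped "a".
const readEscaped = new Function(String.raw`var \u0061wait = 42; return await;`);

// Inside this non-async function, "await" is an ordinary identifier that
// refers to the variable declared with the escaped spelling.
const value = readEscaped();

// The same declaration inside an async function is rejected when parsing.
let asyncVersionParses = true;
try {
  new Function(String.raw`async function modern() { var \u0061wait; }`);
} catch (error) {
  if (!(error instanceof SyntaxError)) throw error;
  asyncVersionParses = false;
}
```

Here value ends up as 42, which shows that both spellings name the same binding, while asyncVersionParses ends up as false.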

Summary #

In this episode, we familiarized ourselves with the lexical grammar, the syntactic grammar, and the shorthands used for defining the syntactic grammar. As an example, we looked into forbidding using await as an identifier inside async functions but allowing it inside non-async functions.

Other interesting parts of the syntactic grammar, such as automatic semicolon insertion and cover grammars, will be covered in a later episode. Stay tuned!