Formal Language & Grammar

A Language $L$ is a set of finite-length strings
$$L \subseteq \Sigma^*$$

where $\Sigma$ is its alphabet/vocabulary, the set of atomic symbols allowed in the language.
See footnote 1 for the meaning of $^*$.

In an analogy to textual languages,
$\Sigma$ is the set of all letters in the language, and
$L$ is the set of all strings (words/sentences/paragraphs) that are grammatically correct.

To precisely define the contents of $L$, a language also has some formation rules. These rules govern how symbols may be concatenated into well-formed (allowed) strings:
$$L = \{\, w \in \Sigma^* \mid w \text{ is well-formed} \,\}$$
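This set-builder view can be sketched directly in code. Everything here (the parenthesis alphabet and the balanced-parentheses predicate) is my own toy illustration, not from the text:

```python
from itertools import product

def kleene_star(sigma, max_len):
    """Enumerate Sigma^* up to a length bound (the true set is infinite)."""
    for n in range(max_len + 1):
        for symbols in product(sigma, repeat=n):
            yield "".join(symbols)

def well_formed(w):
    """A hypothetical formation rule: balanced parentheses."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:  # closed a paren that was never opened
            return False
    return depth == 0

sigma = ("(", ")")
# L = { w in Sigma^* | w is well-formed }, truncated to length <= 4:
L = [w for w in kleene_star(sigma, 4) if well_formed(w)]
# L == ["", "()", "(())", "()()"]
```

The truncation is the only concession to finiteness; the definition itself quantifies over all of $\Sigma^*$.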

One formal way to define well-formed-ness is via a Grammar.

Grammar

Here we speak of the specialized semi-Thue / String Rewriting System (SRS) due to Noam Chomsky and Louis Hjelmslev.

A Grammar $G$ is a 4-element tuple
$$G = \langle N, \Sigma, P, S \rangle$$

where $N$ is the set of nonterminal symbols, $\Sigma$ the alphabet of terminal symbols (disjoint from $N$), $P$ the set of Production Rules, and $S \in N$ the start symbol.

Sentential Form

Here we introduce a classification, sentential form, to define the products of a grammar under the application of Production Rules, inductively:

1. (Base case) $S$ itself is a sentential form.
2. (Inductive step) if $awb$ is a sentential form (with $a, b \in (N \cup \Sigma)^*$) and $w \to z \in P$, then $azb$ is also a sentential form.

As a Binary Relation

The “operation” of a grammar is commonly defined as a relation on strings (as in an SRS).

Formally, the inductive step above is denoted as the binary relation $\underset{G}{\Rightarrow}$. I.e.,
$$x \underset{G}{\Rightarrow} y \iff \exists\, a, b, w, z \in (N \cup \Sigma)^* :\ (x = awb) \land (w \to z \in P) \land (y = azb)$$

Then, $\overset{*}{\underset{G}{\Rightarrow}}$ denotes the reflexive transitive closure of $\underset{G}{\Rightarrow}$.3

As such, the set of all sentential forms is $\big\{\, w \in (N \cup \Sigma)^* \;\big|\; S \overset{*}{\underset{G}{\Rightarrow}} w \,\big\}$.
Sentences are sentential forms that contain only symbols in $\Sigma$.
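As a concrete (toy) example of these definitions, take a grammar with rules $P = \{\, S \to \mathrm{a}S\mathrm{b},\ S \to \varepsilon \,\}$ (my own illustration, not from the text). One derivation is

$$S \;\underset{G}{\Rightarrow}\; \mathrm{a}S\mathrm{b} \;\underset{G}{\Rightarrow}\; \mathrm{aa}S\mathrm{bb} \;\underset{G}{\Rightarrow}\; \mathrm{aabb}, \qquad \text{so} \qquad S \;\overset{*}{\underset{G}{\Rightarrow}}\; \mathrm{aabb}.$$

Here $S$, $\mathrm{a}S\mathrm{b}$, and $\mathrm{aa}S\mathrm{bb}$ are sentential forms, while $\mathrm{aabb}$ is a sentence.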


Finally, the language of a grammar $G$, $L(G)$, is the set of derivable sentences.
Since $L \subseteq \Sigma^*$ and $\Sigma^* \subseteq (N \cup \Sigma)^*$,
$$L(G) = \Big\{\, w \in \Sigma^* \;\Big|\; S \overset{*}{\underset{G}{\Rightarrow}} w \,\Big\}.$$

Observe that we’re ultimately just using the SRS-style binary relation as a filter in a set-builder.
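That “relation as a filter” idea can be sketched in code. Everything below (the toy grammar for $a^nb^n$, the names, the length bound that forces termination) is my own illustration, not from the text:

```python
from collections import deque

N = {"S"}
SIGMA = {"a", "b"}
P = [("S", "aSb"), ("S", "")]  # toy grammar generating { a^n b^n : n >= 0 }
START = "S"

def step(x):
    """One application of =>_G : rewrite any one occurrence of a rule's LHS."""
    for lhs, rhs in P:
        i = x.find(lhs)
        while i != -1:
            yield x[:i] + rhs + x[i + len(lhs):]
            i = x.find(lhs, i + 1)

def language(max_len):
    """Breadth-first closure =>*_G from S, filtered down to sentences.

    In general derivability is only semi-decidable; the length bound is what
    makes this terminate, and pruning by length is only safe here because the
    epsilon-rule shrinks a string by just one symbol."""
    seen = {START}
    out = set()
    queue = deque([START])
    while queue:
        x = queue.popleft()
        if all(c in SIGMA for c in x):  # a sentential form with only terminals
            out.add(x)
        for y in step(x):
            if y not in seen and len(y) <= max_len + 2:
                seen.add(y)
                queue.append(y)
    return {w for w in out if w is not None and len(w) <= max_len}
```

For example, `language(4)` yields `{"", "ab", "aabb"}`: the derivation relation does the generating, and the set-builder keeps only terminal strings.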

Other Remarks

Intuitively, $N$ are abstract syntactic structures which “expand”, via steps in $P$, into strings of concrete alphabet symbols ($\Sigma$). $S$ is the origin of the expansion.
Then, a sentence is a “full” expansion (one which can expand no further), and $L$ is the set of all such full expansions.

If you know what an AST is: $S$ is the root of the tree, $N$ are subtrees (or nodes that denote structural sections), and $\Sigma$ are the leaf nodes. Then $P$ is just the set of rules for which nodes can be each others’ children, and how.


Note on Grammar vs Language:

Grammar and Language do not form a one-to-one relationship; it is many-to-one (many grammars may describe one language).

A language defines a set of concatenations of Terminal symbols as valid.
A grammar derives this set from “rules” (remember, a grammar is an elaborate set-builder), but those rules are not necessarily unique.

I.e., a language can be generated by more than one grammar, possibly even grammars of different classes.

E.g., for the “regular” classification:
A language is “regular” if it can be generated by a “regular” grammar; “regular” grammars always generate “regular” languages.

The Chomsky Hierarchy

Chomsky classified grammars into 4 categories based on their Production Rules.

Note: when discussing production rules, $\alpha$ and $\beta$ represent the LHS and RHS of a rule (around the rewrite arrow in the form $\alpha \rightarrow \beta$).

The types are listed in increasing order of constraint on Production Rules. Of the languages generated, each type is a proper superset of the ones below.
As such, each type’s constraints subsume those of the types above it.

“Automata recognition” is the minimum class of automata required to recognize whether a given string belongs to a grammar’s language.

Type-0 – Recursively Enumerable

Production Rule Constraints:
None

Automata Recognition: Turing Machine

Closure: union, intersection, concatenation, and Kleene star (but not complement)

The “root” type: every (Chomsky) Grammar generates a recursively enumerable language;
No constraints on Production Rules.

Observe that checking $S \overset{*}{\underset{G}{\Rightarrow}} w$ is exactly a Turing Machine’s job: it halts on members, but may run forever on non-members (hence “recursively enumerable” rather than “recursive”) :)

Type-1 – Context-Sensitive

Production Rule Constraints:
$|\alpha| \le |\beta|$

I.e., LHS not longer than RHS

Automata Recognition: Linear-bounded

Closure: union, intersection, complement, and concatenation

Type-2 – Context-Free

Production Rule Constraints:
$\alpha \in N$

I.e., LHS is a single nonterminal symbol

Automata Recognition: Pushdown

Closure: union, concatenation, and Kleene star (but not intersection or complement)

Type-3 – Regular

Production Rule Constraints:
$\beta = \mathrm{n}s \lor \beta = s$
OR (but the same within one grammar)
$\beta = s\mathrm{n} \lor \beta = s$
where $s \in \Sigma^*$, $\mathrm{n} \in N$

I.e., RHS contains at most one nonterminal, as the leftmost or rightmost symbol (with the placement consistent within the grammar).

Automata Recognition: Finite-state

Closure: union, intersection, complement, concatenation, and Kleene star, among others

Alternatively, regular grammars are exactly the left- or right-linear grammars.
Linear grammars have at most one nonterminal in the RHS.
Left- or right-linearity says that the position of that nonterminal is limited to the left- or right-most position, respectively.
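The per-type rule constraints above can be collected into a small checker. This is my own sketch (one character per symbol, and it glosses over the usual ε-production caveat between Type-1 and Type-2):

```python
def chomsky_type(P, N):
    """Most restrictive Chomsky type whose rule constraints all of P satisfies.

    Rules are (alpha, beta) strings, one character per symbol."""
    def regular(rules):
        # RHS has at most one nonterminal, consistently leftmost
        # (or consistently rightmost) across the whole grammar.
        def ok(leftmost):
            for _, b in rules:
                nts = [i for i, c in enumerate(b) if c in N]
                if len(nts) > 1:
                    return False
                if nts and nts[0] != (0 if leftmost else len(b) - 1):
                    return False
            return True
        return ok(True) or ok(False)

    if all(len(a) == 1 and a in N for a, _ in P):  # Type-2 constraint
        return 3 if regular(P) else 2
    if all(len(a) <= len(b) for a, b in P):        # Type-1 constraint
        return 1
    return 0

# S -> aSb | ε: LHS is a single nonterminal, but "aSb" has its
# nonterminal mid-string, so context-free yet not regular.
assert chomsky_type([("S", "aSb"), ("S", "")], {"S"}) == 2
assert chomsky_type([("S", "abS"), ("S", "")], {"S"}) == 3
```

Note the ordering: because each type subsumes the constraints above it, testing from most to least restrictive finds the tightest classification.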

A regular language can be defined by both regular grammars and regular expressions.
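As a quick check of that equivalence (the grammar $S \to \mathrm{ab}S \mid \varepsilon$ and the helper below are my own toy example), a right-linear derivation emits terminals left to right, so membership reduces to repeatedly stripping the emitted prefix:

```python
import re

# Toy right-linear grammar for (ab)*:  S -> abS | ε
def derivable(w):
    """Follow the single right-linear rule chain: strip "ab" until empty."""
    while w:
        if not w.startswith("ab"):
            return False
        w = w[2:]
    return True

# The grammar and the regular expression agree on every tested string:
for w in ["", "ab", "abab", "a", "ba", "aba"]:
    assert derivable(w) == bool(re.fullmatch(r"(ab)*", w))
```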


  1. $^\ast$ is the Kleene Star operator.
    It is a left-binding unary operator that means “0-or-more (but finite) repetitions”.
    On sets, that means “a sequence of 0-or-more of [any element in the set]”.
    So $A^*$ (for a set $A$) denotes the (infinite) set of all (possibly empty) concatenations of elements of $A$ (into finite-length strings) ↩︎

  2. this notation is explained at the end of the section ↩︎

  3. observe that $\underset{G}{\Rightarrow}$, intuitively, is just one application of a rule in $P$.
    Then, similar to the Kleene Star on $\Sigma$, $x \overset{*}{\underset{G}{\Rightarrow}} y$ means “$y$ is a result of applying rules in $P$ zero-or-more times on $x$”. ↩︎