Formal Language & Grammar

A Language $L$ is a set of finite-length strings
$$L \subseteq \Sigma^*$$

where $\Sigma$ is its alphabet/vocabulary, the set of atomic symbols allowed in the language.
See footnote 1 for the meaning of $^*$.

In an analogy to textual languages,
$\Sigma$ is the set of all letters in the language, and
$L$ is the set of all strings (words/sentences/paragraphs) that are grammatically correct.

To precisely define the contents of $L$, a language also has some formation rules. These rules govern how symbols may be concatenated into well-formed (allowed) strings:
$$L = \{\, w \in \Sigma^* \mid w \text{ is well-formed} \,\}$$
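This set-builder view can be sketched directly in code. Everything here (the parenthesis alphabet and the balanced-parentheses predicate) is my own toy illustration, not from the text:

```python
from itertools import product

def kleene_star(sigma, max_len):
    """Enumerate Sigma^* up to a length bound (the true set is infinite)."""
    for n in range(max_len + 1):
        for symbols in product(sigma, repeat=n):
            yield "".join(symbols)

def well_formed(w):
    """A hypothetical formation rule: balanced parentheses."""
    depth = 0
    for c in w:
        depth += 1 if c == "(" else -1
        if depth < 0:  # closed a paren that was never opened
            return False
    return depth == 0

sigma = ("(", ")")
# L = { w in Sigma^* | w is well-formed }, truncated to length <= 4:
L = [w for w in kleene_star(sigma, 4) if well_formed(w)]
# L == ["", "()", "(())", "()()"]
```

The truncation is the only concession to finiteness; the definition itself quantifies over all of $\Sigma^*$.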

One formal way to define well-formed-ness is via a Grammar.

Grammar

Here we speak of the specialized semi-Thue / String Rewriting System (SRS) due to Noam Chomsky and Louis Hjelmslev.

A Grammar $G$ is a 4-element tuple
$$G = \langle N, \Sigma, P, S \rangle$$

where $N$ is the set of nonterminal symbols, $\Sigma$ the alphabet of terminal symbols (disjoint from $N$), $P$ the set of Production Rules, and $S \in N$ the start symbol.

Sentential Form

Here we introduce a classification, sentential form, to define the products of a grammar under the application of Production Rules, inductively:

1. (Base case) $S$ itself is a sentential form.
2. (Inductive step) if $awb$ is a sentential form (with $a, b \in (N \cup \Sigma)^*$) and $w \to z \in P$, then $azb$ is also a sentential form.

As a Binary Relation

The “operation” of a grammar is commonly defined as a relation on strings (as in an SRS).

Formally, the inductive step above is denoted as the binary relation $\underset{G}{\Rightarrow}$. I.e.,
$$x \underset{G}{\Rightarrow} y \iff \exists\, a, b, w, z \in (N \cup \Sigma)^* :\ (x = awb) \land (w \to z \in P) \land (y = azb)$$

Then, $\overset{*}{\underset{G}{\Rightarrow}}$ denotes the reflexive transitive closure of $\underset{G}{\Rightarrow}$.3

As such, the set of all sentential forms is $\big\{\, w \in (N \cup \Sigma)^* \;\big|\; S \overset{*}{\underset{G}{\Rightarrow}} w \,\big\}$.
Sentences are sentential forms that contain only symbols in $\Sigma$.
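As a concrete (toy) example of these definitions, take a grammar with rules $P = \{\, S \to \mathrm{a}S\mathrm{b},\ S \to \varepsilon \,\}$ (my own illustration, not from the text). One derivation is

$$S \;\underset{G}{\Rightarrow}\; \mathrm{a}S\mathrm{b} \;\underset{G}{\Rightarrow}\; \mathrm{aa}S\mathrm{bb} \;\underset{G}{\Rightarrow}\; \mathrm{aabb}, \qquad \text{so} \qquad S \;\overset{*}{\underset{G}{\Rightarrow}}\; \mathrm{aabb}.$$

Here $S$, $\mathrm{a}S\mathrm{b}$, and $\mathrm{aa}S\mathrm{bb}$ are sentential forms, while $\mathrm{aabb}$ is a sentence.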


Finally, the language of a grammar $G$, $L(G)$, is the set of derivable sentences.
Since $L \subseteq \Sigma^*$ and $\Sigma^* \subseteq (N \cup \Sigma)^*$,
$$L(G) = \Big\{\, w \in \Sigma^* \;\Big|\; S \overset{*}{\underset{G}{\Rightarrow}} w \,\Big\}.$$

Observe that we’re ultimately just using the SRS-style binary relation as a filter in a set-builder.
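That “relation as a filter” idea can be sketched in code. Everything below (the toy grammar for $a^nb^n$, the names, the length bound that forces termination) is my own illustration, not from the text:

```python
from collections import deque

N = {"S"}
SIGMA = {"a", "b"}
P = [("S", "aSb"), ("S", "")]  # toy grammar generating { a^n b^n : n >= 0 }
START = "S"

def step(x):
    """One application of =>_G : rewrite any one occurrence of a rule's LHS."""
    for lhs, rhs in P:
        i = x.find(lhs)
        while i != -1:
            yield x[:i] + rhs + x[i + len(lhs):]
            i = x.find(lhs, i + 1)

def language(max_len):
    """Breadth-first closure =>*_G from S, filtered down to sentences.

    In general derivability is only semi-decidable; the length bound is what
    makes this terminate, and pruning by length is only safe here because the
    epsilon-rule shrinks a string by just one symbol."""
    seen = {START}
    out = set()
    queue = deque([START])
    while queue:
        x = queue.popleft()
        if all(c in SIGMA for c in x):  # a sentential form with only terminals
            out.add(x)
        for y in step(x):
            if y not in seen and len(y) <= max_len + 2:
                seen.add(y)
                queue.append(y)
    return {w for w in out if w is not None and len(w) <= max_len}
```

For example, `language(4)` yields `{"", "ab", "aabb"}`: the derivation relation does the generating, and the set-builder keeps only terminal strings.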

Other Remarks

Intuitively, $N$ are abstract syntactic structures which “expand”, via steps in $P$, into strings of concrete alphabet symbols ($\Sigma$). $S$ is the origin of the expansion.
Then, a sentence is a “full” expansion (one which can expand no further), and $L$ is the set of all such full expansions.

If you know what an AST is: $S$ is the root of the tree, $N$ are subtrees (or nodes that denote structural sections), and $\Sigma$ are the leaf nodes. Then $P$ is just the set of rules for which nodes can be each others’ children, and how.


Note on Grammar vs Language:

Grammar and Language do not form a one-to-one relationship; it is many-to-one (many grammars may describe one language).

A language defines a set of concatenations of Terminal symbols as valid.
A grammar derives this set from “rules” (remember, a grammar is an elaborate set-builder), but those rules are not necessarily unique.

I.e., a language can be generated by more than one grammar, possibly even grammars of different classes.

E.g., for the “regular” classification:
A language is “regular” if it can be generated by a “regular” grammar; “regular” grammars always generate “regular” languages.

The Chomsky Hierarchy

Chomsky classified grammars into 4 categories based on their Production Rules.

Note: when discussing production rules, $\alpha$ and $\beta$ represent the LHS and RHS of a rule (around the rewrite arrow in the form $\alpha \rightarrow \beta$).

The types are listed in increasing order of constraint on Production Rules. Of the languages generated, each type is a proper superset of the ones below.
As such, each type’s constraints subsume those of the types above it.

“Automata recognition” is the minimum class of automata required to recognize whether a given string belongs to a grammar’s language.

Type-0 – Recursively Enumerable

Production Rule Constraints:
None

Automata Recognition: Turing Machine

Closure: union, intersection, concatenation, and Kleene star (but not complement)

The “root” type: every (Chomsky) Grammar generates a recursively enumerable language;
No constraints on Production Rules.

Observe that checking $S \overset{*}{\underset{G}{\Rightarrow}} w$ is exactly a Turing Machine’s job: it halts on members, but may run forever on non-members (hence “recursively enumerable” rather than “recursive”) :)

Type-1 – Context-Sensitive

Production Rule Constraints:
$|\alpha| \le |\beta|$

I.e., LHS not longer than RHS

Automata Recognition: Linear-bounded

Closure: union, intersection, complement, and concatenation

Type-2 – Context-Free

Production Rule Constraints:
$\alpha \in N$

I.e., LHS is a single nonterminal symbol

Automata Recognition: Pushdown

Closure: union, concatenation, and Kleene star (but not intersection or complement)

Type-3 – Regular

Production Rule Constraints:
$\beta = \mathrm{n}s \lor \beta = s$
OR (but the same within one grammar)
$\beta = s\mathrm{n} \lor \beta = s$
where $s \in \Sigma^*$, $\mathrm{n} \in N$

I.e., RHS contains at most one nonterminal, as the leftmost or rightmost symbol (with the placement consistent within the grammar).

Automata Recognition: Finite-state

Closure: union, intersection, complement, concatenation, and Kleene star, among others

Alternatively, regular grammars are exactly the left- or right-linear grammars.
Linear grammars have at most one nonterminal in the RHS.
Left- or right-linearity says that the position of that nonterminal is limited to the left- or right-most position, respectively.
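The per-type rule constraints above can be collected into a small checker. This is my own sketch (one character per symbol, and it glosses over the usual ε-production caveat between Type-1 and Type-2):

```python
def chomsky_type(P, N):
    """Most restrictive Chomsky type whose rule constraints all of P satisfies.

    Rules are (alpha, beta) strings, one character per symbol."""
    def regular(rules):
        # RHS has at most one nonterminal, consistently leftmost
        # (or consistently rightmost) across the whole grammar.
        def ok(leftmost):
            for _, b in rules:
                nts = [i for i, c in enumerate(b) if c in N]
                if len(nts) > 1:
                    return False
                if nts and nts[0] != (0 if leftmost else len(b) - 1):
                    return False
            return True
        return ok(True) or ok(False)

    if all(len(a) == 1 and a in N for a, _ in P):  # Type-2 constraint
        return 3 if regular(P) else 2
    if all(len(a) <= len(b) for a, b in P):        # Type-1 constraint
        return 1
    return 0

# S -> aSb | ε: LHS is a single nonterminal, but "aSb" has its
# nonterminal mid-string, so context-free yet not regular.
assert chomsky_type([("S", "aSb"), ("S", "")], {"S"}) == 2
assert chomsky_type([("S", "abS"), ("S", "")], {"S"}) == 3
```

Note the ordering: because each type subsumes the constraints above it, testing from most to least restrictive finds the tightest classification.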

A regular language can be defined by both regular grammars and regular expressions.
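As a quick check of that equivalence (the grammar $S \to \mathrm{ab}S \mid \varepsilon$ and the helper below are my own toy example), a right-linear derivation emits terminals left to right, so membership reduces to repeatedly stripping the emitted prefix:

```python
import re

# Toy right-linear grammar for (ab)*:  S -> abS | ε
def derivable(w):
    """Follow the single right-linear rule chain: strip "ab" until empty."""
    while w:
        if not w.startswith("ab"):
            return False
        w = w[2:]
    return True

# The grammar and the regular expression agree on every tested string:
for w in ["", "ab", "abab", "a", "ba", "aba"]:
    assert derivable(w) == bool(re.fullmatch(r"(ab)*", w))
```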


  1. $^\ast$ is the Kleene Star operator.
    It is a left-binding unary operator that means “0-or-more (but finite) repetitions”.
    On sets, that means “a sequence of 0-or-more of [any element in the set]”.
    So $A^*$ (for a set $A$) denotes the (infinite) set of all (possibly empty) concatenations of elements of $A$ (into finite-length strings) ↩︎

  2. this notation is explained at the end of the section ↩︎

  3. observe that $\underset{G}{\Rightarrow}$, intuitively, is just one application of a rule in $P$.
    Then, similar to the Kleene Star on $\Sigma$, $x \overset{*}{\underset{G}{\Rightarrow}} y$ means “$y$ is a result of applying rules in $P$ zero-or-more times on $x$”. ↩︎