A Language $L$ is a set of finite-length strings, $L \subseteq \Sigma^*$,
where $\Sigma$ is its alphabet/vocabulary, the set of atomic symbols allowed in the language.
See footnote 1 for the meaning of $\Sigma^*$.
In an analogy to textual languages,
$\Sigma$ is the set of all letters in the language, and
$L$ is the set of all strings (words/sentences/paragraphs) that are grammatically correct.
To precisely define the contents of $L$, a language also has some formation rules. These rules govern how symbols may be concatenated into well-formed (allowed) strings.
One formal way to define well-formedness is via a Grammar.
Here we speak of the specialized semi-Thue / String Rewriting System (SRS) formalism due to Noam Chomsky and Louis Hjelmslev.
A Grammar $G$ is a 4-element tuple $G = (N, \Sigma, P, S)$:
$N$ – Nonterminal symbols
finite set disjoint from $\Sigma$ 2
$\Sigma$ – Terminal symbols (the alphabet)
finite set disjoint from $N$
$P$ – Production Rules
finite set of “rules” in the form of $\alpha \to \beta$
i.e., implications where $\alpha$ and $\beta$ are strings of symbols from $(N \cup \Sigma)^*$ and $\alpha$ contains at least one symbol in $N$
$S$ – Start symbol
a distinguished symbol in $N$
functions as the starting point for the application of Production Rules
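As a concrete (hypothetical) sketch, the 4-tuple can be written down as plain Python data. Here it is for the classic grammar generating $\{\, a^n b^n : n \ge 1 \,\}$; the encoding (single-character symbols, rules as string pairs) is my own choice, not anything from the text:

```python
# A minimal, hypothetical encoding of a grammar G = (N, Sigma, P, S):
# symbols are single characters, and each production rule alpha -> beta
# is a pair of strings (alpha, beta).

N = {"S"}                       # N: nonterminal symbols
SIGMA = {"a", "b"}              # Sigma: terminal symbols (the alphabet)
P = [("S", "aSb"),              # P: production rules  S -> aSb
     ("S", "ab")]               #                      S -> ab
S = "S"                         # S: start symbol

# Sanity checks straight from the definition:
assert N.isdisjoint(SIGMA)                              # N and Sigma are disjoint
assert S in N                                           # S is a nonterminal
assert all(any(c in N for c in a) for a, b in P)        # each LHS has a nonterminal
```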
Here we introduce a classification, sentential form, to define the products of a grammar under the application of Production Rules, inductively:
$S$ is a sentential form (base case);
if $u \gamma v$ is a sentential form and $(\gamma \to \delta) \in P$, then $u \delta v$ is also a sentential form (inductive step).
The “operation” of a grammar is commonly defined as a relation on strings (as in an SRS).
Formally, the inductive step above is denoted as the binary relation $\Rightarrow$. I.e.,
$x \Rightarrow y \iff \exists\, u, v, \gamma, \delta : x = u \gamma v,\ y = u \delta v,\ (\gamma \to \delta) \in P$
Then, $\Rightarrow^*$ denotes the reflexive transitive closure of $\Rightarrow$.3
As such, the set of all sentential forms is $\{\, w \in (N \cup \Sigma)^* : S \Rightarrow^* w \,\}$.
Sentences are sentential forms that contain only symbols in $\Sigma$
Finally, the language of a grammar $G$, $L(G)$, is the set of derivable sentences.
Since sentences contain only symbols in $\Sigma$ and sentential forms satisfy $S \Rightarrow^* w$,
$L(G) = \{\, w \in \Sigma^* : S \Rightarrow^* w \,\}$
Observe that we’re ultimately just using the SRS-style binary relation $\Rightarrow^*$ as a filter in a set-builder.
Intuitively, symbols in $N$ are abstract syntactic structures which “expand”, via steps of $\Rightarrow$, to a string of concrete alphabet symbols (members of $\Sigma^*$). $S$ is the origin of the expansion.
Then, a sentence is a “full” expansion (which can expand no more) and $L(G)$ is all such full expansions.
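The whole pipeline (one-step rewriting $\Rightarrow$, its closure $\Rightarrow^*$, and the terminal-only filter) can be sketched in a few lines of Python. This is a brute-force toy for small examples, not an efficient parser:

```python
from collections import deque

def step(w, P):
    """All strings reachable from w by one rule application (the relation =>)."""
    out = set()
    for alpha, beta in P:
        i = w.find(alpha)
        while i != -1:
            out.add(w[:i] + beta + w[i + len(alpha):])
            i = w.find(alpha, i + 1)
    return out

def sentences(start, P, sigma, max_len):
    """Sentences derivable via =>*: sentential forms made only of terminals."""
    seen, frontier, result = {start}, deque([start]), set()
    while frontier:
        w = frontier.popleft()
        if all(c in sigma for c in w):   # terminal-only, so it is a sentence
            result.add(w)
            continue
        for v in step(w, P):
            if len(v) <= max_len and v not in seen:
                seen.add(v)
                frontier.append(v)
    return result

# L(G) for S -> aSb | ab, truncated to strings of length <= 8:
P = [("S", "aSb"), ("S", "ab")]
print(sorted(sentences("S", P, {"a", "b"}, 8), key=len))
# ['ab', 'aabb', 'aaabbb', 'aaaabbbb']
```

The length cap is needed because $L(G)$ is infinite; the set-builder filter from above appears as the `all(c in sigma ...)` test.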
If you know what an AST is: $S$ is the root of the tree, symbols in $N$ are subtrees (or nodes that denote structural sections), and symbols in $\Sigma$ are the leaf nodes. Then $P$ is just the set of rules for which nodes exist and how nodes can be each other’s children.
Note on Grammar vs Language:
Grammar and Language do not form a one-to-one relationship. It is many-to-one.
A language defines a set of concatenations of Terminal symbols as valid.
A grammar derives this set from “rules” (remember a grammar is an elaborate set-builder), but the rules are not necessarily unique.
I.e., a language could be represented by more than one different (class of) grammar.
E.g., for the “regular” classification:
A language is “regular” if it could be generated by a “regular” grammar; “regular” grammars always generate “regular” languages.
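A quick sketch of the many-to-one point: below, a right-linear and a left-linear grammar both generate the language $\{\, a^n : n \ge 1 \,\}$. The brute-force `generate` helper is hypothetical, just enumerating derivable sentences by string rewriting:

```python
from collections import deque

def generate(P, start, sigma, max_len):
    """Brute force: all sentences of length <= max_len derivable from start."""
    seen, out, q = {start}, set(), deque([start])
    while q:
        w = q.popleft()
        if all(c in sigma for c in w):          # terminal-only: a sentence
            out.add(w)
            continue
        for alpha, beta in P:
            for i in range(len(w) - len(alpha) + 1):
                if w.startswith(alpha, i):      # rewrite one occurrence of alpha
                    v = w[:i] + beta + w[i + len(alpha):]
                    if len(v) <= max_len and v not in seen:
                        seen.add(v)
                        q.append(v)
    return out

# Two *different* grammars, one language: { a, aa, aaa, ... }
right_linear = [("S", "aS"), ("S", "a")]    # S -> aS | a
left_linear  = [("S", "Sa"), ("S", "a")]    # S -> Sa | a
A = generate(right_linear, "S", {"a"}, 6)
B = generate(left_linear,  "S", {"a"}, 6)
assert A == B == {"a" * n for n in range(1, 7)}
```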
Chomsky classified grammars into 4 categories based on their Production Rules.
Note: when discussing production rules, $\alpha$ and $\beta$ represent the LHS and RHS of a rule (around the implication operator, in the form $\alpha \to \beta$).
The types are listed in increasing restriction on Production Rules. Of the languages generated, each type is a proper superset of the ones below.
As such, each type’s constraints include those of its superclasses.
“Automata recognition” is the minimum class of automata required to verify whether a given string satisfies a grammar.
Type-0: recursively enumerable
Production Rule Constraints:
None
Automata Recognition: Turing Machine
Closure: brrrrr (union, intersection, concatenation, Kleene star; not complement)
“Root-type”; any (Chomsky) Grammar as defined above is Type-0, and its language is recursively enumerable;
no constraints on Production Rules beyond the general $\alpha \to \beta$ form.
Observe that such a $G$ corresponds to a Turing Machine that halts (accepts) exactly on the strings of $L(G)$ :)
Type-1: context-sensitive
Production Rule Constraints:
$|\alpha| \le |\beta|$
I.e., LHS not longer than RHS
Automata Recognition: Linear-bounded
Closure: union, intersection, complement, concatenation, Kleene star
Type-2: context-free
Production Rule Constraints:
$\alpha \in N$
I.e., LHS is a single nonterminal symbol
Automata Recognition: Pushdown
Closure: union, concatenation, Kleene star (not intersection or complement)
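As a quick illustration of why a pushdown (one stack) suffices: the context-free language of balanced parentheses, generated by the grammar $S \to (S)S \mid \varepsilon$, can be recognized with nothing more than a stack. Since only one symbol is ever pushed, the stack collapses into a counter. The grammar and function are illustrative examples, not from the text:

```python
def balanced(w: str) -> bool:
    """Recognize the context-free language S -> (S)S | epsilon
    with a one-symbol pushdown automaton (its stack is just a counter)."""
    depth = 0
    for c in w:
        if c == "(":
            depth += 1            # push
        elif c == ")":
            if depth == 0:        # pop on an empty stack: reject
                return False
            depth -= 1            # pop
        else:
            return False          # symbol outside the alphabet {(, )}
    return depth == 0             # accept iff the stack is empty

assert balanced("") and balanced("(()())")
assert not balanced("())(") and not balanced("((")
```

A finite-state machine cannot do this, because the nesting depth (and hence the counter) is unbounded; that is exactly why this language is context-free but not regular.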
Type-3: regular
Production Rule Constraints:
$A \to a$, $A \to aB$ OR $A \to a$, $A \to Ba$ (but the same side within one grammar)
where $A, B \in N$, $a \in \Sigma \cup \{\varepsilon\}$
I.e., RHS contains at most one nonterminal, as the leftmost or rightmost symbol (with the placement consistent within the grammar).
Automata Recognition: Finite-state
Closure: union, intersection, complement, concatenation, Kleene star, reversal
Alternatively, regular grammars are either left- or right-linear grammars.
Linear grammars have at most one nonterminal in the RHS.
Left- or right-ness states whether the position of that nonterminal is limited to the left- or right-most position.
A regular language can be defined by both regular grammars and regular expressions.
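To make the grammar/regex correspondence concrete, here is a hypothetical right-linear grammar for “strings over $\{a, b\}$ ending in $ab$”, simulated directly as a finite-state (NFA) recognizer and cross-checked against the equivalent regex `(a|b)*ab`:

```python
import re
from itertools import product

# Hypothetical right-linear grammar: S -> aS | bS | aA ; A -> b
# Nonterminals act as NFA states; a rule A -> aB is a transition A --a--> B,
# and a terminal-only rule A -> a may only consume the final input symbol.
RULES = {"S": [("a", "S"), ("b", "S"), ("a", "A")],
         "A": [("b", None)]}          # None marks "this rule ends the string"

def accepts(w: str) -> bool:
    states = {"S"}                                  # current NFA state set
    for i, c in enumerate(w):
        nxt = set()
        for q in states:
            for sym, target in RULES.get(q, []):
                if sym != c:
                    continue
                if target is None:
                    if i == len(w) - 1:             # terminal rule on last symbol
                        nxt.add("ACCEPT")
                else:
                    nxt.add(target)
        states = nxt
    return "ACCEPT" in states

# The grammar and the regex agree on every string up to length 5:
for n in range(6):
    for w in map("".join, product("ab", repeat=n)):
        assert accepts(w) == bool(re.fullmatch(r"(a|b)*ab", w))
```

The same construction works for any right-linear grammar, which is one direction of the regular grammar / finite automaton equivalence.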
$*$ is the Kleene Star operator.
It is a left-binding unary operator that means “0-or-more (but finite) repetitions”.
On sets, that means “a sequence of 0-or-more of [any element in the set]”.
So $\Sigma^*$ (for a set $\Sigma$) denotes the (infinite) set of all (possibly empty) concatenations of elements in $\Sigma$ (into finite-length strings) ↩︎
this notation is explained at the end of the section ↩︎
observe that $x \Rightarrow y$, intuitively, is just one application of a rule in $P$.
Then, similar to the Kleene Star on $\Sigma$, $x \Rightarrow^* y$ means “$y$ is a result of applying rules in $P$ zero-or-more times on $x$”. ↩︎