Study notes on Parsing

By BYJU'S Exam Prep

Updated on: September 25th, 2023

Syntax Analyzer (Parser)

Syntax analyzer creates the syntactic structure of the given source program. This syntactic structure is mostly a parse tree. The syntax of programming is described by Context-Free Grammar (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.

Download Formulas for GATE Computer Science Engineering – Programming & Data Structures

Syntax Analyzer (Parser)

The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar or not. If it satisfies, the parser creates the parse tree of that program. Otherwise the parser gives the error messages.

What syntax analysis cannot do!

To check whether variables are of types on which operations are allowed
To check whether a variable has been declared before use
To check whether a variable has been initialized
These issues will be handled in semantic analysis

Download Formulas for GATE Computer Science Engineering – Algorithms

We categorise the parser into two groups

Top-down parser (starts from the root).
Bottom-up parser (starts from the leaf).

Both top-down and bottom-up parsers scan the input from left-to-right (one symbol at a time).
Efficient top-down and bottom-up parsers can be implemented only for subclasses of context-free grammars.

LL for top-down parsing
LR for bottom-up parsing)

Download Formulas for GATE Computer Science Engineering – Algorithms

Example: Consider the following grammar

E ::= E + T | E – T | T T ::= T * F | T / F | F F ::= num | id

For the input string: id(x) + num(2) * id(y)

Analysis of the top-down parsing:

E => E + T

=> E + T * F

=> T + T * F

=> T + F * F

=> T + num * F

=> F + num * F

=> id + num * F

=> id + num * id

Top down parsing uses left most derivation to derive the string and uses substitutions during derivation process.

Analysis of the Bottom-up parsing:

id(x) + num(2) * id(y)

=> id(x) + num(2) * F

=> id(x) + F * F

=> id(x) + T * F

=> id(x) + T

=> F + T

=> T + T

=> E + T

=> E

Bottom up parsing uses reverse of right most derivation to verify the string and uses reductions during the process.

Context-Free Grammars

Inherently recursive structures of a programming language are defined by a CFG. In a CFG, we have A start symbol (one of the non-terminals). A finite set of terminals (in our case, this will be the set of tokens). A set of non-terminals (syntactic variables).

A finite set of production rules in the following form

A → α, where A is non-terminal and a is a string of terminals (including the empty string).

Parse Trees

Graphical representation for a derivation that filters out the order of choosing non-terminals to avoid rewriting. The root node represents the start symbol, inner nodes of a parse tree are non-terminal Symbol.

Ambiguity

A grammar produces more than one parse tree for a sentence is called as ambiguous grammar. Unambiguous grammar refers unique selection of the parse tree for a sentence.

Ambiguity elimination:

. Ambiguity is problematic because meaning of the programs can be incorrect

. Ambiguity can be handled in several ways

Enforce associativity and precedence

Rewrite the grammar (cleanest way)

. There are no general techniques for handling ambiguity

. It is impossible to convert automatically an ambiguous grammar to an unambiguous one

Left Recursion

A grammar is left recursive, if it has a non-terminal A such that there is a derivation.

A ⇒ Aα for some string α

The left-recursion may appear in a single step of the derivation (immediate left recursion) or may appear in more than one step of the derivation.

A top down parser with production A → A α may loop forever

From the grammar A → A α | b left recursion may be eliminated by transforming the grammar to

A → b R

R → α R | ε

Left recursion is an issue of concern in top down parsers. A grammar is left-recursive if we can find some non-terminal A which will eventually derive a sentential form with itself as the left-symbol. In other words, a grammar is left recursive if it has a non-terminal A such that there is a derivation

A → + A a for some string a. These derivations may lead to an infinite loop.

Top-down parsing technique can’t handle left recursive grammars. So, we have to convert our left recursive grammar into an equivalent grammar which is not left recursive.

Removal of left recursion

In general

A → Aα₁|Aα₂|…|Aα_m

|β₁|β₂|…|β_n

Transforms to

A → β₁A’|β₂A’|….|β_nA”

A → α₁A’|α₂A’|…|α_mA’|€

Left Factoring

A predictive parser (a top-down parser without backtracking) insists that the grammar must be left factored.

grammar → a new equivalent grammar suitable for predictive parsing.

stmt → if expr then stmt else stmt | if expr then stmt

When we see, if we can’t know which production rule is to be chosen then rewrite stmt in the derivation,

In general,

where α is not empty and the first symbols of β₁ and β₂ (if they have one) are different.

When processing α, we can’t know whether expand

A → αβ₁| αβ₂

But, if we rewrite the grammar as follows

A → αA′ A′ → β₁ | β₂, so we can immediately expand A to αA′.

Dangling else problem can be handled by left factoring

stmt → if expr then stmt else stmt | if expr then stmt

can be transformed to

stmt → if expr then stmt S’

S’ → else stmt | ε

Top-down Parsing

There are two main techniques to achieve top-down parse tree

Recursive descent parsing
Predictive parsing

Recursive Descent Parsing (Uses Backtracking)

Backtracking is needed (if a choice of a production rule does not work, we backtrack to try other alternatives). It tries to find the left most derivation. It is not efficient.

e.g., If the grammar is

S → aBc

B → bc|b and the input is abc

Predictive Parser

. A non recursive top down parsing method

. Parser “predicts” which production to use

. It removes backtracking by fixing one production for every non-terminal and input token(s)

. Predictive parsers accept LL(k) languages

First L stands for left to right scan of input

Second L stands for leftmost derivation

k stands for number of lookahead token

. In practice LL(1) is used

Functions used in Constructing LL (1) Parsing Tables

Two functions are used in the construction of LL (1) parsing tables: FIRST and FOLLOW.
FIRST (α) is a set of the terminal symbols which occur as first symbols in strings derived from α, where α is any string of grammar symbols. If α derives to â, then ∈ is also in FIRST (α).
FOLLOW (A) is the set of the terminals which occur immediately after (FOLLOW) the non-terminal A in the strings derived from the starting symbol.
First set is computed for all non-terminals, but follow set is computed only for those non-terminals in their first set contain epsilon.
For every terminal x in FIRST(X), there is an entry (production which derives x) in LL(1) table when x is not an epsilon.
For every terminal y in Follow(Y), there is an entry (null production) in the table.

To Compute FIRST of any String X

If X is a terminal symbol → FIRST (X) = {X}
If X is a non-terminal symbol and X → ε is a production rule → ε is in FIRST (X).
If X is a non-terminal symbol and X → Y₁, Y₂, …. , Y_n is a production rule. If a terminal a in FIRST (Y_j) and ε is in all FIRST (Y_j) for j = 1, … , i -1, then a is in FIRST (X). If ε is in all FIRST (Y_j) for j = 1,… n, then ε is in FIRST (X).
If X is ε, then FIRST (X)= {∈ }
If X is Y₁, Y₂, … Y_nIf a terminal a in FIRST (Y_i) and ∈ is in all FIRST (Y_j) for j = 1,… i -1, then a is in FIRST (X). If ∈ is in all FIRST (Y_j) for j =1,..n, then ∈ is in FIRST (X).

Example:

For the expression grammar

E → T E’

E’ →+T E’ | ε

T → F T’

T’ → * F T’ | ε

F → ( E ) | id

First(E) = First(T) = First(F) = { (, id }

First(E’) = {+, ε }

First(T’) = { *, ε }

To compute FOLLOW (for Non-terminals):

If S is the start symbol, $ is in FOLLOW (S).

If A → αBβ is a production rule, then everything in FIRST (β) is FOLLOW (B) except ∈.
If (A → αB is a production rule) or (A → αBβ is a production rule and ∈ is in FIRST (β) then everything in FOLLOW (A) is in FOLLOW (B).
Apply these rules until nothing more can be added to any FOLLOW set.

LL(1) Parsing algorithm

The parsing table is a two-dimensional array M [X,a] , where X is a non-terminal, and a is a terminal or the symbol $.
The parser considers ‘X’ the symbol on top of stack, and ‘a’ the current input symbol
These two symbols determine the action to be taken by the parser
Assume that ‘$’ is a special token that is at the bottom of the stack and terminates the input string.

If X = a = $, the parser halts and announces successful completion of parsing.
If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X,a] = {X → UVW}, the parser replaces X on top of the stack by UVW (with U on the top). If M[X,a] = error, the parser calls an error recovery routine.

Example: Consider the grammar

E → T E’

E’ → +T E’ | ε

T → F T’

T’ → * F T’ | ε

F → ( E ) | id

Parse table for the grammar is given below:

For the above grammar and parsing table, we verify the string “id + id * id” in the following way with the help of parsing algorithm.

Bottom-up Parsing Techniques

A bottom-up parser creates the parse tree of the given input string from leaves towards the root. A bottom-up parser tries to find the right most derivation of the given input in the reverse order.

Bottom-up parsing is also known as shift reduce parsing.

A more powerful parsing technique
LR grammars – more expensive than LL
Can handle left recursive grammars
Can handle virtually all the programming languages
Natural expression of programming language syntax
Automatic generation of parsers (Yacc, Bison etc.)
Detects errors as soon as possible
Allows better error recovery

Shift Reduce Parsing: A shift reduce parser tries to reduce the given input string into the starting symbol. At each reduction step, a substring of the input matching to the right side of a production rule is replaced by the non-terminal at the left side of that production rule.

Handle: A handle of a string is a substring that matches the right side of a production rule.

Handles always appear at the top of the stack and never inside it.
This makes stack a suitable data structure.

Actions: There are four possible actions of a shift reduce parser

Shift: The next input symbol is shifted onto the top of the stack.
Reduce: Replace the handle on the top of the stack by the non-terminal.
Accept: Successful completion of parsing.
Error: Parser discovers a syntax error and calls an error recovery routine.

Conflicts During Shift Reduce Parsing

There are CFGs for which shift reduce parser can’t be used. Stack contents and the next input symbol may not decide action.

The general shift-reduce technique is:

if there is no handle on the stack then shift
If there is a handle then reduce However, what happens when there is a choice?
What action to take in case both shift and reduce are valid?

Shift/Reduce Conflict: Whether make a shift operation or a reduction.

Reduce/Reduce Conflict: The parser can’t decide which of several reductions to make.

Types of Shift Reduce Parsing: There are two main categories of shift reduce parsers

Operator Precedence Parser: Simple, but supports only a small class of grammars.

LR Parsers:

LR parsers accept LR(k) languages L stands for left to right scan of input R stands for rightmost derivation k stands for number of lookahead tokens

Types of LR Parsers:

SLR (Simple) LR parser
CLR (general) LR parser (canonical LR)
LALR (Intermediate) LR parser (look-ahead LR)

SLR, CLR and LALR work in same way, but their parsing tables may different.

Relative power of various classes :

SLR(1) ≤ LALR(1) ≤ LR(1)

SLR(k) ≤ LALR(k) ≤ LR(k)

LL(k)≤ LR(k)

SLR (1) < LALR (1) < LR (1)

SLR (k) < LALR (1) < LR (k)

LL (k) < LR (k)

LR parsing: LR parsing is most general non-back tracking shift reduce parsing. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers.

LL (1) grammars ⊆ LR (1)) grammars

An LR parser can detect a syntactic error as soon as it is possible.

A configuration of a LR parsing is

(S₀X₁ S₁ … X_m S_m, a_i a_i-1 … a_n $)

Stack Rest of input

S_m and a_i decides the parser action by consulting the parsing action table (initial stack contains just S₀).
A configuration of a LR parsing represents the right sentential form

X₁ …. X_m a_i a_i-1 … a_n $

LR Parser Actions

Shift S: Shift the next input symbol and the state S onto the stack

(S₀X₁ S₁ … X_m S_m, a_i a_i-1 … a_n $) → (S₀X₁ S₁ … X_m S_m, a_iS, a_i-1 … a_n $)

Reduce A → β: Pop 2|β| (= r) items from the stack; let us assume that β = Y_l, Y₂ … , Y_r

Then, push A and S, where S = goto [S_{m – r} , A]

(S₀X₁ S₁ … X_m S_m, a_i a_i+1 … a_n $) → (S₀X₁ S₁ … X_m-r S_m-r, AS, a_i, a_i-1 … a_n $)

Accept: Parsing successfully completed.

Error: Parser detected an error (an empty entry in the action table).

Example:

Consider the grammar And its parse table E E + T | T

T → T * F | F F → ( E ) | id

Parse id + id * id using the given grammar and bottom up parsing table.

Answer:

Operator Precedence Parsing

In an operator grammar, no production rule can have E at the right side and two adjacent non-terminals at the right side.

Precedence Relations: In operator precedence parsing, we define three disjoint precedence relations between certain pair of terminals.

a < b, b has higher precedence than a.

a = b, b has same precedence as a.

a > b, b has lower precedence than a.

The determination of correct precedence relation between terminals are based on the traditional notions of associativity and precedence of operator

You can follow the detailed champion study plan for GATE CS 2022 from the following link:

Study notes on Parsing

Syntax Analyzer (Parser)

Syntax Analyzer (Parser)

Context-Free Grammars

Parse Trees

Ambiguity

Left Recursion

Removal of left recursion

Left Factoring

Top-down Parsing

Recursive Descent Parsing (Uses Backtracking)

Predictive Parser

Functions used in Constructing LL (1) Parsing Tables

Conflicts During Shift Reduce Parsing

LR Parser Actions

Operator Precedence Parsing

Detailed GATE CSE 2022 Champion Study Plan

Click Here to Avail GATE CSE Test Series!(100+ Mock Tests)

Click here to avail Online Classroom Program for Computer Science Engineering

Recent Posts