# Study notes on Parsing

By BYJU'S Exam Prep

Updated on: September 25th, 2023

**Syntax Analyzer (Parser)**

The syntax analyzer creates the syntactic structure of the given source program. This syntactic structure is mostly a parse tree. The syntax of a programming language is described by a **Context-Free Grammar** (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.



The syntax analyzer (parser) checks whether a given source program satisfies the rules implied by a context-free grammar. If it does, the parser creates the parse tree of that program; otherwise, the parser reports error messages.

**What syntax analysis cannot do!**

- Check whether variables are of types on which the applied operations are allowed
- Check whether a variable has been declared before use
- Check whether a variable has been initialized

These issues are handled in semantic analysis.


We categorize parsers into two groups:

- Top-down parsers (start from the root).
- Bottom-up parsers (start from the leaves).

Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time). Efficient top-down and bottom-up parsers can be implemented only for subclasses of context-free grammars:

- LL for top-down parsing
- LR for bottom-up parsing


**Example: Consider the following grammar**

E ::= E + T | E - T | T

T ::= T * F | T / F | F

F ::= num | id

**For the input string:** id(x) + num(2) * id(y)

**Analysis of the top-down parsing:**

E => E + T

=> T + T

=> F + T

=> id + T

=> id + T * F

=> id + F * F

=> id + num * F

=> id + num * id

Top-down parsing uses the **leftmost derivation** to derive the string and applies **substitutions** (expanding the leftmost non-terminal) during the derivation process.

**Analysis of the Bottom-up parsing:**

id + num * id

=> F + num * id

=> T + num * id

=> E + num * id

=> E + F * id

=> E + T * id

=> E + T * F

=> E + T

=> E

Bottom-up parsing uses the **reverse of the rightmost derivation** to verify the string and applies **reductions** (replacing a handle by a non-terminal) during the process.

**Context-Free Grammars**

Inherently recursive structures of a programming language are defined by a CFG. A CFG consists of:

- A start symbol (one of the non-terminals).
- A finite set of terminals (in our case, the set of tokens).
- A finite set of non-terminals (syntactic variables).
- A finite set of production rules of the form A → α, where A is a non-terminal and α is a string of terminals and non-terminals (possibly the empty string).

**Parse Trees**

A parse tree is a graphical representation of a derivation that filters out the order in which non-terminals are chosen for rewriting. The root node represents the start symbol, the inner nodes are labeled with non-terminal symbols, and the leaves are terminals.

**Ambiguity**

A grammar that produces more than one parse tree for some sentence is called an **ambiguous grammar**. An **unambiguous grammar** yields a unique parse tree for each sentence.

Ambiguity elimination:

- Ambiguity is problematic because the meaning of programs can be incorrect.
- Ambiguity can be handled in several ways:
  - Enforce associativity and precedence
  - Rewrite the grammar (cleanest way)
- There are no general techniques for handling ambiguity.
- It is impossible to automatically convert an ambiguous grammar to an unambiguous one.

**Left Recursion**

A grammar is left recursive if it has a non-terminal A such that there is a derivation

A ⇒⁺ Aα for some string α (in one or more steps).

The left recursion may appear in a single step of the derivation (immediate left recursion) or may appear in more than one step of the derivation.

A top-down parser with a production A → Aα may loop forever.

From the grammar A → A α | b left recursion may be eliminated by transforming the grammar to

A → b R

R → α R | ε
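The transformation above can be sketched mechanically. Below is a minimal Python sketch (the function and symbol names are my own, not from the text) that eliminates immediate left recursion for a single non-terminal:

```python
def eliminate_immediate_left_recursion(nonterminal, productions):
    """Split the productions of A into left-recursive (A -> A alpha)
    and non-left-recursive (A -> beta) alternatives, then rewrite as
        A  -> beta A'
        A' -> alpha A' | eps
    Productions are tuples of symbols; 'eps' denotes the empty string."""
    new_nt = nonterminal + "'"
    alphas = [p[1:] for p in productions if p and p[0] == nonterminal]
    betas = [p for p in productions if not p or p[0] != nonterminal]
    a_rules = [tuple(b) + (new_nt,) for b in betas]
    a_prime_rules = [tuple(a) + (new_nt,) for a in alphas] + [("eps",)]
    return {nonterminal: a_rules, new_nt: a_prime_rules}

# Example from the text: A -> A a | b  becomes  A -> b A',  A' -> a A' | eps
print(eliminate_immediate_left_recursion("A", [("A", "a"), ("b",)]))
```

Running it on the grammar A → Aα | b from the text reproduces the transformed grammar A → bA′, A′ → αA′ | ε.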

Left recursion is an issue of concern in top-down parsers. A grammar is left recursive if some non-terminal A eventually derives a sentential form with itself as the leftmost symbol, i.e., A ⇒⁺ Aα for some string α. Such derivations may lead to an infinite loop.

Top-down parsing techniques cannot handle left-recursive grammars, so a left-recursive grammar must be converted into an equivalent grammar that is not left recursive.

**Removal of left recursion**

**In general**

A → Aα_{1} | Aα_{2} | … | Aα_{m} | β_{1} | β_{2} | … | β_{n}

**Transforms to**

A → β_{1}A′ | β_{2}A′ | … | β_{n}A′

A′ → α_{1}A′ | α_{2}A′ | … | α_{m}A′ | ε

**Left Factoring**

A predictive parser (a top-down parser without backtracking) insists that the grammar be left factored: the grammar is rewritten into a new equivalent grammar suitable for predictive parsing.

stmt → if expr then stmt else stmt | if expr then stmt

When we see the token if, we cannot know which production rule to choose to rewrite stmt in the derivation.

In general, for productions of the form

A → αβ_{1} | αβ_{2}

where α is not empty and the first symbols of β_{1} and β_{2} (if they have one) are different, we cannot know, while processing α, whether to expand A to αβ_{1} or to αβ_{2}. But if we rewrite the grammar as

A → αA′

A′ → β_{1} | β_{2}

then we can immediately expand A to αA′.

**The dangling-else problem can be handled by left factoring**

stmt → if expr then stmt else stmt | if expr then stmt

can be transformed to

stmt → if expr then stmt S’

S’ → else stmt | ε

**Top-down Parsing**

There are two main techniques for constructing a top-down parse tree:

- Recursive descent parsing
- Predictive parsing

**Recursive Descent Parsing (Uses Backtracking)**

Backtracking is needed: if a choice of a production rule does not work, we backtrack to try other alternatives. It tries to find the leftmost derivation. It is not efficient.

e.g., if the grammar is

S → aBc

B → bc | b

and the input is abc, the parser first tries B → bc, fails to match the final c, backtracks, and succeeds with B → b.
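The backtracking behaviour on this grammar can be sketched in a few lines of Python (the function names are my own; this is a minimal illustration, not a production parser):

```python
def parse_S(s, i):
    """S -> a B c : return the position after S, or None on failure."""
    if i < len(s) and s[i] == "a":
        for j in parse_B(s, i + 1):        # try each alternative for B
            if j < len(s) and s[j] == "c":
                return j + 1
    return None                            # all alternatives failed

def parse_B(s, i):
    """B -> bc | b : yield every position where a parse of B could end,
    so the caller can backtrack to a later alternative if needed."""
    if s[i:i + 2] == "bc":
        yield i + 2                        # first alternative:  B -> bc
    if s[i:i + 1] == "b":
        yield i + 1                        # second alternative: B -> b

def accepts(s):
    return parse_S(s, 0) == len(s)

print(accepts("abc"))    # True: B -> bc fails (no trailing c), so backtrack to B -> b
print(accepts("abcc"))   # True via B -> bc
```

For "abc" the generator first yields the B → bc parse, the caller fails to find the closing c, and the loop falls through to the B → b alternative, which is exactly the backtracking step described above.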

**Predictive Parser**

- A non-recursive top-down parsing method
- The parser "predicts" which production to use
- It removes backtracking by fixing one production for every non-terminal and input token(s)
- Predictive parsers accept LL(k) languages:
  - The first L stands for left-to-right scan of the input
  - The second L stands for leftmost derivation
  - k stands for the number of lookahead tokens
- In practice, LL(1) is used

**Functions used in Constructing LL (1) Parsing Tables**

- Two functions are used in the construction of LL(1) parsing tables: FIRST and FOLLOW.
- FIRST(α) is the set of terminal symbols that occur as first symbols in strings derived from α, where α is any string of grammar symbols. If α derives ε, then ε is also in FIRST(α).
- FOLLOW(A) is the set of terminals that occur immediately after the non-terminal A in strings derived from the start symbol.
- FIRST sets are computed for all non-terminals, but FOLLOW sets are needed only for those non-terminals whose FIRST set contains ε.
- For every terminal x (other than ε) in FIRST(X), there is an entry in the LL(1) table: the production from which x is derived.
- For every terminal y in FOLLOW(Y), there is an entry in the table (the ε-production) when ε is in FIRST(Y).

**To Compute FIRST of any String X**

- If X is a terminal symbol, then FIRST(X) = {X}.
- If X is a non-terminal symbol and X → ε is a production rule, then ε is in FIRST(X).
- If X is a non-terminal symbol and X → Y_{1}Y_{2}…Y_{n} is a production rule: a terminal a is in FIRST(X) if a is in FIRST(Y_{i}) and ε is in FIRST(Y_{j}) for all j = 1, …, i−1; and ε is in FIRST(X) if ε is in FIRST(Y_{j}) for all j = 1, …, n.
- If X is ε, then FIRST(X) = {ε}.
- The same rules compute FIRST of a string X = Y_{1}Y_{2}…Y_{n}.

**Example:**

For the expression grammar

E → T E’

E’ → +T E’ | ε

T → F T’

T’ → * F T’ | ε

F → ( E ) | id

First(E) = First(T) = First(F) = { (, id }

First(E’) = {+, ε }

First(T’) = { *, ε }

**To compute FOLLOW (for Non-terminals):**

- If S is the start symbol, then $ is in FOLLOW(S).
- If A → αBβ is a production rule, then everything in FIRST(β) except ε is in FOLLOW(B).
- If A → αB is a production rule, or A → αBβ is a production rule and ε is in FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).
- Apply these rules until nothing more can be added to any FOLLOW set.
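These fixed-point rules translate directly into a short program. Below is a Python sketch for the expression grammar of the example above (helper names like `first_of_string` are my own; "eps" stands for ε and "$" is the end-of-input marker):

```python
# Fixed-point computation of FIRST and FOLLOW for the expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(GRAMMAR)

def first_of_string(symbols, first):
    """FIRST of a sentential form Y1 Y2 ... Yn."""
    result = set()
    for y in symbols:
        f = first[y] if y in NONTERMS else {y}
        result |= f - {"eps"}
        if "eps" not in f:
            return result          # Yi cannot vanish: stop here
    result.add("eps")              # every Yi can derive eps
    return result

def compute_first():
    first = {a: set() for a in NONTERMS}
    changed = True
    while changed:                 # iterate until nothing more can be added
        changed = False
        for a, prods in GRAMMAR.items():
            for prod in prods:
                f = first_of_string(prod, first)
                if not f <= first[a]:
                    first[a] |= f
                    changed = True
    return first

def compute_follow(first):
    follow = {a: set() for a in NONTERMS}
    follow["E"].add("$")           # rule 1: $ is in FOLLOW(start symbol)
    changed = True
    while changed:
        changed = False
        for a, prods in GRAMMAR.items():
            for prod in prods:
                for i, b in enumerate(prod):
                    if b not in NONTERMS:
                        continue
                    f = first_of_string(prod[i + 1:], first)
                    # rule 2: FIRST(beta) - {eps}; rule 3: FOLLOW(A) if beta => eps
                    add = (f - {"eps"}) | (follow[a] if "eps" in f else set())
                    if not add <= follow[b]:
                        follow[b] |= add
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST["E"])    # FIRST(E)  = { (, id }
print(FOLLOW["F"])   # FOLLOW(F) = { +, *, ), $ }
```

The computed sets match the example: FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E′) = {+, ε}, FIRST(T′) = {*, ε}, and FOLLOW(E) = FOLLOW(E′) = {), $}.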

**LL(1) Parsing algorithm**

- The parsing table is a two-dimensional array M [X,a] , where X is a non-terminal, and a is a terminal or the symbol $.
- The parser considers ‘X’ the symbol on top of stack, and ‘a’ the current input symbol
- These two symbols determine the action to be taken by the parser
- Assume that ‘$’ is a special token that is at the bottom of the stack and terminates the input string.

- If X = a = $, the parser halts and announces successful completion of parsing.
- If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
- If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This entry will be either an X-production of the grammar or an error entry. If, for example, M[X,a] = {X → UVW}, the parser replaces X on top of the stack by UVW (with U on the top). If M[X,a] = error, the parser calls an error recovery routine.

**Example: **Consider the grammar

E → T E’

E’ → +T E’ | ε

T → F T’

T’ → * F T’ | ε

F → ( E ) | id

The parse table for this grammar has one entry for each (non-terminal, input symbol) pair, giving the production to apply.

Using the grammar, its parsing table, and the parsing algorithm above, we can verify the string "id + id * id".
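As a sketch, the table entries can be written out by hand and driven by the stack algorithm described above. The table below is my reconstruction from the FIRST and FOLLOW sets, not a reproduction of the original figure; `[]` encodes an ε-production:

```python
# A minimal LL(1) driver for the expression grammar above.
# TABLE[X][a] is the production body to expand when non-terminal X is on
# top of the stack and a is the current input symbol.
TABLE = {
    "E":  {"(": ["T", "E'"], "id": ["T", "E'"]},
    "E'": {"+": ["+", "T", "E'"], ")": [], "$": []},
    "T":  {"(": ["F", "T'"], "id": ["F", "T'"]},
    "T'": {"+": [], "*": ["*", "F", "T'"], ")": [], "$": []},
    "F":  {"(": ["(", "E", ")"], "id": ["id"]},
}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]            # $ at the bottom, start symbol on top
    i = 0
    while stack:
        x, a = stack.pop(), tokens[i]
        if x == a == "$":
            return True           # accept: successful completion
        if x == a:
            i += 1                # match: pop the terminal, advance input
        elif x in TABLE:
            if a not in TABLE[x]:
                return False      # error entry in the table
            stack.extend(reversed(TABLE[x][a]))  # expand: push body reversed
        else:
            return False          # terminal mismatch
    return False

print(ll1_parse(["id", "+", "id", "*", "id"]))   # True
```

Pushing the production body in reverse keeps its leftmost symbol on top of the stack, exactly as the algorithm (replace X by UVW with U on top) requires.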

**Bottom-up Parsing Techniques**

A bottom-up parser creates the parse tree of the given input string from the leaves towards the root. It tries to find the rightmost derivation of the given input in reverse order.

Bottom-up parsing is also known as shift reduce parsing.

- A more powerful parsing technique
- LR grammars – more expensive than LL
- Can handle left recursive grammars
- Can handle virtually all the programming languages
- Natural expression of programming language syntax
- Automatic generation of parsers (Yacc, Bison etc.)
- Detects errors as soon as possible
- Allows better error recovery

**Shift Reduce Parsing: **A shift reduce parser tries to reduce the given input string into the starting symbol. At each reduction step, a substring of the input matching to the right side of a production rule is replaced by the non-terminal at the left side of that production rule.

**Handle:** A handle of a string is a substring that matches the right side of a production rule, and whose reduction to the non-terminal on the left side represents one step of the reverse of a rightmost derivation.

- Handles always appear at the top of the stack and never inside it.
- This makes stack a suitable data structure.

**Actions: **There are four possible actions of a shift reduce parser

- **Shift:** The next input symbol is shifted onto the top of the stack.
- **Reduce:** Replace the handle on the top of the stack by the corresponding non-terminal.
- **Accept:** Successful completion of parsing.
- **Error:** The parser discovers a syntax error and calls an error recovery routine.

**Conflicts During Shift Reduce Parsing**

There are CFGs for which a shift reduce parser cannot be used: the stack contents and the next input symbol may not be enough to decide the action.

The general shift-reduce technique is:

- If there is no handle on the stack, then shift.
- If there is a handle, then reduce.

However, what happens when there is a choice? What action should be taken when both a shift and a reduce are valid?

**Shift/Reduce Conflict: **The parser can't decide whether to make a shift operation or a reduction.

**Reduce/Reduce Conflict: **The parser can’t decide which of several reductions to make.

**Types of Shift Reduce Parsing: **There are two main categories of shift reduce parsers

**Operator Precedence Parser: **Simple, but supports only a small class of grammars.

**LR Parsers:**

- LR parsers accept LR(k) languages:
  - L stands for left-to-right scan of the input
  - R stands for rightmost derivation
  - k stands for the number of lookahead tokens

**Types of LR Parsers:**

- SLR (Simple) LR parser
- CLR (general) LR parser (canonical LR)
- LALR (Intermediate) LR parser (look-ahead LR)

SLR, CLR, and LALR parsers work in the same way, but their parsing tables may differ.

Relative power of the various classes:

- SLR(1) < LALR(1) < LR(1)
- SLR(k) < LALR(k) < LR(k)
- LL(k) < LR(k)

**LR parsing: **LR parsing is the most general non-backtracking shift-reduce parsing method. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers:

LL(1) grammars ⊂ LR(1) grammars

An LR parser can detect a syntactic error as soon as it is possible.

A configuration of an LR parser is

(S_{0} X_{1} S_{1} … X_{m} S_{m}, a_{i} a_{i+1} … a_{n} $)

where the first component is the stack and the second is the rest of the input.

- S_{m} and a_{i} decide the parser action by consulting the parsing action table (the initial stack contains just S_{0}).
- A configuration of an LR parser represents the right-sentential form

X_{1} … X_{m} a_{i} a_{i+1} … a_{n} $

**LR Parser Actions**

**Shift S:** Shift the next input symbol and the state S onto the stack:

(S_{0} X_{1} S_{1} … X_{m} S_{m}, a_{i} a_{i+1} … a_{n} $) → (S_{0} X_{1} S_{1} … X_{m} S_{m} a_{i} S, a_{i+1} … a_{n} $)

**Reduce A → β:** Pop 2|β| items from the stack (with β = Y_{1} Y_{2} … Y_{r}, so r = |β|); then push A and the state S = goto[S_{m−r}, A]:

(S_{0} X_{1} S_{1} … X_{m} S_{m}, a_{i} a_{i+1} … a_{n} $) → (S_{0} X_{1} S_{1} … X_{m−r} S_{m−r} A S, a_{i} a_{i+1} … a_{n} $)

**Accept:** Parsing successfully completed.

**Error:** Parser detected an error (an empty entry in the action table).

**Example:**

Consider the grammar

E → E + T | T

T → T * F | F

F → ( E ) | id

and its parse table. Parse id + id * id using the given grammar and a bottom-up parsing table.

**Answer:**
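The referenced table is not reproduced here, but the standard SLR(1) table for this grammar can be hard-coded and driven by a few lines of Python. This is a sketch: the state numbering follows the usual item-set construction for this grammar, and the helper names are my own.

```python
# Hand-coded SLR(1) ACTION/GOTO tables for E -> E+T | T, T -> T*F | F,
# F -> (E) | id, plus the stack-based LR driver described in the text.
# ("s", n) = shift state n; ("r", n) = reduce by production n; ("acc",) = accept.
ACTION = {
    0:  {"id": ("s", 5), "(": ("s", 4)},
    1:  {"+": ("s", 6), "$": ("acc",)},
    2:  {"+": ("r", 2), "*": ("s", 7), ")": ("r", 2), "$": ("r", 2)},
    3:  {"+": ("r", 4), "*": ("r", 4), ")": ("r", 4), "$": ("r", 4)},
    4:  {"id": ("s", 5), "(": ("s", 4)},
    5:  {"+": ("r", 6), "*": ("r", 6), ")": ("r", 6), "$": ("r", 6)},
    6:  {"id": ("s", 5), "(": ("s", 4)},
    7:  {"id": ("s", 5), "(": ("s", 4)},
    8:  {"+": ("s", 6), ")": ("s", 11)},
    9:  {"+": ("r", 1), "*": ("s", 7), ")": ("r", 1), "$": ("r", 1)},
    10: {"+": ("r", 3), "*": ("r", 3), ")": ("r", 3), "$": ("r", 3)},
    11: {"+": ("r", 5), "*": ("r", 5), ")": ("r", 5), "$": ("r", 5)},
}
GOTO = {0: {"E": 1, "T": 2, "F": 3}, 4: {"E": 8, "T": 2, "F": 3},
        6: {"T": 9, "F": 3}, 7: {"F": 10}}
# production number -> (left-hand side, length of right-hand side)
RULES = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

def lr_parse(tokens):
    tokens = tokens + ["$"]
    stack, i = [0], 0                 # stack of states; S0 at the bottom
    while True:
        act = ACTION[stack[-1]].get(tokens[i])
        if act is None:
            return False              # empty entry in the action table: error
        if act[0] == "acc":
            return True               # accept
        if act[0] == "s":
            stack.append(act[1])      # shift: push state, consume token
            i += 1
        else:                         # reduce A -> beta: pop |beta| states,
            lhs, size = RULES[act[1]]
            del stack[-size:]
            stack.append(GOTO[stack[-1]][lhs])  # then push goto[top, A]

print(lr_parse(["id", "+", "id", "*", "id"]))   # True
```

For simplicity the stack holds only states (the grammar symbols are implied by them), so a reduce pops |β| entries rather than the 2|β| symbol/state pairs of the textbook formulation; the sequence of shifts and reduces on id + id * id is otherwise exactly the reverse rightmost derivation shown earlier.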

**Operator Precedence Parsing**

In an operator grammar, no production rule can have ε on the right side or two adjacent non-terminals on the right side.

**Precedence Relations: **In operator precedence parsing, we define three disjoint precedence relations between certain pairs of terminals:

- a ⋖ b: b has higher precedence than a.
- a ≐ b: b has the same precedence as a.
- a ⋗ b: b has lower precedence than a.

The determination of the correct precedence relations between terminals is based on the traditional notions of associativity and precedence of operators.
