Study Notes on Lexical Analysis
By BYJU'S Exam Prep
Updated on: September 25th, 2023
The lexical analyzer reads the source program character by character and returns the tokens of the source program. It also enters information about identifiers into the symbol table.
The Role of Lexical Analyzer:
- It is the first phase of a compiler
- It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
- It can work either as a separate module (a pass of its own) or as a submodule that the parser calls whenever it needs the next token.
- Lexical Analyzer is also responsible for eliminating comments and white spaces from the source program.
- It also generates lexical errors.
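The responsibilities listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production scanner: the function name `tokenize`, the token names `ID`, `NUM` and `SYM`, the `//` comment syntax and the set of special symbols are all assumptions chosen for the example.

```python
def tokenize(source):
    """Scan `source` one character at a time and return (tokens, errors)."""
    tokens, errors = [], []
    i, line = 0, 1
    while i < len(source):
        ch = source[i]
        if ch == "\n":                      # track line numbers for error messages
            line += 1
            i += 1
        elif ch in " \t\r":                 # white space is discarded
            i += 1
        elif source.startswith("//", i):    # comment (assumed syntax): skip to end of line
            while i < len(source) and source[i] != "\n":
                i += 1
        elif ch.isalpha() or ch == "_":     # identifier: letter, then letters or digits
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("ID", source[start:i]))
        elif ch.isdigit():                  # integer constant
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("NUM", source[start:i]))
        elif ch in "+-*/=();":              # single-character special symbols
            tokens.append(("SYM", ch))
            i += 1
        else:                               # anything else is a lexical error
            errors.append(f"line {line}: unexpected character {ch!r}")
            i += 1
    return tokens, errors
```

For the input `"x = 42 // set x\ny = x + @1"`, this sketch drops the comment and the blanks, returns eight tokens, and reports one lexical error for the `@` on line 2.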
Tokens, Lexemes and Patterns
- A token names a class of character sequences that have the same meaning in the source program, such as identifiers, operators, keywords, numbers, delimiters and so on. A token may have a single attribute that holds the required information for that token. For identifiers, this attribute is a pointer to the symbol table, and the symbol table entry holds the actual attributes of that identifier.
- A token type together with its attribute uniquely identifies a lexeme.
- Regular expressions are widely used to specify patterns.
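The idea of an identifier token carrying a pointer into the symbol table can be illustrated as follows. This is a sketch under assumptions: the class `SymbolTable`, its method `intern`, and the use of a list index as the "pointer" are hypothetical choices made for the example.

```python
class SymbolTable:
    """Stores one entry per distinct identifier."""

    def __init__(self):
        self.entries = []   # one dict of attributes per identifier
        self.index = {}     # lexeme -> position in `entries`

    def intern(self, lexeme):
        """Return the entry index for `lexeme`, creating the entry if needed."""
        if lexeme not in self.index:
            self.index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme})
        return self.index[lexeme]


table = SymbolTable()
# Each identifier token is a pair (token type, attribute); the attribute is
# the index of the identifier's symbol table entry.
tokens = [("ID", table.intern(name)) for name in ["count", "total", "count"]]
```

Both occurrences of `count` yield the same pair `("ID", 0)`, and looking up `table.entries[0]["lexeme"]` recovers the lexeme, which is the sense in which the token type and attribute identify it.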
Tokens, Patterns and Lexemes
- Pattern: A letter followed by zero or more letters or digits, excluding keywords.
- Token: ID
Lexeme: If | Then | Else
- Pattern: If | Then | Else
- Token: IF | THEN | ELSE
Lexeme: 123.45
- Pattern: One or more digits, optionally followed by a fraction and/or an exponent.
- Token: NUM
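The three patterns above can be written as regular expressions. The sketch below uses Python's standard `re` module; the names `KEYWORDS`, `ID_PATTERN`, `NUM_PATTERN` and `classify` are assumptions made for illustration, and the exact character classes (ASCII letters and digits) are one reasonable reading of the patterns.

```python
import re

# Keyword lexemes mapped to their token names.
KEYWORDS = {"If": "IF", "Then": "THEN", "Else": "ELSE"}

# A letter followed by zero or more letters or digits.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Digits, then an optional fraction, then an optional exponent.
NUM_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def classify(lexeme):
    """Return the token name for `lexeme`, or None if no pattern matches."""
    if lexeme in KEYWORDS:                 # keywords take priority over ID
        return KEYWORDS[lexeme]
    if ID_PATTERN.fullmatch(lexeme):
        return "ID"
    if NUM_PATTERN.fullmatch(lexeme):
        return "NUM"
    return None
```

Checking the keyword table before the identifier pattern is what implements the "but not a keyword" part of the ID pattern: `classify("If")` gives `"IF"`, while `classify("rate2")` gives `"ID"` and `classify("123.45")` gives `"NUM"`.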
Counting the Number of Tokens:
A token is usually represented by an integer that encodes the kind of token, possibly together with an attribute that holds the value of the token. For example, most programming languages have the following kinds of tokens.
- Identifiers (x, y, average, etc.)
- Reserved words or keywords (if, else, while, etc.)
- Integer constants (42, 0xFF, 0177 etc.)
- Floating point constants (5.6, 3.6e8, etc.)
- String constants ("hello there\n", etc.)
- Character constants ('a', 'b', etc.)
- Special symbols (( ) : := + - etc.)
- Comments (To be ignored.)
- Compiler directives (Directives to include files, define macros, etc.)
- Line information (We might need to detect newline characters as tokens, if they are syntactically important. We must also increment the line count, so that we can indicate the line number for error messages.)
- White space (Blanks and tabs that are used to separate tokens, but otherwise are not important).
- End of file
As far as the parser is concerned, each reserved word and each special symbol is a different kind of token. Each kind is distinguished by its own integer code.
Example:
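As a hedged illustration of counting tokens, the sketch below gives every kind of token its own integer code and scans a small statement. The enum `Kind`, its numeric values, the function `scan` and the sample statement are assumptions made for this example; only a few of the token kinds listed above are covered.

```python
import re
from enum import IntEnum


class Kind(IntEnum):
    """One integer per kind of token (values chosen arbitrarily)."""
    ID = 1
    IF = 2
    ELSE = 3
    WHILE = 4
    INT = 5
    ASSIGN = 6    # :=
    COLON = 7
    PLUS = 8
    MINUS = 9
    LPAREN = 10
    RPAREN = 11


KEYWORDS = {"if": Kind.IF, "else": Kind.ELSE, "while": Kind.WHILE}
SYMBOLS = {":=": Kind.ASSIGN, ":": Kind.COLON, "+": Kind.PLUS,
           "-": Kind.MINUS, "(": Kind.LPAREN, ")": Kind.RPAREN}

# White space, or a word, or an integer, or a special symbol (":=" before ":").
TOKEN_RE = re.compile(
    r"\s+|(?P<word>[A-Za-z][A-Za-z0-9]*)|(?P<int>[0-9]+)|(?P<sym>:=|[:+\-()])"
)


def scan(text):
    """Return a list of (kind, attribute) pairs for `text`."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}")
        pos = m.end()
        if m.lastgroup == "word":
            word = m.group()
            tokens.append((KEYWORDS.get(word, Kind.ID), word))
        elif m.lastgroup == "int":
            tokens.append((Kind.INT, int(m.group())))
        elif m.lastgroup == "sym":
            tokens.append((SYMBOLS[m.group()], None))
        # White space matches no named group and produces no token.
    return tokens
```

For the statement `average := (x + y) - 42`, `scan` returns 9 tokens: `average`, `:=`, `(`, `x`, `+`, `y`, `)`, `-` and `42`. The blanks separate tokens but are not counted.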