Lexer

A basic compiler consists of four major parts: the lexical analyzer, the syntax analyzer, the semantic analyzer, and the code generator.

In the winter of 2020, I wrote a compiler as a school project, and I would like to share briefly how I built its first part, the lexical analyzer.

The job of the lexical analyzer is to extract all the valid tokens from a sequence of characters and pass them to the parser, which builds the abstract syntax tree. Before we dive into the more interesting things, there are some terms I use frequently throughout this article which I'd like to define.

Definitions

Token

A token is an object that holds the name, type, and location of the string that was matched by the tokenizer.

Tokenizer

A tokenizer finds the right match for a sequence of characters using regular expressions.

Token Type

A token type is an enum that indicates what kind of string was matched, for example an identifier or a keyword.
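
To make these definitions concrete, here is a minimal sketch of how a token type and a token could be modeled in C++. The names below are illustrative, not the exact ones from my project:

#include <string>

// the kinds of tokens our small language needs
enum class TokenType {
  Id,
  Keyword,
  Integer,
  Punctuation
};

// a matched piece of the input, plus where it was found
struct Token {
  std::string name;  // the matched string, e.g. "price"
  TokenType type;    // what kind of token it is
  int line;          // line number in the source file
};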

Keyword

Keywords vary from one programming language to another, but to keep things short and simple, let's assume this small set of keywords for our language:

main,
integer,
local,
do,
end
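
Since keywords look exactly like identifiers, a common trick is to match a keyword as an id first and then check the matched string against a set of reserved words. A minimal sketch, reusing the hypothetical TokenType from above:

#include <string>
#include <unordered_set>

// the reserved words of our language
const std::unordered_set<std::string> keywords = {
  "main", "integer", "local", "do", "end"
};

// decide whether a matched id is actually a keyword
TokenType classify(const std::string &name) {
  return keywords.count(name) ? TokenType::Keyword : TokenType::Id;
}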

Before we talk about each step, consider the following text as our running example, a valid program written in our language:

main
  local
    integer price;
  do
    price = 12;
  end

Step 1 (Read the content of the file)

First, we read the entire file and store it as a string. In C++ we would have something like this:

void Lexer::read(std::string &fileName) {
  std::string fileContent;
  std::string line;
  std::ifstream stream;
  stream.open(fileName, std::ios::in);

  //read until the end of the file
  //and append each line to fileContent;
  //looping on getline instead of eof()
  //avoids appending a spurious empty line
  while (std::getline(stream, line)) {
    fileContent += line + "\n";
  }
   ....  
}

We would then have the following text as the value of fileContent:

main\n\tlocal\n\t\tinteger price;\n\tdo\n\t\tprice = 12;\n\tend

Step 2 (Write regex)

Now we need to find the right tokens in fileContent, but how? You could build a DFA to detect each token, but the easier way is regex matching. Regular expressions vary from lexer to lexer, but for the sake of simplicity, consider the following lexical elements for our language:

  id = letter alphanum*
  alphanum = letter | digit | _
  integer = nonzero digit* | 0
  letter = a..z | A..Z
  digit = 0..9
  nonzero = 1..9
  
  //punctuation and reserved words:
  main, do, end, integer, local
  = ;

So in order to detect an id token, we need to match against:

^(([a-z]|[A-Z])([a-z]|[A-Z]|[0-9]|_)*)

Explanation:
[a-z]|[A-Z] matches letter
([a-z]|[A-Z]|[0-9]|_)* matches alphanum*
^ matches only at the start of the string

or to detect an integer token, we should match against:

^([1-9][0-9]*|0)

Explanation:
[1-9] matches nonzero
[0-9]* matches digit*
^ matches only at the start of the string

and to detect a punctuation token, we should match against:

^(=|;)
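
In C++, these patterns can be tried with the standard <regex> library. A minimal sketch (the variable names are illustrative); since each pattern is anchored with ^, std::regex_search either matches at the very start of the input or not at all:

#include <iostream>
#include <regex>
#include <string>

int main() {
  const std::regex idRegex("^(([a-z]|[A-Z])([a-z]|[A-Z]|[0-9]|_)*)");
  const std::regex intRegex("^([1-9][0-9]*|0)");
  const std::regex punctRegex("^(=|;)");

  std::string input = "price = 12;";
  std::smatch match;
  if (std::regex_search(input, match, idRegex)) {
    std::cout << "id: " << match[0] << "\n";  // prints "id: price"
  }
  return 0;
}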

Step 3 (Pass fileContent to tokenizer to find the first matched token)

As the last step, fileContent needs to be passed to the tokenizer to be matched against the regular expressions written in step 2. Every time the tokenizer returns a token, fileContent is trimmed by the length of that token. Let's try it on our example to understand this better. From step 1 we know that fileContent looks like this:

main\n\tlocal\n\t\tinteger price;\n\tdo\n\t\tprice = 12;\n\tend

After passing it to the tokenizer, the tokenizer returns main as a token (matched by the id regex, then recognized as a keyword), and the lexer trims fileContent by 4 (the length of main). Now fileContent looks like this:

//the lexer ignores `\n`, `\t` and ' ' (space)

//second time
local\n\t\tinteger price;\n\tdo\n\t\tprice = 12;\n\tend

//third time
integer price;\n\tdo\n\t\tprice = 12;\n\tend

//fourth time
price;\n\tdo\n\t\tprice = 12;\n\tend

//fifth time
;\n\tdo\n\t\tprice = 12;\n\tend

This process continues until fileContent is empty. At the end, our list of tokens looks like this:

//[token name, type, line number]
[main, keyword(id), 1]
[local, keyword(id), 2]
[integer, keyword(id), 3], [price, id, 3], [;, punctuation, 3]
[do, keyword(id), 4]
[price, id, 5], [=, punctuation, 5], [12, _Integer, 5]
[end, keyword(id), 6]
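
Putting the three steps together, the core loop of the lexer might look like the sketch below. This is a simplification, not the exact code from my project; it reuses the hypothetical Token, TokenType, and classify definitions from the earlier sketches:

#include <regex>
#include <string>
#include <vector>

std::vector<Token> tokenize(std::string fileContent) {
  const std::regex idRegex("^(([a-z]|[A-Z])([a-z]|[A-Z]|[0-9]|_)*)");
  const std::regex intRegex("^([1-9][0-9]*|0)");
  const std::regex punctRegex("^(=|;)");

  std::vector<Token> tokens;
  std::smatch match;
  int line = 1;

  while (!fileContent.empty()) {
    char c = fileContent[0];
    if (c == '\n' || c == '\t' || c == ' ') {
      // the lexer ignores whitespace but counts lines
      if (c == '\n') line++;
      fileContent.erase(0, 1);
    } else if (std::regex_search(fileContent, match, idRegex)) {
      // an id might be a reserved word, so classify it
      tokens.push_back({match[0], classify(match[0]), line});
      fileContent.erase(0, match[0].length());
    } else if (std::regex_search(fileContent, match, intRegex)) {
      tokens.push_back({match[0], TokenType::Integer, line});
      fileContent.erase(0, match[0].length());
    } else if (std::regex_search(fileContent, match, punctRegex)) {
      tokens.push_back({match[0], TokenType::Punctuation, line});
      fileContent.erase(0, match[0].length());
    } else {
      // no pattern matched: skip the character
      // (a real lexer would report an error token here)
      fileContent.erase(0, 1);
    }
  }
  return tokens;
}

Calling tokenize on the fileContent from step 1 produces the token list above.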

This was a brief explanation of one way to write a lexer; to see the full source code, please check here.

Please reach out to me on social media if you have any questions or suggestions.

Thanks for reading!