Tuesday, October 20, 2009

A simple lexical analyzer demo in Haskell

Do get some better Haskell practice, I developed a small lexical analyzer in Haskell. This uses basic Haskell functionality and doesn't use sophisticated libraries like Parsec.

Here is the output of this effort:

This script implements a simple lexical analysis program
based on the example in Modern Compiler Design Book Section 2.1


The lexical analyzer supports following tokens.

Operators: +, -, * and /
Separators: ; , {} ( and )
Integers (a string of digits)
Identifiers: first character must be a letter
other characters can be letter or digit or underscore
two underscores consecutively are not allowed
last character cannot be an underscore
EOF token : represents end of file

Also supported are:

Comments: They start with # and end with #
Layout characters: all characters less than equal to ' ' are
considered as layout characters


Some specific limitations of this implemenation are:
- Its a limited form of recursive descent parser. Based on
the first character in the token, we assume the token type
and continue to read the rest of token.
- It doesn't track the source code position. This would actually
require some form of monad since the source code position tracking
is a sort of side effect to the lexical analysis.
- The lexical analyzer doesn't support too many debugging facilities.
- Lack of unit tests at this point of time.


{- following data type defines the tokens supported by this lexer
the C version in the MCD book uses #defines for different token
classes. Haskell version is much more cleaner with proper data
constructors attached to each token class and automatically
generated show and eq functions.
import System.IO

-- a token can be
data Token = EOF | -- end of file
Identifier String | -- an identifier string
IntegerToken Integer | -- an integer
Operator Char | -- an operator
Separator Char | -- a separator
Error String -- or some erroneous token
deriving (Show, Eq)

-- lexer reads a string and returns a list of tokens
lexer :: String->[Token]
{- we need to remove comments and other layout characters
before reading a token. This is an example of conducting
some preprocessing before doing an actual operation.
Such a situation should be implemented using two sister
functions (here lexer and lexer').
lexer does the job of removing layout and comments
lexer' does the job of reading a token
lexer calls lexer' (for remaining data)
lexer' calls back lexer (for remaining data)

This kind of strucutre eliminates complicated if-then-else logic.
lexer xs = let s = skipLayoutAndComment xs
in lexer' s

-- for an empty string, lex returns an empty list
lexer' [] = [EOF]
-- for a non-empty string we read the first token
-- then we let lexer tokenize the rest of string
lexer' xs = let (t, rest)= nextToken xs
in (t:lexer rest)

-- next token reads the input string and returns the next token and
-- rest of string
nextToken:: String->(Token, String)

nextToken [] = (EOF, [])
{- The implementation is a five way branch based on the first character.
If the first character is a letter, we read an identifier.
If the first character is a digit, we read a number.
If the first character is an operator, we simply read an operator token.
If the first character is a separator, we simply read a separator token.
Otherwise, we consider this as an erroneous character and generate
an error token for this character.

The lexical analyzer shouldn't stop analyzing the file at any point
of time. If it doesn't recognize any character or a string of characters
in the file, then it should simply generate an Error token at that point
of time and proceed further with reading next tokens.

It should be the job of parser to identify real syntax errors in the
code and report the user accordingly.
nextToken l@(x:xs) | isLetter x = let (first, rest) = readIdentfier l
in (Identifier first, rest)
| isNumDigit x = let (first, rest) = readInteger l
in (IntegerToken first, rest)
| isOperator x = (Operator x, xs)
| isSeparator x = (Separator x, xs)
| otherwise = (Error [x], xs)

-- reads an integer and returns the token plus rest of string
readInteger xs = let (first, rest) = span isNumDigit xs in
(read first :: Integer, rest)

An identifier consists of firstCharacter and restCharacter*
firstCharacter can be a letter
restCharacter can be letter or digit or underscore
more than one underscores consecutively are not allowed
this should be called only if first character of the input
list is a letter.

Here we assume that the first character is a letter. This
is actually ensured by the nextToken function defined above.
We thus focus on reading rest of identfier.
readIdentfier [] = ([], [])
readIdentfier l@(x:xs) = let (first, rest) = readLettersOrDigits xs
(first', rest') = underscoreTails rest
identifier = (x:first) ++ first'
in (identifier, rest')

readLettersOrDigits :: String->(String, String)
Reads an input string and splits it in two parts. The
first part (possibly empty) consists of a sequence of letter
or digit characters.

The 2nd part contains the rest of input string.
readLettersOrDigits xs = span isLetterOrDigit xs

underscoreTail :: String->(String, String)
Reads an undscore tail part of an identifier
Following rules must be observed:
- the tail starts with an underscore
- the underscore must be followed by a letter or a digit
- following this, there may be 0 or more letters or digits
-- empty list would result in empty underscore tail
underscoreTail [] = ([], [])
-- at least two characters are required in undescore tail
underscoreTail [x] = ([], [x])
-- the first character must be an underscore
-- 2nd character must be a letter or digit
-- following characters (if any) must be letter or digit
underscoreTail l@(x:y:xs) = if isUnderScore x && isLetterOrDigit y
-- lets read the rest of letters/digits of this
-- tail
then let (first,rest) = readLettersOrDigits xs
in (x:y:first, rest)
-- there is no underscoreTail here
else ("", l)

underscoreTails :: String->(String, String)
-- reads 0 or more occurances of underscoreTail
underscoreTails xs = let -- lets try to read the first underscoreTail
(first, rest) = underscoreTail xs in
if null first
-- if first tail is absent, then we just return
then (first, rest)
-- otherwise we read rest of the tail recursively
else let (first', rest') = underscoreTails rest
-- we join the first tail and rest of the tail
-- we return this along with unread string
in (first ++ first' , rest')

-- skip layout and comment
skipLayoutAndComment = skipLayout . skipComment

-- skips layout
skipLayout [] = []
skipLayout l@(x:xs) = case isLayout x of
True -> skipLayout xs
False -> l
skipComment :: String->String
{- skips comment
The first character must be #.
Following this, we need to keep skipping till we find
another #.
skipComment [] = []
skipComment l@(x:xs) = case isCommentStart x of
True -> -- we need to find the end of comment
skipTillCommentEnd xs
False -> l

-- skips till the end of comment character is found
-- this should be called from skip comment function
-- this routine is perfect for recursive tail optimization
skipTillCommentEnd [] = []
skipTillCommentEnd l@(x:xs) = case isCommentEnd x of
True-> xs
False-> skipTillCommentEnd xs

-- character level tests
-- a layout character
isLayout ch = ch <= ' '
-- comment start character
isCommentStart ch = ch == '#'
-- comment end character
isCommentEnd ch = ch == '#'
-- upper case letter
isUpperCase ch = 'A' <= ch && ch <= 'Z'
-- lower case letter
isLowerCase ch = 'a' <= ch && ch <= 'z'
-- a letter
isLetter ch = isLowerCase ch || isUpperCase ch
-- a digit
isNumDigit ch = '0' <= ch && ch <= '9'
-- a letter or a digit
isLetterOrDigit ch = isLetter ch || isNumDigit ch
-- an underscore
isUnderScore ch = ch == '_'
-- an operator
isOperator ch = ch `elem` "+-*/"
-- a separator
isSeparator ch = ch `elem` ";,(){}"

main :: IO ()
main = do
putStrLn "Simple Lexical Analyzer v0.1"

mainloop :: IO ()
mainloop = do
putStr ">>> "
hFlush stdout
inputStr <- getLine
case inputStr of
"q" -> return ()
otherwise ->
let tokens = lexer inputStr
putStrLn $ show tokens

A typical session for this program looks like follows:

runghc MCDLex.hs
Simple Lexical Analyzer v0.1
>>> abc
[Identifier "abc",EOF]
>>> 2344
[IntegerToken 2344,EOF]
>>> +
[Operator '+',EOF]
>>> +-/*
[Operator '+',Operator '-',Operator '/',Operator '*',EOF]
>>> (){},;
[Separator '(',Separator ')',Separator '{',Separator '}',Separator ',',Separator ';',EOF]
>>> #thiscommentwillbeignored" todo_list
>>> #thiscommentwillbeignored# todo_list
[Identifier "todo_list",EOF]
>>> (3+4)
[Separator '(',IntegerToken 3,Operator '+',IntegerToken 4,Separator ')',EOF]
>>> errors__
[Identifier "errors",Error "_",Error "_",EOF]
>>> 23___45
[IntegerToken 23,Error "_",Error "_",Error "_",IntegerToken 45,EOF]

Note: I tried to use the "prettyprint" javascript support to beutify this code. But it seems like the pretty printer is not able to parse haskell properly. Hence I am removing it and showing the code in plain pre tag.

1 comment:

Perifort said...

so good an example!thx~