Programming tidbits: A simple lexical analyzer demo in Haskell

Do get some better Haskell practice, I developed a small lexical analyzer in Haskell. This uses basic Haskell functionality and doesn't use sophisticated libraries like Parsec.

Here is the output of this effort:


{-
This script implements a simple lexical analysis program
based on the example in Modern Compiler Design Book Section 2.1


Tokens
--------

The lexical analyzer supports following tokens.

Operators: +, -, * and /
Separators: ; , {} ( and )
Integers (a string of digits)
Identifiers: first character must be a letter
             other characters can be letter or digit or underscore
             two underscores consecutively are not allowed
             last character cannot be an underscore
EOF token : represents end of file

Also supported are:

Comments: They start with # and end with #
Layout characters: all characters less than equal to ' ' are
    considered as layout characters

Limitations
----------------

Some specific limitations of this implemenation are:
- Its a limited form of recursive descent parser. Based on
  the first character in the token, we assume the token type
  and continue to read the rest of token.
- It doesn't track the source code position. This would actually
  require some form of monad since the source code position tracking
  is a sort of side effect to the lexical analysis.
- The lexical analyzer doesn't support too many debugging facilities.
- Lack of unit tests at this point of time.

-}



{- following data type defines the tokens supported by this lexer
   the C version in the MCD book uses #defines for different token
   classes. Haskell version is much more cleaner with proper data
   constructors attached to each token class and automatically
   generated show and eq functions.
-}
import System.IO

-- a token can be
data Token = EOF | -- end of file
             Identifier String | -- an identifier string
             IntegerToken Integer | -- an integer
             Operator Char | -- an operator
             Separator Char | -- a separator
             Error String -- or some erroneous token
             deriving (Show, Eq)



-- lexer reads a string and returns a list of tokens
lexer :: String->[Token]
{- we need to remove comments and other layout characters
   before reading a token. This is an example of conducting
   some preprocessing before doing an actual operation.
   Such a situation should be implemented using two sister
   functions (here lexer and lexer').
   lexer does the job of removing layout and comments
   lexer' does the job of reading a token
   lexer calls lexer' (for remaining data)
   lexer' calls back lexer (for remaining data)

   This kind of strucutre eliminates complicated if-then-else logic.
-}
lexer xs =  let s =  skipLayoutAndComment xs
            in lexer' s

-- for an empty string, lex returns an empty list
lexer' [] = [EOF]
-- for a non-empty string we read the first token
-- then we let lexer tokenize the rest of string
lexer' xs = let (t, rest)= nextToken xs
            in (t:lexer rest)

-- next token reads the input string and returns the next token and
-- rest of string
nextToken:: String->(Token, String)

nextToken [] = (EOF, [])
{- The implementation is a five way branch based on the first character.
   If the first character is a letter, we read an identifier.
   If the first character is a digit, we read a number.
   If the first character is an operator, we simply read an operator token.
   If the first character is a separator, we simply read a separator token.
   Otherwise, we consider this as an erroneous character and generate
   an error token for this character.

   The lexical analyzer shouldn't stop analyzing the file at any point
   of time. If it doesn't recognize any character or a string of characters
   in the file, then it should simply generate an Error token at that point
   of time and proceed further with reading next tokens.

   It should be the job of parser to identify real syntax errors in the
   code and report the user accordingly.
-}
nextToken l@(x:xs) | isLetter x =  let (first, rest) = readIdentfier l
                                      in (Identifier first, rest)
                  | isNumDigit x = let (first, rest) = readInteger l
                                       in (IntegerToken first, rest)
                  | isOperator x = (Operator x, xs)
                  | isSeparator x = (Separator x, xs)
                  | otherwise = (Error [x], xs)

-- reads an integer and returns the token plus rest of string
readInteger xs = let (first, rest)  = span isNumDigit xs in
                     (read first :: Integer, rest)


{-
An identifier consists of firstCharacter and restCharacter*
firstCharacter can be a letter
restCharacter can be letter or digit or underscore
more than one underscores consecutively are not allowed
this should be called only if first character of the input
list is a letter.

Here we assume that the first character is a letter. This
is actually ensured by the nextToken function defined above.
We thus focus on reading rest of identfier.
-}
readIdentfier [] = ([], [])
readIdentfier l@(x:xs) = let (first, rest) = readLettersOrDigits xs
                             (first', rest') = underscoreTails rest
                             identifier = (x:first) ++ first'
                             in (identifier, rest')


readLettersOrDigits :: String->(String, String)
{-
Reads an input string and splits it in two parts. The
first part (possibly empty) consists of a sequence of letter
or digit characters.

The 2nd part contains the rest of input string.
-}
readLettersOrDigits xs = span isLetterOrDigit xs

underscoreTail :: String->(String, String)
{-
Reads an undscore tail part of an identifier
Following rules must be observed:
- the tail starts with an underscore
- the underscore must be followed by a letter or a digit
- following this, there may be 0 or more letters or digits
-}
-- empty list would result in empty underscore tail
underscoreTail [] = ([], [])
-- at least two characters are required in undescore tail
underscoreTail [x] = ([], [x])
-- the first character must be an underscore
-- 2nd character must be a letter or digit
-- following characters (if any) must be letter or digit
underscoreTail l@(x:y:xs) = if isUnderScore x && isLetterOrDigit y
                               -- lets read the rest of letters/digits of this
                               -- tail
                               then let (first,rest) = readLettersOrDigits xs
                                    in (x:y:first, rest)
                               -- there is no underscoreTail here
                               else ("", l)

underscoreTails :: String->(String, String)
-- reads 0 or more occurances of underscoreTail
underscoreTails xs = let -- lets try to read the first underscoreTail
                         (first, rest) = underscoreTail xs in
                         if null first
                            -- if first tail is absent, then we just return
                            then (first, rest)
                            -- otherwise we read rest of the tail recursively
                            else let (first', rest') = underscoreTails rest
                                 -- we join the first tail and rest of the tail
                                 -- we return this along with unread string
                                 in (first ++ first' , rest')

-- skip layout and comment
skipLayoutAndComment = skipLayout . skipComment

-- skips layout
skipLayout [] = []
skipLayout l@(x:xs) = case isLayout x of
                                True -> skipLayout xs
                                False -> l
skipComment :: String->String
{- skips comment
The first character must be #.
Following this, we need to keep skipping till we find
another #.
-}
skipComment [] = []
skipComment l@(x:xs) = case isCommentStart x of
                                True -> -- we need to find the end of comment
                                    skipTillCommentEnd xs
                                False -> l

-- skips till the end of comment character is found
-- this should be called from skip comment function
-- this routine is perfect for recursive tail optimization
skipTillCommentEnd [] = []
skipTillCommentEnd l@(x:xs) = case isCommentEnd x of
                                True-> xs
                                False-> skipTillCommentEnd xs

-- character level tests
-- a layout character
isLayout ch = ch <= ' '
-- comment start character
isCommentStart ch = ch == '#'
-- comment end character
isCommentEnd ch = ch == '#'
-- upper case letter
isUpperCase ch = 'A' <= ch && ch <= 'Z'
-- lower case letter
isLowerCase ch = 'a' <= ch && ch <= 'z'
-- a letter
isLetter ch = isLowerCase ch || isUpperCase ch
-- a digit
isNumDigit ch = '0' <= ch && ch <= '9'
-- a letter or a digit
isLetterOrDigit ch = isLetter ch || isNumDigit ch
-- an underscore
isUnderScore ch = ch == '_'
-- an operator
isOperator ch = ch `elem` "+-*/"
-- a separator
isSeparator ch = ch `elem` ";,(){}"

main :: IO ()
main = do
    putStrLn "Simple Lexical Analyzer v0.1"
    mainloop

mainloop :: IO ()
mainloop = do
    putStr ">>> "
    hFlush stdout
    inputStr <- getLine
    case inputStr of
      "q" -> return ()
      otherwise ->
        do
          let tokens = lexer inputStr
          putStrLn $ show tokens
          mainloop

A typical session for this program looks like follows:


runghc MCDLex.hs
Simple Lexical Analyzer v0.1
>>> abc
[Identifier "abc",EOF]
>>> 2344
[IntegerToken 2344,EOF]
>>> +
[Operator '+',EOF]
>>> +-/*
[Operator '+',Operator '-',Operator '/',Operator '*',EOF]
>>> (){},;
[Separator '(',Separator ')',Separator '{',Separator '}',Separator ',',Separator ';',EOF]
>>> #thiscommentwillbeignored"  todo_list
[EOF]
>>> #thiscommentwillbeignored#  todo_list
[Identifier "todo_list",EOF]
>>> (3+4)
[Separator '(',IntegerToken 3,Operator '+',IntegerToken 4,Separator ')',EOF]
>>> errors__
[Identifier "errors",Error "_",Error "_",EOF]
>>> 23___45
[IntegerToken 23,Error "_",Error "_",Error "_",IntegerToken 45,EOF]
>>>

Note: I tried to use the "prettyprint" javascript support to beutify this code. But it seems like the pretty printer is not able to parse haskell properly. Hence I am removing it and showing the code in plain pre tag.

Programming tidbits

Tuesday, October 20, 2009

A simple lexical analyzer demo in Haskell

1 comment:

Followers

Labels

Search This Blog

Blog Archive