## Tuesday, October 20, 2009

### A simple lexical analyzer demo in Haskell

Do get some better Haskell practice, I developed a small lexical analyzer in Haskell. This uses basic Haskell functionality and doesn't use sophisticated libraries like Parsec.

Here is the output of this effort:

{-This script implements a simple lexical analysis programbased on the example in Modern Compiler Design Book Section 2.1Tokens--------The lexical analyzer supports following tokens.Operators: +, -, * and /Separators: ; , {} ( and )Integers (a string of digits)Identifiers: first character must be a letter             other characters can be letter or digit or underscore             two underscores consecutively are not allowed             last character cannot be an underscoreEOF token : represents end of fileAlso supported are:Comments: They start with # and end with #Layout characters: all characters less than equal to ' ' are    considered as layout charactersLimitations----------------Some specific limitations of this implemenation are:- Its a limited form of recursive descent parser. Based on  the first character in the token, we assume the token type  and continue to read the rest of token.- It doesn't track the source code position. This would actually  require some form of monad since the source code position tracking  is a sort of side effect to the lexical analysis.- The lexical analyzer doesn't support too many debugging facilities.- Lack of unit tests at this point of time.-}{- following data type defines the tokens supported by this lexer   the C version in the MCD book uses #defines for different token   classes. Haskell version is much more cleaner with proper data   constructors attached to each token class and automatically   generated show and eq functions.-}import System.IO-- a token can bedata Token = EOF | -- end of file             Identifier String | -- an identifier string             IntegerToken Integer | -- an integer             Operator Char | -- an operator             Separator Char | -- a separator             Error String -- or some erroneous token             deriving (Show, Eq)-- lexer reads a string and returns a list of tokenslexer :: String->[Token]{- we need to remove comments and other layout characters   before reading a token. This is an example of conducting   some preprocessing before doing an actual operation.   Such a situation should be implemented using two sister   functions (here lexer and lexer').   lexer does the job of removing layout and comments   lexer' does the job of reading a token   lexer calls lexer' (for remaining data)   lexer' calls back lexer (for remaining data)   This kind of strucutre eliminates complicated if-then-else logic.-}lexer xs =  let s =  skipLayoutAndComment xs            in lexer' s-- for an empty string, lex returns an empty listlexer' [] = [EOF]-- for a non-empty string we read the first token-- then we let lexer tokenize the rest of stringlexer' xs = let (t, rest)= nextToken xs            in (t:lexer rest)-- next token reads the input string and returns the next token and-- rest of stringnextToken:: String->(Token, String)nextToken [] = (EOF, []){- The implementation is a five way branch based on the first character.   If the first character is a letter, we read an identifier.   If the first character is a digit, we read a number.   If the first character is an operator, we simply read an operator token.   If the first character is a separator, we simply read a separator token.   Otherwise, we consider this as an erroneous character and generate   an error token for this character.   The lexical analyzer shouldn't stop analyzing the file at any point   of time. If it doesn't recognize any character or a string of characters   in the file, then it should simply generate an Error token at that point   of time and proceed further with reading next tokens.   It should be the job of parser to identify real syntax errors in the   code and report the user accordingly.-}nextToken l@(x:xs) | isLetter x =  let (first, rest) = readIdentfier l                                      in (Identifier first, rest)                  | isNumDigit x = let (first, rest) = readInteger l                                       in (IntegerToken first, rest)                  | isOperator x = (Operator x, xs)                  | isSeparator x = (Separator x, xs)                  | otherwise = (Error [x], xs)-- reads an integer and returns the token plus rest of stringreadInteger xs = let (first, rest)  = span isNumDigit xs in                     (read first :: Integer, rest){-An identifier consists of firstCharacter and restCharacter*firstCharacter can be a letterrestCharacter can be letter or digit or underscoremore than one underscores consecutively are not allowedthis should be called only if first character of the inputlist is a letter.Here we assume that the first character is a letter. Thisis actually ensured by the nextToken function defined above.We thus focus on reading rest of identfier.-}readIdentfier [] = ([], [])readIdentfier l@(x:xs) = let (first, rest) = readLettersOrDigits xs                             (first', rest') = underscoreTails rest                             identifier = (x:first) ++ first'                             in (identifier, rest')readLettersOrDigits :: String->(String, String){-Reads an input string and splits it in two parts. Thefirst part (possibly empty) consists of a sequence of letteror digit characters.The 2nd part contains the rest of input string.-}readLettersOrDigits xs = span isLetterOrDigit xsunderscoreTail :: String->(String, String){-Reads an undscore tail part of an identifierFollowing rules must be observed:- the tail starts with an underscore- the underscore must be followed by a letter or a digit- following this, there may be 0 or more letters or digits-}-- empty list would result in empty underscore tailunderscoreTail [] = ([], [])-- at least two characters are required in undescore tailunderscoreTail [x] = ([], [x])-- the first character must be an underscore-- 2nd character must be a letter or digit-- following characters (if any) must be letter or digitunderscoreTail l@(x:y:xs) = if isUnderScore x && isLetterOrDigit y                               -- lets read the rest of letters/digits of this                               -- tail                               then let (first,rest) = readLettersOrDigits xs                                    in (x:y:first, rest)                               -- there is no underscoreTail here                               else ("", l)underscoreTails :: String->(String, String)-- reads 0 or more occurances of underscoreTailunderscoreTails xs = let -- lets try to read the first underscoreTail                         (first, rest) = underscoreTail xs in                         if null first                            -- if first tail is absent, then we just return                            then (first, rest)                            -- otherwise we read rest of the tail recursively                            else let (first', rest') = underscoreTails rest                                 -- we join the first tail and rest of the tail                                 -- we return this along with unread string                                 in (first ++ first' , rest')-- skip layout and commentskipLayoutAndComment = skipLayout . skipComment-- skips layoutskipLayout [] = []skipLayout l@(x:xs) = case isLayout x of                                True -> skipLayout xs                                False -> lskipComment :: String->String{- skips commentThe first character must be #.Following this, we need to keep skipping till we findanother #.-}skipComment [] = []skipComment l@(x:xs) = case isCommentStart x of                                True -> -- we need to find the end of comment                                    skipTillCommentEnd xs                                False -> l-- skips till the end of comment character is found-- this should be called from skip comment function-- this routine is perfect for recursive tail optimizationskipTillCommentEnd [] = []skipTillCommentEnd l@(x:xs) = case isCommentEnd x of                                True-> xs                                False-> skipTillCommentEnd xs-- character level tests-- a layout characterisLayout ch = ch <= ' '-- comment start characterisCommentStart ch = ch == '#'-- comment end characterisCommentEnd ch = ch == '#'-- upper case letterisUpperCase ch = 'A' <= ch && ch <= 'Z'-- lower case letterisLowerCase ch = 'a' <= ch && ch <= 'z'-- a letterisLetter ch = isLowerCase ch || isUpperCase ch-- a digitisNumDigit ch = '0' <= ch && ch <= '9'-- a letter or a digitisLetterOrDigit ch = isLetter ch || isNumDigit ch-- an underscoreisUnderScore ch = ch == '_'-- an operatorisOperator ch = ch elem "+-*/"-- a separatorisSeparator ch = ch elem ";,(){}"main :: IO ()main = do    putStrLn "Simple Lexical Analyzer v0.1"    mainloopmainloop :: IO ()mainloop = do    putStr ">>> "    hFlush stdout    inputStr <- getLine    case inputStr of      "q" -> return ()      otherwise ->        do          let tokens = lexer inputStr          putStrLn \$ show tokens          mainloop

A typical session for this program looks like follows:

runghc MCDLex.hsSimple Lexical Analyzer v0.1>>> abc[Identifier "abc",EOF]>>> 2344[IntegerToken 2344,EOF]>>> +[Operator '+',EOF]>>> +-/*[Operator '+',Operator '-',Operator '/',Operator '*',EOF]>>> (){},;[Separator '(',Separator ')',Separator '{',Separator '}',Separator ',',Separator ';',EOF]>>> #thiscommentwillbeignored"  todo_list[EOF]>>> #thiscommentwillbeignored#  todo_list[Identifier "todo_list",EOF]>>> (3+4)[Separator '(',IntegerToken 3,Operator '+',IntegerToken 4,Separator ')',EOF]>>> errors__[Identifier "errors",Error "_",Error "_",EOF]>>> 23___45[IntegerToken 23,Error "_",Error "_",Error "_",IntegerToken 45,EOF]>>>

Note: I tried to use the "prettyprint" javascript support to beutify this code. But it seems like the pretty printer is not able to parse haskell properly. Hence I am removing it and showing the code in plain pre tag.

#### 1 comment:

Perifort said...

so good an example!thx~