## Tuesday, October 20, 2009

### A simple lexical analyzer demo in Haskell

Do get some better Haskell practice, I developed a small lexical analyzer in Haskell. This uses basic Haskell functionality and doesn't use sophisticated libraries like Parsec.

Here is the output of this effort:

{-This script implements a simple lexical analysis programbased on the example in Modern Compiler Design Book Section 2.1Tokens--------The lexical analyzer supports following tokens.Operators: +, -, * and /Separators: ; , {} ( and )Integers (a string of digits)Identifiers: first character must be a letter             other characters can be letter or digit or underscore             two underscores consecutively are not allowed             last character cannot be an underscoreEOF token : represents end of fileAlso supported are:Comments: They start with # and end with #Layout characters: all characters less than equal to ' ' are    considered as layout charactersLimitations----------------Some specific limitations of this implemenation are:- Its a limited form of recursive descent parser. Based on  the first character in the token, we assume the token type  and continue to read the rest of token.- It doesn't track the source code position. This would actually  require some form of monad since the source code position tracking  is a sort of side effect to the lexical analysis.- The lexical analyzer doesn't support too many debugging facilities.- Lack of unit tests at this point of time.-}{- following data type defines the tokens supported by this lexer   the C version in the MCD book uses #defines for different token   classes. Haskell version is much more cleaner with proper data   constructors attached to each token class and automatically   generated show and eq functions.-}import System.IO-- a token can bedata Token = EOF | -- end of file             Identifier String | -- an identifier string             IntegerToken Integer | -- an integer             Operator Char | -- an operator             Separator Char | -- a separator             Error String -- or some erroneous token             deriving (Show, Eq)-- lexer reads a string and returns a list of tokenslexer :: String->[Token]{- we need to remove comments and other layout characters   before reading a token. This is an example of conducting   some preprocessing before doing an actual operation.   Such a situation should be implemented using two sister   functions (here lexer and lexer').   lexer does the job of removing layout and comments   lexer' does the job of reading a token   lexer calls lexer' (for remaining data)   lexer' calls back lexer (for remaining data)   This kind of strucutre eliminates complicated if-then-else logic.-}lexer xs =  let s =  skipLayoutAndComment xs            in lexer' s-- for an empty string, lex returns an empty listlexer' [] = [EOF]-- for a non-empty string we read the first token-- then we let lexer tokenize the rest of stringlexer' xs = let (t, rest)= nextToken xs            in (t:lexer rest)-- next token reads the input string and returns the next token and-- rest of stringnextToken:: String->(Token, String)nextToken [] = (EOF, []){- The implementation is a five way branch based on the first character.   If the first character is a letter, we read an identifier.   If the first character is a digit, we read a number.   If the first character is an operator, we simply read an operator token.   If the first character is a separator, we simply read a separator token.   Otherwise, we consider this as an erroneous character and generate   an error token for this character.   The lexical analyzer shouldn't stop analyzing the file at any point   of time. If it doesn't recognize any character or a string of characters   in the file, then it should simply generate an Error token at that point   of time and proceed further with reading next tokens.   It should be the job of parser to identify real syntax errors in the   code and report the user accordingly.-}nextToken l@(x:xs) | isLetter x =  let (first, rest) = readIdentfier l                                      in (Identifier first, rest)                  | isNumDigit x = let (first, rest) = readInteger l                                       in (IntegerToken first, rest)                  | isOperator x = (Operator x, xs)                  | isSeparator x = (Separator x, xs)                  | otherwise = (Error [x], xs)-- reads an integer and returns the token plus rest of stringreadInteger xs = let (first, rest)  = span isNumDigit xs in                     (read first :: Integer, rest){-An identifier consists of firstCharacter and restCharacter*firstCharacter can be a letterrestCharacter can be letter or digit or underscoremore than one underscores consecutively are not allowedthis should be called only if first character of the inputlist is a letter.Here we assume that the first character is a letter. Thisis actually ensured by the nextToken function defined above.We thus focus on reading rest of identfier.-}readIdentfier [] = ([], [])readIdentfier l@(x:xs) = let (first, rest) = readLettersOrDigits xs                             (first', rest') = underscoreTails rest                             identifier = (x:first) ++ first'                             in (identifier, rest')readLettersOrDigits :: String->(String, String){-Reads an input string and splits it in two parts. Thefirst part (possibly empty) consists of a sequence of letteror digit characters.The 2nd part contains the rest of input string.-}readLettersOrDigits xs = span isLetterOrDigit xsunderscoreTail :: String->(String, String){-Reads an undscore tail part of an identifierFollowing rules must be observed:- the tail starts with an underscore- the underscore must be followed by a letter or a digit- following this, there may be 0 or more letters or digits-}-- empty list would result in empty underscore tailunderscoreTail [] = ([], [])-- at least two characters are required in undescore tailunderscoreTail [x] = ([], [x])-- the first character must be an underscore-- 2nd character must be a letter or digit-- following characters (if any) must be letter or digitunderscoreTail l@(x:y:xs) = if isUnderScore x && isLetterOrDigit y                               -- lets read the rest of letters/digits of this                               -- tail                               then let (first,rest) = readLettersOrDigits xs                                    in (x:y:first, rest)                               -- there is no underscoreTail here                               else ("", l)underscoreTails :: String->(String, String)-- reads 0 or more occurances of underscoreTailunderscoreTails xs = let -- lets try to read the first underscoreTail                         (first, rest) = underscoreTail xs in                         if null first                            -- if first tail is absent, then we just return                            then (first, rest)                            -- otherwise we read rest of the tail recursively                            else let (first', rest') = underscoreTails rest                                 -- we join the first tail and rest of the tail                                 -- we return this along with unread string                                 in (first ++ first' , rest')-- skip layout and commentskipLayoutAndComment = skipLayout . skipComment-- skips layoutskipLayout [] = []skipLayout l@(x:xs) = case isLayout x of                                True -> skipLayout xs                                False -> lskipComment :: String->String{- skips commentThe first character must be #.Following this, we need to keep skipping till we findanother #.-}skipComment [] = []skipComment l@(x:xs) = case isCommentStart x of                                True -> -- we need to find the end of comment                                    skipTillCommentEnd xs                                False -> l-- skips till the end of comment character is found-- this should be called from skip comment function-- this routine is perfect for recursive tail optimizationskipTillCommentEnd [] = []skipTillCommentEnd l@(x:xs) = case isCommentEnd x of                                True-> xs                                False-> skipTillCommentEnd xs-- character level tests-- a layout characterisLayout ch = ch <= ' '-- comment start characterisCommentStart ch = ch == '#'-- comment end characterisCommentEnd ch = ch == '#'-- upper case letterisUpperCase ch = 'A' <= ch && ch <= 'Z'-- lower case letterisLowerCase ch = 'a' <= ch && ch <= 'z'-- a letterisLetter ch = isLowerCase ch || isUpperCase ch-- a digitisNumDigit ch = '0' <= ch && ch <= '9'-- a letter or a digitisLetterOrDigit ch = isLetter ch || isNumDigit ch-- an underscoreisUnderScore ch = ch == '_'-- an operatorisOperator ch = ch elem "+-*/"-- a separatorisSeparator ch = ch elem ";,(){}"main :: IO ()main = do    putStrLn "Simple Lexical Analyzer v0.1"    mainloopmainloop :: IO ()mainloop = do    putStr ">>> "    hFlush stdout    inputStr <- getLine    case inputStr of      "q" -> return ()      otherwise ->        do          let tokens = lexer inputStr          putStrLn \$ show tokens          mainloop

A typical session for this program looks like follows:

runghc MCDLex.hsSimple Lexical Analyzer v0.1>>> abc[Identifier "abc",EOF]>>> 2344[IntegerToken 2344,EOF]>>> +[Operator '+',EOF]>>> +-/*[Operator '+',Operator '-',Operator '/',Operator '*',EOF]>>> (){},;[Separator '(',Separator ')',Separator '{',Separator '}',Separator ',',Separator ';',EOF]>>> #thiscommentwillbeignored"  todo_list[EOF]>>> #thiscommentwillbeignored#  todo_list[Identifier "todo_list",EOF]>>> (3+4)[Separator '(',IntegerToken 3,Operator '+',IntegerToken 4,Separator ')',EOF]>>> errors__[Identifier "errors",Error "_",Error "_",EOF]>>> 23___45[IntegerToken 23,Error "_",Error "_",Error "_",IntegerToken 45,EOF]>>>

Note: I tried to use the "prettyprint" javascript support to beutify this code. But it seems like the pretty printer is not able to parse haskell properly. Hence I am removing it and showing the code in plain pre tag.

## Sunday, October 18, 2009

### AST annotations

AST (Abstract Syntax Trees) annotation is an important step in compilation process. I have been trying to figure out how these would be implemented in Haskell. By studying a paper on SPARK (The design and implementation of SPARK, A Toolkit for Implementing Domain Specific Languages), I sort of developed a general intuitive understanding of this process. You need to traverse over the AST in either pre-order or post-order or a mix of pre and post order. And you need to add extra data to each node in the AST during this traversal. Additionally, you can conduct an extensive set of semantic tests during this traversal process.

So how does it map to Haskell? Here are my guesses AST annotation is essentially an isomorphic process which takes an input AST and generates an output AST. The structure of AST is maintained isomorphically while additional information is stuck on each node in the AST during the transformation process. Also, if there are semantic errors detected during the annotation, we simply generate a list of such errors and return an empty AST as a result. Eigher AST or Errors. Naturally errors will have a reference to source code locations where the problem was observed.

I don't know, whether my understanding is correct or not. I guess I am reasonably correct. :-) I need to study a lot more in depth.

## Thursday, October 15, 2009

### Hope Lexical analyzer is working

I wrote quite a significant part of lexical analyzer for Hope (http://kenai.com/projects/hope) (Haskell Oriented Python Environment). Its primarily based on Parsec library. Currently the code is organized as follows. Interpreter/src directory contains a folder PyLex. PyLex package contains various files which implement the lexical analysis for Hope. The main module is PyLex.Lexer. You can start with the function pyLexer which a Parsec based parser and returns a list of tokens. Each token is a pair of (SourcePos, PyToken). PyToken data-type represents various types of tokens supported in Hope's lexical analyzer. The tokens have been defined in PyLex.Tokens file. The implementation of parsing for these tokens is spread over multiple modules (Identifier, Number, Operator, String).

I faced quite a few issues in reaching up to this stage. I read about Parsec from a number of sources.
• Parsec tutorial and reference guide
• Real world Haskell (online book)
• Develop your own scheme interpreter in 24 hours
It took me quite a while in getting a good feel about how to use Parsec. I would try to write about some of the things I learnt. The import statements in the Parsec tutorial are outdated. Correct imports are:

import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as PT
import qualified Text.ParserCombinators.Parsec.Language as PL
import qualified Text.ParserCombinators.Parsec.Pos as PP

(I used qualified imports, you may not need them).

The tutorial says that 2 stage lexical analysis and parsing is the right way to go. It says that the lexical analyzer should return a list of tokens and their source positions. It took me quite a while to figure out how to extract the source position inside the character parser when required. And it turns out to be quite easy. Here is an example (sorry for improper indentation):

-- unsupported character inside a python file
pyUnsupported :: Parser Token
pyUnsupported = do pos <- getPosition
c <- anyChar
return (pos, (PyUnsupported c))
"unsupported character"

Important thing to notice is the pos <- getPosition call. This is the way we can extract the source position.

A nice function I wrote to convert a list of digits in base N to an integer. Here it is:

-- converts a string of digits to an integer for a given base
-- n is an accumulator which should be initialized with 0
-- example : convToDecimal [1,1,0,0] 2 0
convToDecimal :: [Int]->Integer->Integer->Integer
convToDecimal [] base n = n
convToDecimal (x:xs) base n = convToDecimal xs base (n * base + fromIntegral x)

The syntax for strings in Python is really quite sophisticated. You have got short strings and long strings ("""abc""" types). And both strings can have single quotes or double quotes. And then you have raw strings. And finally unicode strings and byte strings have different prefixes. Off course a lot of escape characters have to be especially handled inside the strings. My current implementation is not comprehensive but it covers pretty much all normal unicode strings. It doesn't yet cover raw strings and byte strings. I would cover them later. First I got to get the interpreter working. I don't really know if my parser is really great. I don't have enough Haskell experience yet to be able to say that. Still I am happy I got something working finally.

Currently going through Modern Compiler Design book. I hope to get the parser working and ASTs generated. May be then I will be able to translate them into Python Byte Codes and then get an interpreter virtual machine working! Its a lot of work and I am not a computer scientist :)

## Thursday, October 1, 2009

### PyCon India 2009 My impressions

I attended 1st day (26 September 2009) of the PyCon India 2009 conference. It was first Python Conference in India and I felt inclined enough to go all over to Bangalore from my home place Noida and attend the conference.

The conference was organized at IISc Bangalore. From the main gate to reach the conference area was easy as proper directions had been placed all over. Registration was a bit chaotic. There were not enough desks to handle the huge crowd waiting to be registered. I guess it was still quite satisfactory process considering the fact that it was first such event in India and there were not enough funds (I guess) to manage the event professionally. Since I had pre-registered, I had to pay just 200 rupees for attendance. They provided us with schedule, a T shirt, a notebook, a pen and off course a badge which was quite good as a conference kit. And the lunch and tea breaks were excellent.

There were about 600+ delegates participating in the conference. The profile varied between students, professors, independent consultants, professional programmers etc.

The first session by Prabhu Ramchandran (My adventures) was quite nice. It was interesting to see how an open source project Mayavi had been run successfully over the years out of India. Since there were three parallel tracks going on, hence I had to identify the sessions I wished to attend.

Some of the sessions had a lot of issues related to presentation. The presentation on zc.buildout was quite boring. The presenter spent too much time just discussing what is isolation and repeatability that there wasn't much time left for actual demonstration. Similarly, the session on DJango turned out to be quite boring. There were three presenters taking turns in the talk. They spent a lot of time preaching about DJango, like what it is, why it is great, etc. etc. They also spend a lot of time typing the code which they wished to demonstrate. I guess one should be very careful in using the 45 minutes available during a conference time and make the best use of it as possible. One should avoid time wasters like typing code and fixing bugs during the talk itself. Another boring session was on Semantic web. The presenter was explaining a whole lot of difficult concepts. It seemed like a typical IIT lecture in which students preferred to sleep through or if possible bunk them.

Another interesting talk was 'Jython at Strand LS' by Janaki where the speaker explained how they used Jython at their company. Examples were short and sweet. Illustrations were easy to understand and the speaker had a great sense of humour. I was impressed to see the extensive application of Jython in a production quality platform like AVADIS (TM) at Strand.

Finally, my own talk on using ZPT and RML for generating PDF in Python applications came. I didn't expect too many to attend the talk but I was happy to see about 30+ ppl still continuing to sit for my session. Some of them left during the session which meant that they didn't really enjoy the content much. Others continued through the session and I guess they liked what I talked. I would leave it for others to comment on how good the session was.

There were tracks for lightning talks but I guess very few speakers were actually prepared to conduct them. There were lots of talks registered but ultimately many of them were not delivered.

Some of the things I feel that should be improved for future conferences:

• Expected audience of a talk should be more clearly explained
• Focus should be more on specific coding workshops rather than general talks
• Longer workshops on specific topics should be conducted if there are speakers available
• There was no peer review of presentation material. This should be extensively fixed.
• The work on presentation materials should actually finish 2-3 months before the conference at least so that peer reviews and revisions can be completed say about 2 weeks before actual conference.
• Scheduling of talks should happen through a more transparent process
• Registered delegates should receive notifications about talks (when its announced, when presentation materials are uploaded etc.) over mail as the process keeps on going before the actual conference
• The talks without presentation material should be rejected in the final schedule.
• Speakers should avoid wasting too much time talking about abstract concepts and spend more on illustrations. They should also avoid typing code during the session.
• Multi-speaker talks should be minimized