Tuesday, October 20, 2009

A simple lexical analyzer demo in Haskell

To get some more Haskell practice, I developed a small lexical analyzer in Haskell. It uses only basic Haskell functionality and doesn't rely on sophisticated libraries like Parsec.

Here is the result of this effort:



{-
This script implements a simple lexical analysis program
based on the example in Modern Compiler Design Book Section 2.1


Tokens
--------

The lexical analyzer supports the following tokens.

Operators: +, -, * and /
Separators: ; , { } ( and )
Integers: a string of digits
Identifiers: the first character must be a letter;
    other characters can be a letter, digit or underscore;
    two consecutive underscores are not allowed;
    the last character cannot be an underscore
EOF token: represents end of file

Also supported are:

Comments: they start with # and end with #
Layout characters: all characters less than or equal to ' ' are
    treated as layout characters

Limitations
----------------

Some specific limitations of this implementation are:
- It's a limited form of recursive descent parsing. Based on
  the first character of the token, we assume the token type
  and continue to read the rest of the token.
- It doesn't track the source code position. Doing so would
  require some form of monad, since position tracking is a sort
  of side effect of the lexical analysis.
- The lexical analyzer doesn't provide many debugging facilities.
- There are no unit tests at this point (a small sketch of such
  tests follows the sample session below).

-}



{- The following data type defines the tokens supported by this lexer.
The C version in the MCD book uses #defines for the different token
classes. The Haskell version is much cleaner, with proper data
constructors for each token class and automatically
derived Show and Eq instances.
-}
import System.IO

-- a token can be
data Token = EOF                  -- end of file
           | Identifier String    -- an identifier string
           | IntegerToken Integer -- an integer
           | Operator Char        -- an operator
           | Separator Char       -- a separator
           | Error String         -- or some erroneous token
           deriving (Show, Eq)



-- lexer reads a string and returns a list of tokens
lexer :: String->[Token]
{- we need to remove comments and other layout characters
before reading a token. This is an example of conducting
some preprocessing before doing an actual operation.
Such a situation should be implemented using two sister
functions (here lexer and lexer').
lexer does the job of removing layout and comments
lexer' does the job of reading a token
lexer calls lexer' (for remaining data)
lexer' calls back lexer (for remaining data)

This kind of structure eliminates complicated if-then-else logic.
-}
lexer xs = let s = skipLayoutAndComment xs
           in lexer' s

-- for an empty string, lexer' returns just the EOF token
lexer' [] = [EOF]
-- for a non-empty string we read the first token
-- then we let lexer tokenize the rest of string
lexer' xs = let (t, rest) = nextToken xs
            in (t : lexer rest)

-- nextToken reads the input string and returns the next token and
-- the rest of the string
nextToken:: String->(Token, String)

nextToken [] = (EOF, [])
{- The implementation is a five-way branch based on the first character.
If the first character is a letter, we read an identifier.
If the first character is a digit, we read a number.
If the first character is an operator, we simply read an operator token.
If the first character is a separator, we simply read a separator token.
Otherwise, we consider this as an erroneous character and generate
an error token for this character.

The lexical analyzer shouldn't stop analyzing the file at any point.
If it doesn't recognize a character or a string of characters in the
file, it should simply generate an Error token at that point and
proceed with reading the next tokens.

It is the parser's job to identify real syntax errors in the
code and report them to the user.
-}
nextToken l@(x:xs)
  | isLetter x    = let (first, rest) = readIdentifier l
                    in (Identifier first, rest)
  | isNumDigit x  = let (first, rest) = readInteger l
                    in (IntegerToken first, rest)
  | isOperator x  = (Operator x, xs)
  | isSeparator x = (Separator x, xs)
  | otherwise     = (Error [x], xs)

-- reads an integer and returns its value plus the rest of the string
readInteger xs = let (first, rest) = span isNumDigit xs
                 in (read first :: Integer, rest)


{-
An identifier consists of a firstCharacter and restCharacter*.
firstCharacter must be a letter.
restCharacter can be a letter, digit or underscore.
Consecutive underscores are not allowed.
This should be called only if the first character of the input
list is a letter.

Here we assume that the first character is a letter. This
is actually ensured by the nextToken function defined above.
We thus focus on reading the rest of the identifier.
-}
readIdentifier [] = ([], [])
readIdentifier (x:xs) = let (first, rest)   = readLettersOrDigits xs
                            (first', rest') = underscoreTails rest
                            identifier      = (x:first) ++ first'
                        in (identifier, rest')


readLettersOrDigits :: String->(String, String)
{-
Reads an input string and splits it into two parts. The
first part (possibly empty) consists of a sequence of letters
and digits.

The second part contains the rest of the input string.
-}
readLettersOrDigits xs = span isLetterOrDigit xs

underscoreTail :: String->(String, String)
{-
Reads an underscore tail part of an identifier.
The following rules must be observed:
- the tail starts with an underscore
- the underscore must be followed by a letter or a digit
- following this, there may be 0 or more letters or digits
-}
-- empty list would result in empty underscore tail
underscoreTail [] = ([], [])
-- at least two characters are required in an underscore tail
underscoreTail [x] = ([], [x])
-- the first character must be an underscore
-- 2nd character must be a letter or digit
-- following characters (if any) must be letter or digit
underscoreTail l@(x:y:xs) =
  if isUnderScore x && isLetterOrDigit y
    -- read the rest of the letters/digits of this tail
    then let (first, rest) = readLettersOrDigits xs
         in (x:y:first, rest)
    -- there is no underscore tail here
    else ("", l)

underscoreTails :: String->(String, String)
-- reads 0 or more occurrences of underscoreTail
underscoreTails xs =
  let -- try to read the first underscore tail
      (first, rest) = underscoreTail xs
  in if null first
       -- if the first tail is absent, then we just return
       then (first, rest)
       -- otherwise we read the remaining tails recursively
       else let (first', rest') = underscoreTails rest
            -- we join the first tail and the remaining tails
            -- and return this along with the unread string
            in (first ++ first', rest')

-- skip layout and comment
skipLayoutAndComment = skipLayout . skipComment

-- skips layout
skipLayout [] = []
skipLayout l@(x:xs) = case isLayout x of
                        True  -> skipLayout xs
                        False -> l

skipComment :: String->String
{- skips comment
The first character must be #.
Following this, we need to keep skipping till we find
another #.
-}
skipComment [] = []
skipComment l@(x:xs) = case isCommentStart x of
                         True  -> skipTillCommentEnd xs  -- find the end of the comment
                         False -> l

-- skips till the end of comment character is found
-- this should be called from skip comment function
-- this routine is a natural candidate for tail call optimization
skipTillCommentEnd [] = []
skipTillCommentEnd (x:xs) = case isCommentEnd x of
                              True  -> xs
                              False -> skipTillCommentEnd xs

-- character level tests
-- a layout character
isLayout ch = ch <= ' '
-- comment start character
isCommentStart ch = ch == '#'
-- comment end character
isCommentEnd ch = ch == '#'
-- upper case letter
isUpperCase ch = 'A' <= ch && ch <= 'Z'
-- lower case letter
isLowerCase ch = 'a' <= ch && ch <= 'z'
-- a letter
isLetter ch = isLowerCase ch || isUpperCase ch
-- a digit
isNumDigit ch = '0' <= ch && ch <= '9'
-- a letter or a digit
isLetterOrDigit ch = isLetter ch || isNumDigit ch
-- an underscore
isUnderScore ch = ch == '_'
-- an operator
isOperator ch = ch `elem` "+-*/"
-- a separator
isSeparator ch = ch `elem` ";,(){}"

main :: IO ()
main = do
  putStrLn "Simple Lexical Analyzer v0.1"
  mainloop

mainloop :: IO ()
mainloop = do
  putStr ">>> "
  hFlush stdout
  inputStr <- getLine
  case inputStr of
    "q" -> return ()
    _   -> do
      let tokens = lexer inputStr
      putStrLn $ show tokens
      mainloop







A typical session with this program looks as follows:


runghc MCDLex.hs
Simple Lexical Analyzer v0.1
>>> abc
[Identifier "abc",EOF]
>>> 2344
[IntegerToken 2344,EOF]
>>> +
[Operator '+',EOF]
>>> +-/*
[Operator '+',Operator '-',Operator '/',Operator '*',EOF]
>>> (){},;
[Separator '(',Separator ')',Separator '{',Separator '}',Separator ',',Separator ';',EOF]
>>> #thiscommentwillbeignored" todo_list
[EOF]
>>> #thiscommentwillbeignored# todo_list
[Identifier "todo_list",EOF]
>>> (3+4)
[Separator '(',IntegerToken 3,Operator '+',IntegerToken 4,Separator ')',EOF]
>>> errors__
[Identifier "errors",Error "_",Error "_",EOF]
>>> 23___45
[IntegerToken 23,Error "_",Error "_",Error "_",IntegerToken 45,EOF]
>>>
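
The limitations section above mentions the lack of unit tests. As a minimal sketch of what such tests could look like, the cases below just compare the lexer output against the token lists from the session; the names expected and runTests are made up for illustration and no test framework is assumed:

-- expected token lists, taken from the sample session above
expected :: [(String, [Token])]
expected =
  [ ("abc",      [Identifier "abc", EOF])
  , ("(3+4)",    [Separator '(', IntegerToken 3, Operator '+',
                  IntegerToken 4, Separator ')', EOF])
  , ("#skip# x", [Identifier "x", EOF])
  ]

-- run each case through the lexer and report PASS/FAIL
runTests :: IO ()
runTests = mapM_ check expected
  where check (input, toks)
          | lexer input == toks = putStrLn ("PASS: " ++ show input)
          | otherwise           = putStrLn ("FAIL: " ++ show input)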


Note: I tried to use the "prettyprint" JavaScript support to beautify this code, but it seems the pretty printer is not able to parse Haskell properly. Hence I have removed it and am showing the code in a plain pre tag.

Sunday, October 18, 2009

AST annotations

AST (Abstract Syntax Tree) annotation is an important step in the compilation process. I have been trying to figure out how it would be implemented in Haskell. By studying a paper on SPARK (The Design and Implementation of SPARK, a Toolkit for Implementing Domain Specific Languages), I developed a general, intuitive understanding of the process. You need to traverse the AST in pre-order, post-order, or a mix of the two, and add extra data to each node during this traversal. Additionally, you can conduct an extensive set of semantic checks during the traversal.

So how does it map to Haskell? Here are my guesses. AST annotation is essentially an isomorphic process which takes an input AST and generates an output AST. The structure of the AST is preserved while additional information is attached to each node during the transformation. Also, if semantic errors are detected during annotation, we simply generate a list of such errors instead of an AST: the result is either an AST or a list of errors. Naturally, the errors will carry references to the source code locations where the problems were observed.
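
For what it is worth, here is a minimal Haskell sketch of that intuition. All the names (Expr, AExpr, SemanticError, annotate) are made up for illustration and have nothing to do with SPARK; the point is only the shape: a structure-preserving traversal that returns either an annotated tree or a list of errors.

-- a tiny expression AST (hypothetical)
data Expr = Num Int
          | Add Expr Expr
          deriving Show

-- the annotated AST mirrors the input AST, with one extra piece of
-- information per node (here just an inferred type name)
data AExpr = ANum Int String
           | AAdd AExpr AExpr String
           deriving Show

type SemanticError = String

-- post-order traversal: annotate the children first, then the node;
-- the result is either a list of errors or an isomorphic annotated tree
annotate :: Expr -> Either [SemanticError] AExpr
annotate (Num n)
  | n < 0     = Left ["negative literal not allowed"]
  | otherwise = Right (ANum n "int")
annotate (Add l r) =
  case (annotate l, annotate r) of
    (Right l', Right r') -> Right (AAdd l' r' "int")
    (le, re)             -> Left (errs le ++ errs re)
  where errs (Left es) = es
        errs _         = []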

I don't know whether my understanding is correct or not; I guess I am reasonably close. :-) I need to study this in a lot more depth.

Thursday, October 15, 2009

Hope Lexical analyzer is working

I wrote quite a significant part of the lexical analyzer for Hope (http://kenai.com/projects/hope) (Haskell Oriented Python Environment). It is primarily based on the Parsec library. Currently the code is organized as follows. The Interpreter/src directory contains a folder PyLex. The PyLex package contains various files which implement the lexical analysis for Hope. The main module is PyLex.Lexer. You can start with the function pyLexer, which is a Parsec-based parser that returns a list of tokens. Each token is a pair of (SourcePos, PyToken). The PyToken data type represents the various types of tokens supported in Hope's lexical analyzer. The tokens are defined in the PyLex.Tokens file. The implementation of parsing for these tokens is spread over multiple modules (Identifier, Number, Operator, String).

I faced quite a few issues in reaching this stage. I read about Parsec from a number of sources.
  • Parsec tutorial and reference guide
  • Real world Haskell (online book)
  • Write Yourself a Scheme in 48 Hours
It took me quite a while to get a good feel for how to use Parsec. I will try to write about some of the things I learnt. The import statements in the Parsec tutorial are outdated; the correct imports are:

import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as PT
import qualified Text.ParserCombinators.Parsec.Language as PL
import qualified Text.ParserCombinators.Parsec.Pos as PP

(I used qualified imports; you may not need them.)

The tutorial says that two-stage lexical analysis and parsing is the right way to go. It says that the lexical analyzer should return a list of tokens and their source positions. It took me quite a while to figure out how to extract the source position inside a character parser when required, and it turns out to be quite easy. Here is an example:

-- unsupported character inside a Python file
pyUnsupported :: Parser Token
pyUnsupported = (do pos <- getPosition
                    c   <- anyChar
                    return (pos, PyUnsupported c))
                <?> "unsupported character"


The important thing to notice is the pos <- getPosition call. This is how we extract the source position.
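
Building on that, the two-stage approach simply means running such positioned token parsers repeatedly with many to produce the whole token list. The following is only a toy sketch (whitespace-separated words, with a plain String payload instead of PyToken), not the actual PyLex code:

import Text.ParserCombinators.Parsec

-- toy token: a source position plus a String payload
type Tok = (SourcePos, String)

word :: Parser Tok
word = do pos <- getPosition
          w   <- many1 letter
          spaces
          return (pos, w)

-- the "lexer": a list of positioned tokens for the whole input
tokenize :: Parser [Tok]
tokenize = do spaces
              toks <- many word
              eof
              return toks

-- e.g. parse tokenize "<interactive>" "hello  world"
-- yields each word tagged with its line and column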

Here is a nice function I wrote to convert a list of digits in base N to an integer:

-- converts a string of digits to an integer for a given base
-- n is an accumulator which should be initialized with 0
-- example : convToDecimal [1,1,0,0] 2 0
convToDecimal :: [Int]->Integer->Integer->Integer
convToDecimal [] base n = n
convToDecimal (x:xs) base n = convToDecimal xs base (n * base + fromIntegral x)
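
For example, tracing the accumulator for the binary digits [1,1,0,0]:

-- convToDecimal [1,1,0,0] 2 0
--   = convToDecimal [1,0,0] 2 1     -- 0*2 + 1
--   = convToDecimal [0,0]   2 3     -- 1*2 + 1
--   = convToDecimal [0]     2 6     -- 3*2 + 0
--   = convToDecimal []      2 12    -- 6*2 + 0
--   = 12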


The syntax for strings in Python is really quite sophisticated. You have short strings and long strings (the """abc""" kind), and both can use single quotes or double quotes. Then you have raw strings. Finally, Unicode strings and byte strings have different prefixes. Of course, a lot of escape characters have to be handled specially inside the strings. My current implementation is not comprehensive, but it covers pretty much all normal Unicode strings. It doesn't yet cover raw strings and byte strings; I will cover them later. First I have got to get the interpreter working. I don't really know if my parser is really great; I don't have enough Haskell experience yet to be able to say that. Still, I am happy I finally got something working.
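
To give a rough idea of how the quote variants can be distinguished, here is a minimal Parsec sketch. The names shortString and longString are hypothetical and the escape handling is reduced to a couple of cases; the actual PyLex.String module is considerably more involved:

import Text.ParserCombinators.Parsec

-- a single- or double-quoted short string with minimal escape handling
shortString :: Parser String
shortString = do q <- oneOf "'\""
                 s <- many (escaped <|> noneOf [q, '\\', '\n'])
                 _ <- char q
                 return s
  where escaped = do _ <- char '\\'
                     c <- anyChar
                     return (case c of
                               'n' -> '\n'
                               't' -> '\t'
                               _   -> c)

-- a triple-double-quoted long string, with no escape handling at all
longString :: Parser String
longString = do _ <- try (string "\"\"\"")
                manyTill anyChar (try (string "\"\"\""))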


I am currently going through the Modern Compiler Design book. I hope to get the parser working and ASTs generated. Maybe then I will be able to translate them into Python byte codes and get an interpreter virtual machine working! It's a lot of work and I am not a computer scientist :)

Thursday, October 1, 2009

PyCon India 2009: My impressions

I attended the first day (26 September 2009) of the PyCon India 2009 conference. It was the first Python conference in India and I felt inclined enough to travel all the way from my home in Noida to Bangalore to attend it.

The conference was organized at IISc Bangalore. Reaching the conference area from the main gate was easy, as proper directions had been placed all over. Registration was a bit chaotic: there were not enough desks to handle the huge crowd waiting to be registered. I guess it was still a quite satisfactory process, considering that it was the first such event in India and there were (I guess) not enough funds to manage the event professionally. Since I had pre-registered, I had to pay just 200 rupees for attendance. The conference kit was quite good: a schedule, a T-shirt, a notebook, a pen and of course a badge. The lunch and tea breaks were excellent.

There were 600+ delegates participating in the conference, with profiles ranging from students and professors to independent consultants and professional programmers.

The first session by Prabhu Ramchandran (My adventures) was quite nice. It was interesting to see how an open source project, Mayavi, had been run successfully out of India over the years. Since there were three parallel tracks going on, I had to pick the sessions I wished to attend.

Some of the sessions had a lot of presentation issues. The talk on zc.buildout was quite boring: the presenter spent so much time just discussing what isolation and repeatability are that there wasn't much time left for an actual demonstration. Similarly, the session on Django turned out to be quite boring. There were three presenters taking turns in the talk. They spent a lot of time preaching about Django: what it is, why it is great, and so on. They also spent a lot of time typing the code they wished to demonstrate. I guess one should be very careful with the 45 minutes available in a conference slot and make the best possible use of it. One should avoid time wasters like typing code and fixing bugs during the talk itself. Another boring session was on the Semantic Web. The presenter was explaining a whole lot of difficult concepts. It seemed like a typical IIT lecture, the kind students prefer to sleep through or, if possible, bunk.

Another interesting talk was 'Jython at Strand LS' by Janaki, where the speaker explained how they use Jython at their company. The examples were short and sweet, the illustrations were easy to understand, and the speaker had a great sense of humour. I was impressed to see the extensive application of Jython in a production quality platform like AVADIS (TM) at Strand.


Finally, my own talk on using ZPT and RML for generating PDFs in Python applications came up. I didn't expect too many people to attend, but I was happy to see 30+ people still sitting in for my session. Some of them left during the session, which suggests they didn't really enjoy the content much; others stayed through, and I guess they liked what I presented. I will leave it to others to comment on how good the session was.

There were slots for lightning talks, but I guess very few speakers were actually prepared to give them. Lots of talks were registered, but ultimately many of them were not delivered.


Some of the things I feel should be improved for future conferences:

  • Expected audience of a talk should be more clearly explained
  • Focus should be more on specific coding workshops rather than general talks
  • Longer workshops on specific topics should be conducted if there are speakers available
  • There was no peer review of presentation material. This should be fixed.
  • Work on presentation materials should finish at least 2-3 months before the conference so that peer reviews and revisions can be completed about 2 weeks before the actual conference.
  • Scheduling of talks should happen through a more transparent process
  • Registered delegates should receive notifications about talks (when a talk is announced, when presentation materials are uploaded, etc.) over email as the process unfolds before the actual conference
  • Talks without presentation material should be dropped from the final schedule.
  • Speakers should avoid spending too much time on abstract concepts and spend more on illustrations. They should also avoid typing code during the session.
  • Multi-speaker talks should be minimized