Thursday, October 15, 2009

Hope Lexical analyzer is working

I wrote quite a significant part of the lexical analyzer for Hope (http://kenai.com/projects/hope) (Haskell Oriented Python Environment). It is primarily based on the Parsec library. Currently the code is organized as follows. The Interpreter/src directory contains a folder PyLex. The PyLex package contains various files which implement the lexical analysis for Hope. The main module is PyLex.Lexer. You can start with the function pyLexer, which is a Parsec based parser that returns a list of tokens. Each token is a pair of (SourcePos, PyToken). The PyToken data type represents the various kinds of tokens supported by Hope's lexical analyzer. The tokens are defined in the PyLex.Tokens file. The parsers for these tokens are spread over multiple modules (Identifier, Number, Operator, String).
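
To give a feel for the interface, here is a rough sketch of the main types. Only pyLexer, PyUnsupported and the (SourcePos, PyToken) pairing are the real names from PyLex; the other constructors below are just illustrative placeholders, not the actual contents of PyLex.Tokens.

import Text.ParserCombinators.Parsec (Parser, SourcePos)

-- a token is the source position paired with the token value
type Token = (SourcePos, PyToken)

-- illustrative constructors only; the real PyToken type in PyLex.Tokens
-- covers keywords, delimiters and more
data PyToken = PyIdentifier String
             | PyInteger Integer
             | PyString String
             | PyOperator String
             | PyUnsupported Char
             deriving (Show, Eq)

-- entry point: a Parsec parser producing the full token list
pyLexer :: Parser [Token]
pyLexer = undefined  -- built from the parsers in Identifier, Number, Operator, String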

I faced quite a few issues in reaching this stage. I read about Parsec from a number of sources:
  • Parsec tutorial and reference guide
  • Real World Haskell (online book)
  • Write Yourself a Scheme in 48 Hours
It took me quite a while to get a good feel for how to use Parsec. I will try to write about some of the things I learnt. The import statements in the Parsec tutorial are outdated; the correct imports are:

import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as PT
import qualified Text.ParserCombinators.Parsec.Language as PL
import qualified Text.ParserCombinators.Parsec.Pos as PP

(I used qualified imports; you may not need them.)

The tutorial says that a two-stage approach, with lexical analysis followed by parsing, is the right way to go, and that the lexical analyzer should return a list of tokens along with their source positions. It took me quite a while to figure out how to extract the source position inside a character parser when required, and it turns out to be quite easy. Here is an example:

-- unsupported character inside a Python file
pyUnsupported :: Parser Token
pyUnsupported = (do pos <- getPosition
                    c   <- anyChar
                    return (pos, PyUnsupported c))
                <?> "unsupported character"


The important thing to notice is the pos <- getPosition call. This is how we can extract the source position.
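
To see how the two stages fit together, a minimal driver could look like the following. This is only a sketch (runLexer is not part of PyLex); it assumes the pyLexer and Token described above, and parse comes straight from Text.ParserCombinators.Parsec.

-- run the lexer over a complete source string,
-- yielding either a parse error or the token list
runLexer :: String -> Either ParseError [Token]
runLexer input = parse pyLexer "<interactive>" input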

Here is a small function I wrote to convert a list of digits in base N to an integer:

-- converts a list of digits to an integer for a given base
-- n is an accumulator which should be initialized with 0
-- example: convToDecimal [1,1,0,0] 2 0 == 12
convToDecimal :: [Int] -> Integer -> Integer -> Integer
convToDecimal []     base n = n
convToDecimal (x:xs) base n = convToDecimal xs base (n * base + fromIntegral x)
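
For instance, it slots nicely into a Parsec digit parser. The hex-literal parser below is only an illustration (the actual one in PyLex.Number may look different); it uses digitToInt from Data.Char together with the Parsec imports listed earlier.

import Data.Char (digitToInt)

-- parse a hex literal such as 0xFF (illustrative only)
pyHexLiteral :: Parser Integer
pyHexLiteral = do char '0'
                  oneOf "xX"
                  ds <- many1 hexDigit
                  return (convToDecimal (map digitToInt ds) 16 0)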


The syntax for strings in Python is really quite sophisticated. You have short strings and long strings (the """abc""" kind), and both can be delimited by single quotes or double quotes. Then there are raw strings, and finally Unicode strings and byte strings have different prefixes. Of course, a lot of escape sequences have to be handled specially inside the strings. My current implementation is not comprehensive, but it covers pretty much all normal Unicode strings. It doesn't yet cover raw strings and byte strings; I will cover them later, since first I have to get the interpreter working. I don't really know whether my parser is any good, as I don't have enough Haskell experience yet to judge, but I am happy I finally got something working.
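
Just to make the escape handling concrete, here is a much simplified sketch of a short-string parser. It is not the code in PyLex.String (which handles far more escapes plus the long, raw and byte variants), and the names are made up for the illustration.

-- simplified short string: one kind of quote, a few escapes, no prefixes
pyShortString :: Parser String
pyShortString = do quote <- oneOf "'\""
                   body  <- many (escapedChar <|> noneOf [quote, '\\', '\n'])
                   char quote
                   return body
  where
    escapedChar = do char '\\'
                     c <- anyChar
                     return $ case c of
                                'n' -> '\n'
                                't' -> '\t'
                                _   -> c  -- everything else passed through unchanged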


I am currently going through the Modern Compiler Design book. I hope to get the parser working and ASTs generated. Maybe then I will be able to translate them into Python byte codes and get an interpreter virtual machine working! It's a lot of work and I am not a computer scientist :)

7 comments:

Luca said...

why didn't you use alex/happy?

Shailesh Kumar said...

@Luca, I am working on Windows. I did try Happy once but wasn't able to set it up on Windows then. Moreover, I found documentation on Parsec much more extensive and interesting. Haven't got back to Happy since then.

Luca said...

I suggest to try BNFC: http://www.cse.chalmers.se/research/group/Language-technology/BNFC/

Shailesh Kumar said...

Thanx a lot for the pointer. I will look into it.

Luca said...

BNFC is a tool that generates alex and happy files and from them generates the .hs

Alternatively, there is a good parser of Python written in Haskell ready to go: just go here: http://hackage.haskell.org/package/language-python.

This way you have your abstract syntax tree and you can start from here to do type inference, code generation, optimization etc.

Shailesh Kumar said...

wow. I didn't know about this. This would really save a lot of time! Thanx a lot.