Parsing - oh the humanity.
I got pretty frustrated working on this project for one of my clients.
Not with them but with the code I wrote.
They have some commercial analysis code that generates a classification
model and produces a SAS script as output. The script is used to make
predictions based on the model.
They want to use this model on the server, which doesn't have SAS but
does have Python (the server is written in Python :). Looking at it,
the generated script is similar in style to C. There are a lot of
gotos, a somewhat strange call syntax which reminds me of GOSUB, and a
couple of other minor differences, but nothing exotic.
I figured it should be relatively simple to translate into a C
extension module for Python. It took me four attempts. But did I
learn anything new? Here are the attempts.
Attempt 1: I've done a huge amount of parsing. It's the bread
and butter of most of my jobs and contracts. (Sad, but true. One of
the reasons I think people want XML is because after a decade of
parsing yet another file format it starts to get boring.) Most of the
time I use a bunch of regexps, with plenty of sanity checks to make
sure the file formats don't stray too much from my expectations. In
fact, usually I'm very stringent in the testing. This means my code
breaks loudly when the format changes from underneath me, as compared
to breaking silently.
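To make that concrete, here's the flavor of what I mean; the record
format below is made up, but the pattern of a strict match plus a loud
failure is the point:
import re

# Hypothetical record format, just to show the style: match strictly
# and blow up the moment the input drifts from expectations.
_record_pat = re.compile(r"ID\s+(\d+)\s+NAME\s+(\w+)$")

def parse_record(line):
    m = _record_pat.match(line)
    if m is None:
        raise ValueError("unexpected record line: %r" % (line,))
    return int(m.group(1)), m.group(2)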
This case is different. Most of the data files I read aren't designed
for automatic context-free parser generators, so trying to get the
normal lexer/parser combo working with them is hard. There's lots of
state to communicate between the two. But in this case I was seduced
by the data being in the SAS language, which is designed for parsing.
I got a copy of SPARK
(I know there are newer projects, but I've used SPARK before so I know
what to expect), wrote a Scanner and a Parser, and after a few hours got
the SAS code parsed. I spent another hour or so adding support so errors
are reported as "line 34, position 56" instead of as a byte position. With
only a bit more work I got a code generator to traverse the parse tree
and emit C.
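For those who haven't seen it, SPARK drives both the scanner and the
parser from method docstrings. The following is a rough sketch from
memory (Python 2 era, like SPARK itself), with a toy grammar rather
than my actual SAS one:
import spark

class Token:
    def __init__(self, type, value=None):
        self.type = type
        self.value = value
    # SPARK matches tokens against grammar symbols by comparison
    def __cmp__(self, other):
        return cmp(self.type, other)

class SASScanner(spark.GenericScanner):
    def tokenize(self, input):
        self.tokens = []
        spark.GenericScanner.tokenize(self, input)
        return self.tokens
    def t_whitespace(self, s):
        r" \s+ "
        pass
    def t_name(self, s):
        r" [A-Za-z_][A-Za-z0-9_]* "
        self.tokens.append(Token("name", s))

class SASParser(spark.GenericParser):
    def __init__(self, start="statements"):
        spark.GenericParser.__init__(self, start)
    def p_statements(self, args):
        """ statements ::= statements statement
            statements ::= statement """
    def p_statement(self, args):
        """ statement ::= name """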
I didn't like the C output. The original SAS code contains comments
which provide information to the reader about what was going on. For
example, the top of the file had comments about the source of the data
used to generate the model and the input parameters, and bits of the
code itself were also commented. I think the people who wrote the
code which generated the SAS output did a good job of making the
result look nice.
By comparison, my C code was flat and unreadable. I lost the comments
and the spaces between sections. I tried a workaround. It turns out
that the comments were only used where a SAS statement could occur so
I told the tokenizer to return a "comment" token and added that as a
valid statement. I had to do the same with blank lines.
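In SPARK terms the hack amounted to a couple of extra productions,
roughly like this (a sketch continuing the toy grammar from above,
not my real rules):
class CommentTolerantParser(SASParser):
    # The hack: accept a comment or a blank line anywhere a
    # statement may occur, so they survive into the parse tree.
    def p_statement_comment(self, args):
        """ statement ::= comment
            statement ::= blank_line """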
The result was still ugly. Sometimes the comments are on the same
line and to the right of the original code. My hack didn't allow
that; those comments ended up below the corresponding C code.
Attempt 2: So I tried another tack. I had the scanner capture
the line numbers of each token and comment and return the token list
and comment list as independent objects. I figured when I generated
the C output from the tokens I could also tell it to add the comments
which corresponded to that line number. I worked on this for a while,
and fixed a BNF rule which was right-recursive instead of
left-recursive (or vice versa), which really sped up the parsing.
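For the record, the two shapes of the rule look like this; if I
remember the Earley algorithm (which SPARK uses) correctly, the
left-recursive form is the one it handles in linear time:
# Right-recursive: builds a deep chain of incomplete items in an
# Earley parser, so long statement lists get slow.
#     statements ::= statement statements
#
# Left-recursive: each new statement extends the list in place.
#     statements ::= statements statement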
The resulting code just felt like ooze. Part of the complication was
support for code rearrangement. The SAS code used forward references
for its "link" statements (those GOSUBs), which I didn't want in the C
code because I think code is more graceful without prototypes. It's
different for code meant to be used as a library. In this case though
everything is in a single file and the only code exposed to the
outside world is through Python's initmodule function.
Once I got it working I needed to make it deal with a bit of syntax
rewriting. The SAS code uses a test against ".z" to tell if a
variable is defined or not. if VAR > .z then ... means "if
VAR is defined". My Python code supports the same idea by testing
against None, so the equivalent would be if VAR is not None.
C doesn't support the concept of "undefined value". One way to fake
it is to have a special bit pattern to indicate undefined, but I
didn't want to do that.
(Cap'n Crunch taught the world about the problems of in-band
signalling.) I decided to have two
variables, "VAR_defined" is 1 when VAR is defined, 0 otherwise, and
"VAR_value" is the value of the Python variable when VAR is defined,
otherwise a special bit pattern (just to be on the safe side).
Only some of the variables may be undefined. It depends on what
parameters were used to make the model. For variables which are
allowed to be undefined I want the C/Python interface to set the
corresponding values, but for those which must be defined I want to
raise a TypeError if the Python value is None. To figure out which
variables are which I need to scan the parse tree (uggh, yes, it's
just the visitor pattern but still uggh). And to generate the right
code I need to special-case the BinaryOp node when the left side is
a variable and the right is UNDEF. Or I could do both in one step
with some tree rewriting to convert the BinaryOp subtree into a
special "IsDefined('VAR')" node. I know there are tree rewrite
languages but the only one I've tried was XSLT, which I found nasty,
brutish, and long.
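The rewrite itself isn't much code; here's a minimal sketch, with
hypothetical node classes standing in for my real parse tree:
class BinaryOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class Name:
    def __init__(self, name):
        self.name = name

class Undef:      # the '.z' literal
    pass

class IsDefined:  # the special node the rewrite introduces
    def __init__(self, name):
        self.name = name

def rewrite(node):
    # Depth-first walk; turn 'VAR > .z' subtrees into IsDefined('VAR')
    if isinstance(node, BinaryOp):
        node.left = rewrite(node.left)
        node.right = rewrite(node.right)
        if (node.op == ">" and isinstance(node.left, Name)
                and isinstance(node.right, Undef)):
            return IsDefined(node.left.name)
    return node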
Also, some of the variables may be 'continuous' while others are
'categorical', which are handled as Python floats and ints, and as C
doubles and longs. These are documented in the SAS code through
structured comments.
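Extracting that information is nearly a one-liner once you know the
comment layout; a hedged sketch, with a made-up layout since I can't
show the real one:
import re

# Made-up structured-comment format, e.g. /* AGE : continuous */
_var_kind_pat = re.compile(
    r"/\*\s*(\w+)\s*:\s*(continuous|categorical)\s*\*/")

def variable_kinds(sas_text):
    # maps variable name -> 'continuous' (float/double) or
    # 'categorical' (int/long)
    return dict(_var_kind_pat.findall(sas_text))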
After thinking about the amount of code I would need to add to support
tree rewriting I decided to chuck it and do ...
Attempt 3: Remember I started by saying that I've done a huge
amount of data parsing? I even wrote a parser generator called Martel,
which uses regular expressions to convert a flat file into SAX 2
events for XML.
The idea of Martel is that formats which aren't designed to be parsed
by a scanner/parser combination are almost all regular grammars --
highly stateful, in that the same text likely means different things
in different parts of the file -- but regular.
What if I used the same approach here? I started setting up Martel
grammars: one for the structured comments in the header, another for
the main driver code, and one for the GOSUB units. I also wrote an
adapter for the effbot's elementtree to
turn those SAX events into a Python data structure which also supports
a subset of XPath.
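The adapter is short; a minimal sketch, assuming the effbot
elementtree package and glossing over the Martel-specific glue (the
example element names at the end are made up):
import xml.sax.handler
from elementtree import ElementTree

class TreeBuilderHandler(xml.sax.handler.ContentHandler):
    # Feed SAX events into elementtree's TreeBuilder to get an
    # Element tree, whose findall() supports a subset of XPath.
    def __init__(self):
        self.builder = ElementTree.TreeBuilder()
        self.root = None
    def startElement(self, name, attrs):
        self.builder.start(name, dict(attrs))
    def endElement(self, name):
        self.root = self.builder.end(name)
    def characters(self, text):
        self.builder.data(text)

# afterwards, e.g.:  handler.root.findall("block/statement")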
Ahh, much nicer. The biggest initial problem (as usual) was getting
the right regexp definitions. One of Martel's failings is that it
doesn't do a good job of error reporting. What I would like is some
GUI where I have my Martel definition in one pane, target text in
another, and something to show where the parsing failed and why.
ActiveState has something like that based on Perl regexps, but those
aren't as powerful (for my needs) as Martel; I want to see the parsed
XML tree as well.
The next problem was what to do with the code blocks once I got them.
While SAS is a context-free language, the code I'm dealing with has at
most two levels of nesting, caused by an if-statement. A regexp
language definition could handle that easily, but the result would
still require manipulating a tree.
The last problem is my clients don't have Martel installed on their
system. I would rather keep everything limited to a single file, an
executable Python program. Leading us to ...
Attempt 4: This time I treated the whole program as one string.
The code is machine generated and very consistent and well-structured
so I can assume there won't be variations in the line structure. This
means I can use a few regexps (with 'findall') to extract data from
the structured comments and to find out which variables are allowed to
be undefined.
Even better, it turns out that each line can be uniquely categorized
solely by looking at the line and the category test can be a regular
expression. And even better, the conversion to C code can almost
always be done as an re.sub! So the following works:
import re

_varname = r"[A-Za-z_][A-Za-z0-9_]*"

_conversions = (
    (re.compile(r"\s*$"), None),
    (re.compile(r"\s*/\*.*\*/$"), None),
    (re.compile(r"tnscore = 0\.0;"), None),
    (re.compile(r"TN(\d+):$"), r"=FUNCTION=TN\1="),
    # Leave all the labels
    (re.compile(r"[A-Za-z_][A-Za-z0-9_]*:$"), None),
    # The (\s+) is used to keep the indentation the same.
    # The test against '.z' is a check for undefined values
    (re.compile(r"(\s+)if (%s) > \.z then (goto %s);" % (_varname, _varname)),
     r"\1if (\2_defined) {\3;}"),
    # ... more patterns ...
    )

def convert_text(s, conversions):
    lines = s.split("\n")
    for lineno in range(len(lines)):
        line = lines[lineno]
        for pat, conv in conversions:
            if pat.match(line):
                if conv is not None:
                    lines[lineno] = pat.sub(conv, line)
                break
        else:
            raise TypeError("What do I do with %r?" % (line,))
    return "\n".join(lines)
Granted, it's O(number of lines * number of patterns) but there are
under 20 patterns. And it requires that the input be in the right
format, but I'm guaranteed that that's the case. I do have some sanity
tests to catch and report the major errors, like passing in totally
wrong data, and of course I do verify that every line matches one of
the expected patterns.
In the above I place a special marker, '=FUNCTION=TN\1=', in the
converted code. This marks where the function definition should go,
which is handled later in the processing. Oh, and notice that my
definitions stop at the ';' (end of statement delimiter in SAS) and
don't go to the end of the line. The code I'm dealing with has at
most one statement per line, so this ignores any comments which might
be on the line. Both SAS and C use /* this comment style */, so that
was a nice way to keep the comments unchanged. Though were I feeling
more stringent, I would also require that the comments fit in the
regexp for the line.
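That later pass is just another substitution; a sketch, with an
invented C signature since the real one depends on the model's
variables:
import re

_func_marker = re.compile(r"=FUNCTION=(TN\d+)=")

def expand_function_markers(c_text):
    # each marker becomes the opening of a C function definition
    return _func_marker.sub(r"static void \1(void) {", c_text)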
Some 700 more lines later, plus 430 lines of regression tests, and I'm
done. This compares quite favorably to attempt 2, where my incomplete
solution (it didn't have the checks for undefined values, nor support
for categorical variables, nor regression tests, nor a command-line
driver) was already at 930 lines.
Okay, so what's the take-home lesson, and why are you the reader still
here after about 1,700 words of essay? I still haven't figured it out
myself. Why did I choose the context-free approach? Because the
source data is in a context-free language, and you're supposed to use
parser generators to read those. And because SPARK and related tools
are cool, so perhaps I was distracted by their power.
I wrote Martel in part because presentation information is important.
Most parsers keep only the semantic data, tossing syntax. One of my
driving use cases was to mark up a file for HTML, leaving the original
format untouched, adding only hyperlinks and color. But here I wanted
to convert the input while keeping the syntactic information about
where comments and extra newlines sit; in this SAS conversion, that
non-semantic information needs to stay intact.
Maybe there's some standard way to do this for context-free systems
that I just haven't learned about? It seems like it could work if
everything (token, comment, newline) had a location, and the
translation system looked at an element in the token stream, got the
location, and said "emit text at this location". But no, that doesn't
work because of the code rearrangement, unless there were better
system support for retargeting locations.
One lesson is that if you're dealing with machine generated code then
take advantage of the guarantee that the code won't be that varied.
Part of the reason the regexp approach works is because there are
fewer than 20 line types.
*sigh* I don't think there is a good lesson here. Just another week
in the school of experience.