This little program might be interesting if you simply want to learn a bit about JavaScript and/or regular expressions, or if you could use a tokenizer for language parsing.
I wrote it because I am developing a prototype of a multi-pass compiler based purely on XSLT.
For the uninitiated, XSLT stands for eXtensible Stylesheet Language Transformations. It is a Turing-complete programming language, and a surprisingly powerful one at that.
XSLT 1.0 (the version supported by most browsers) does not support regular expressions, though the working draft of XSLT 2.0 does. So for the purposes of the prototype I am using JavaScript for the tokenizer.
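To make the idea of a regular-expression tokenizer concrete, here is a minimal sketch in JavaScript. It is not the program described in this post: the token types, the rule table `TOKEN_RULES`, and the `<tokens>`/`<token>` element names are all assumptions chosen for illustration. It shows one common approach, which is to try a list of sticky (`y` flag) regular expressions at the current offset and emit the result as XML so that an XSLT stage could consume it.

```javascript
// Hypothetical token rules for a tiny expression language (illustrative only).
// The sticky flag "y" makes each pattern match only at pattern.lastIndex.
const TOKEN_RULES = [
  { type: "whitespace", pattern: /\s+/y },
  { type: "number",     pattern: /\d+(?:\.\d+)?/y },
  { type: "identifier", pattern: /[A-Za-z_]\w*/y },
  { type: "operator",   pattern: /[+\-*\/=<>]/y },
  { type: "punct",      pattern: /[(){};,]/y },
];

// Split source text into a flat list of { type, text } tokens.
function tokenize(source) {
  const tokens = [];
  let pos = 0;
  while (pos < source.length) {
    let matched = false;
    for (const { type, pattern } of TOKEN_RULES) {
      pattern.lastIndex = pos;               // anchor the attempt at pos
      const m = pattern.exec(source);
      if (m) {
        if (type !== "whitespace") tokens.push({ type, text: m[0] });
        pos = pattern.lastIndex;             // advance past the match
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character '${source[pos]}' at offset ${pos}`);
    }
  }
  return tokens;
}

// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Serialize the token list as XML. The element and attribute names are
// assumptions; the type values come from the fixed rule table above.
function tokensToXml(tokens) {
  const items = tokens.map(
    (t) => `  <token type="${t.type}">${escapeXml(t.text)}</token>`
  );
  return ["<tokens>", ...items, "</tokens>"].join("\n");
}

console.log(tokensToXml(tokenize("x = 42 + y;")));
```

Rules are tried in order, so the first one that matches at the current offset wins. A real tokenizer would also need rules for strings, comments, and multi-character operators.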
Recently I have been blogging about the idea that a macro language is more effective as a pattern-matching language than as a procedural one. I have since discovered XSLT, and it is one heck of a powerful macro language.
Originally I was just looking at using XSLT to manipulate Abstract Syntax Trees (ASTs), when it dawned on me that not only could I rewrite an AST using XSLT, I could also generate an AST from a parse tree, and I could generate a parse tree from a token list. The only step missing was a tokenizer.
I know that some people will see this as a purely academic exercise, or more specifically that I am a monkey with a hammer. The thing is that, in theory, one XSLT document can generate a language-agnostic representation (ASTXML), and another XSLT document can then be used to generate source in virtually any language.
Thanks for pointing these out, Reno. Here is a naive question: why would the RELAX NG compiler, and similar tools, not simply use XSLT to generate Java code?
Don't forget to peek at the source. For those wondering "so what", I want you to think about what is involved in writing a compiler/translator in the traditional way. There are typically six main phases:
1) tokenizer (currently done by my JavaScript program, but I just learned I can in fact write one in XSLT 1.0, albeit an inefficient one)
2) token list -> parse tree
3) parse tree -> abstract syntax tree
4) abstract syntax tree -> optimizable format (this phase is optional)
5) optimizable format -> optimized format (this phase can be repeated multiple times)
6) optimized format -> some executable format or another source code
Now this usually represents a massive amount of work in an imperative language like C/C++/Java. However, using a declarative pattern-matching language like XSLT means that an entire compiler can be built from a handful of simple documents with very little work. Anyway, that's the theory.
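To illustrate what "declarative pattern matching" buys you in a phase such as 5), here is a small JavaScript analogy. It is not XSLT and not part of the prototype: the node shapes (`num`, `add`, `mul`, `var`) and the two rules are assumptions made up for this sketch. Each rule pairs a match condition with a rewrite, loosely the way an XSLT template pairs a match pattern with output, and a generic driver applies the rules bottom-up over the tree.

```javascript
// Hypothetical AST node constructors (illustrative shapes only).
const num = (value) => ({ type: "num", value, children: [] });
const ref = (name) => ({ type: "var", name, children: [] });
const add = (a, b) => ({ type: "add", children: [a, b] });
const mul = (a, b) => ({ type: "mul", children: [a, b] });

// Each rule is a (match, rewrite) pair, loosely analogous to an XSLT template.
const rules = [
  {
    // Constant folding: add(num, num) => num
    match: (n) => n.type === "add" && n.children.every((c) => c.type === "num"),
    rewrite: (n) => num(n.children[0].value + n.children[1].value),
  },
  {
    // Identity elimination: mul(x, 1) => x
    match: (n) =>
      n.type === "mul" &&
      n.children[1].type === "num" &&
      n.children[1].value === 1,
    rewrite: (n) => n.children[0],
  },
];

// Rewrite children first, then apply the first rule that matches this node.
// Nodes that match no rule are copied through unchanged.
function applyRules(node) {
  const rebuilt = { ...node, children: node.children.map((c) => applyRules(c)) };
  const rule = rules.find((r) => r.match(rebuilt));
  return rule ? rule.rewrite(rebuilt) : rebuilt;
}

// 1 + (2 + 3) folds to a single numeric literal.
const folded = applyRules(add(num(1), add(num(2), num(3))));
console.log(folded.type, folded.value);
```

Adding an optimization means adding a rule to the table, while the traversal code stays the same. That separation is the property the paragraph above attributes to XSLT.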
>why would the RELAX NG compiler, and similar tools, not simply use XSLT to generate Java code?
It's only a supposition: because we have to process a RELAX NG document (the grammar definition, in effect a 'BNF' for XML), it might be easier to do it in a language other than XSLT alone!
Anyway, your approach seems interesting. Are you sure that phase 3) "parse tree -> abstract syntax tree" can be handled easily?
> >why would the RELAX NG compiler, and similar tools, not
> simply use XSLT to generate Java code?
>
> It's only a supposition:
> Because we have to process a RELAX NG document (the
> grammar definition, well, a 'BNF' for xml). It might be
> easier to do it in another language than XSLT alone!
That might be it. XSLT is a bit of a bear to work with. I would much rather work in a procedural language than a declarative one. Perhaps I am simply not smart enough to design and understand declarative programs ;-)
> Anyway, your approach seems interesting. Are you sure the
> phase 3) "parse tree -> abstract syntax tree" can be
> handled easily ?
I think so. I believe the only hard part was turning a linear form like the tokens specification (which is effectively a flat list) into a hierarchical structure. The next challenge will be to isolate statements into grouped nodes, which should be quite easy.
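For readers wondering what "flat list into a hierarchical structure" involves, here is a minimal JavaScript sketch of one common technique: using a stack to nest tokens by their brackets. It is an assumption about the general approach, not the XSLT used in the prototype, and the node shape (`{ type: "group", open, children }`) is hypothetical.

```javascript
// Map each opening bracket to its closing partner.
const CLOSER_FOR = new Map([["(", ")"], ["{", "}"], ["[", "]"]]);
const CLOSERS = new Set(CLOSER_FOR.values());

// Turn a flat list of token strings into a tree of nested groups.
function nest(tokens) {
  const root = { type: "group", open: null, children: [] };
  const stack = [root];                       // innermost open group is last
  for (const tok of tokens) {
    const top = stack[stack.length - 1];
    if (CLOSER_FOR.has(tok)) {
      // Opening bracket: start a new group inside the current one.
      const group = { type: "group", open: tok, children: [] };
      top.children.push(group);
      stack.push(group);
    } else if (CLOSERS.has(tok)) {
      // Closing bracket: it must match the innermost open group.
      if (stack.length === 1 || CLOSER_FOR.get(top.open) !== tok) {
        throw new SyntaxError(`Unexpected '${tok}'`);
      }
      stack.pop();
    } else {
      top.children.push(tok);                 // ordinary token stays a leaf
    }
  }
  if (stack.length !== 1) {
    throw new SyntaxError(`Unclosed '${stack[stack.length - 1].open}'`);
  }
  return root;
}

const tree = nest(["f", "(", "a", "(", "b", ")", ")", ";"]);
console.log(JSON.stringify(tree, null, 2));
```

Grouping statements, the next step mentioned above, could then be a second pass over each group's `children` that splits on a separator such as `;`.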