My earlier post (about Markdown processors and how simple top-down parsers can be up to two or three orders of magnitude faster than regexp-based gsub kludges) got unexpectedly popular. Some critics accused me of deliberately bastardizing Markdown's syntax for the sake of implementation convenience and speed. In order to prove that this is not the case, I reverted the changes relative to Markdown I introduced with ease of writing in mind and measured again. The exact changes are listed below. According to git, the resulting code is in fact shorter:
I think this, if anything, shows that the motivation behind my changes was not to simplify the implementation and make it shorter. As for the performance, the 330KB document I was using is processed in 50ms (vs. 63ms for discount, 2.7s for Pandoc, 2.1s for python-markdown or 30s for Bluecloth).
Transforming the AST into (statically) valid XHTML
As I said in my earlier post, I quickly recognized the advantages of going beyond mere "markup -> HTML" conversion and being able to manipulate the AST. New targets can be supported in a very compact way; for instance, (X)HTML generation takes only 40 lines of code. The code is clear and concise thanks to the use of pattern matching (it's also fast :). Moreover, I can use higher-order functions to allow for specific processing of "interesting" elements. For instance, I'm rendering pre-formatted blocks, links and images differently depending on whether the markup comes from a comment or was written by me (e.g. some extensions are allowed only in the latter case, and only the alt attribute is shown in the former).
The code below simply generates a typed HTML tree from the markup tree. I'm using Ocsigen's XHTML.M module, which employs OCaml's type system to ensure that the resulting HTML is valid. Invalid markup just doesn't compile, as you'd get a type error; e.g., if I try to express something invalid like <b><li>foo</li></b>, the type system will catch it and tell me the acceptable elements:
# b [li [pcdata "foo"]];;
-----------------
Error: This expression has type ([> `Li ] as 'a) elt = 'a XHTML.M.elt
but is here used with type
[< `A | `Abbr | `Acronym | `B | `Bdo | `Big | `Br | `Button
| `Cite | `Code | `Del | `Dfn | `Em | `I | `Img | `Input | `Ins
| `Kbd | `Label | `Map | `Noscript | `Object | `PCDATA | `Q
| `Samp | `Script | `Select | `Small | `Span | `Strong | `Sub
| `Sup | `Textarea | `Tt | `Var ] elt = 'b XHTML.M.elt
The second variant type does not allow tag(s) `Li
The compiler (or toplevel) not only tells me the HTML would not be valid (li not allowed inside b), but also lists the elements I can use there.
Here are the ~40 lines of code that generate the HTML on this site (the resulting tree is included in a larger, typed tree --- ensuring the end result is valid when the page is composed). There are no type annotations whatsoever, as OCaml is able to infer everything here (OTOH, I can get the type of any single expression with a couple keypresses in vim; you can also generate the interface with the type of all the functions with ocamlc -i):