When I started writing the code for the latest incarnation of eigenclass, I planned to use an existing Markdown processor to generate the HTML for posts and comments dynamically. It would take a couple of lines at most to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby), and ran it against a few documents. I was underwhelmed by its speed: it needed one to two seconds or more to process some of the entries I've written since. I benchmarked other common implementations, the original markdown (in Perl) and python-markdown, and found that they were only marginally better. At the risk of being perceived as performance-obsessed, here's the observed performance when processing markdown's README (README.n is README concatenated n times) on a 3 GHz AMD64 box (much faster than the old server running this site):
| implementation | language | LoCs (approx.) | README.1 time | README.8 time | README.32 time | README.32 MEM |
|---|---|---|---|---|---|---|
| Bluecloth | Ruby | 1100 | 0.130s | 2.16s | 30s | 31MB |
| markdown | Perl | 1400 | 0.068s | 0.66s | segfault | segfault |
| python-markdown | Python | 1900 | 0.090s | 0.35s | 2.06s | 23MB |
| Pandoc | Haskell | 900 + 450 | 0.068s | 0.55s | 2.7s | 25MB |
Compare to the rather acceptable performance of my own Simple_markup module in OCaml, and of discount, a C implementation I found when I had already written mine:
| implementation | language | LoCs (approx.) | README.8 time | README.32 time | README.32 MEM |
|---|---|---|---|---|---|
| Simple_markup | OCaml | 313 + 55 | 12ms | 43ms | 3.5MB |
| discount | C | ~4500 | 16ms | 63ms | 2.8MB |
(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)
I made no attempt to optimize Simple_markup beyond replacing a single O(n^2) call to String.nsplit with an O(n) Str.split one to split the input string into lines. I'm not compiling with -unsafe or -nodynlink either.
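For illustration, splitting the input into lines with Str.split could look roughly like the sketch below. The name `lines_of_string` and the exact delimiter regexp are my assumptions here, not necessarily what Simple_markup uses, and the program has to be linked against the `str` library.

```ocaml
(* Needs the Str library, e.g.
   ocamlfind ocamlopt -package str -linkpkg lines.ml *)

(* Assumed delimiter: a single newline. The regexp is compiled once. *)
let newline = Str.regexp "\n"

(* Hypothetical helper name: split the whole input into its lines. *)
let lines_of_string (s : string) : string list = Str.split newline s

let () =
  List.iter print_endline
    (lines_of_string "# Title\nfirst line\nsecond line")
```

Note that Str.split ignores delimiters at the very beginning and end of the string; Str.split_delim is the variant that reports them as empty strings.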
To add insult to injury, Bluecloth, markdown and python-markdown are ugly hacks that boil down to iterated regexp-based gsubs. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.
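A toy example shows how two substitution passes can step on each other. The two regexps below are deliberately naive and are my own illustration, not code taken from Bluecloth, markdown or python-markdown: the first pass turns links into anchors, and the second pass, looking for `_emphasis_`, then rewrites the underscores inside the URL the first pass just emitted.

```ocaml
(* Needs the Str library, e.g.
   ocamlfind ocamlopt -package str -linkpkg gsub.ml *)

(* Pass 1 (illustrative): [text](url) becomes an <a> element. *)
let link_re = Str.regexp {|\[\([^]]+\)\](\([^)]+\))|}
let links s = Str.global_replace link_re {|<a href="\2">\1</a>|} s

(* Pass 2 (illustrative): _text_ becomes <em>text</em>. *)
let emph_re = Str.regexp {|_\([^_]+\)_|}
let emphasis s = Str.global_replace emph_re {|<em>\1</em>|} s

(* Running the passes one after the other, gsub-style. *)
let render s = emphasis (links s)

let () =
  (* The href comes out as http://example.com/a<em>b</em>c *)
  print_endline (render "see [the docs](http://example.com/a_b_c)")
```

Swapping the order of the passes does not help: the emphasis pass would then mangle the URL before the link pass ever sees it. Avoiding this within the same design means adding extra protection passes, each with corner cases of its own.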
A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.
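To make the two-pass idea concrete, here is a minimal sketch using only the standard library. The types and function names are hypothetical and far smaller than what Simple_markup actually handles (only `# ` headings, paragraphs and `*emphasis*`); the point is just that parsing produces a tree, and HTML generation is a separate function over that tree.

```ocaml
(* Hypothetical, much-reduced parse tree. *)
type inline =
  | Text of string
  | Emph of inline list

type block =
  | Heading of int * inline list
  | Paragraph of inline list

(* Pass 1: parsing. Odd-numbered segments between '*' are emphasized;
   a lone unmatched '*' is kept as literal text. *)
let parse_inline s =
  let rec go = function
    | [] -> []
    | [t] -> [Text t]
    | [t; rest] -> [Text (t ^ "*" ^ rest)]
    | t :: e :: tl -> Text t :: Emph [Text e] :: go tl
  in
  List.filter (fun i -> i <> Text "") (go (String.split_on_char '*' s))

let parse_block lines =
  match lines with
  | [l] when String.length l > 2 && String.sub l 0 2 = "# " ->
      Heading (1, parse_inline (String.sub l 2 (String.length l - 2)))
  | _ -> Paragraph (parse_inline (String.concat " " lines))

(* Blocks are runs of non-blank lines separated by blank lines. *)
let parse src =
  let flush cur acc =
    if cur = [] then acc else parse_block (List.rev cur) :: acc in
  let rec go cur acc = function
    | [] -> List.rev (flush cur acc)
    | l :: tl when String.trim l = "" -> go [] (flush cur acc) tl
    | l :: tl -> go (l :: cur) acc tl
  in
  go [] [] (String.split_on_char '\n' src)

(* Pass 2: HTML generation. Escaping happens here, on text nodes only,
   so it can never touch markup emitted by the generator itself. *)
let escape s =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | '&' -> Buffer.add_string b "&amp;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

let rec html_of_inline = function
  | Text s -> escape s
  | Emph l -> "<em>" ^ html_of_inlines l ^ "</em>"
and html_of_inlines l = String.concat "" (List.map html_of_inline l)

let html_of_block = function
  | Heading (n, l) -> Printf.sprintf "<h%d>%s</h%d>" n (html_of_inlines l) n
  | Paragraph l -> "<p>" ^ html_of_inlines l ^ "</p>"

let to_html doc = String.concat "\n" (List.map html_of_block doc)

let () = print_endline (to_html (parse "# Title\n\nsome *emphasis* & more"))
```

Because the generator only ever sees the tree, a change to how emphasis is parsed cannot corrupt the output of link or heading handling, which is exactly the kind of interference the gsub-based designs are prone to.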