Ruby Buzz Forum - On the sad state of markdown processors, and getting thousandfold speed-ups.


Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.

On the sad state of markdown processors, and getting thousandfold speed-ups.

Posted: Apr 7, 2009 5:36 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: On the sad state of markdown processors, and getting thousandfold speed-ups.
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

When I started to write the code for the latest incarnation of eigenclass, I was planning to use an existing Markdown processor to generate the HTML for the posts and comments dynamically. It would take at most a couple of lines to pipe the markup to a process and read back the HTML. I took the first Markdown implementation that came to mind, Bluecloth (written in Ruby), and ran it against a few documents. I was most underwhelmed by its speed: it was so slow that it would need one or two seconds, or more, to process some of the entries I've written since.

I benchmarked other common implementations, the original markdown (in Perl) and python-markdown, and realized that they were only marginally better. At the risk of being perceived as performance-obsessed, here's the observed performance when processing markdown's README (README.n is README concatenated n times) on a 3 GHz AMD64 box (much faster than the old server running this site):

implementation	language	LoC (approx.)	README.1 time	README.8 time	README.32 time	README.32 memory
Bluecloth	Ruby	1100	0.130s	2.16s	30s	31MB
markdown	Perl	1400	0.068s	0.66s	segfault	segfault
python-markdown	Python	1900	0.090s	0.35s	2.06s	23MB
Pandoc	Haskell	900 + 450	0.068s	0.55s	2.7s	25MB

Compare that to the rather acceptable performance of my own Simple_markup module in OCaml, and of discount, a C implementation I found after I had already written mine:

implementation	language	LoC (approx.)	README.8 time	README.32 time	README.32 memory
Simple_markup	OCaml	313 + 55	12ms	43ms	3.5MB
discount	C	~4500	16ms	63ms	2.8MB

(The LoC counts for Simple_markup and Pandoc are split into parsing and HTML generation.)
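For readers who want to run a similar comparison, here is a minimal Python sketch of the kind of harness that could produce the README.n inputs and the timings. It is not the script used for the numbers above. The names `concatenated`, `time_filter` and `STAND_IN_CMD` are hypothetical, and `STAND_IN_CMD` is only a placeholder that copies stdin to stdout; a real run would substitute the command line of the Markdown processor under test. Memory usage is not measured here.

```python
import subprocess
import sys
import time

# Placeholder "processor": copies stdin to stdout. Replace it with the
# command line of the Markdown implementation you actually want to time.
STAND_IN_CMD = [sys.executable, "-c",
                "import sys; sys.stdout.write(sys.stdin.read())"]


def concatenated(text, n):
    """Return `text` repeated n times, i.e. the README.n input."""
    return text * n


def time_filter(argv, markup):
    """Pipe `markup` to the command `argv` on stdin, discard its output,
    and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, input=markup.encode("utf-8"),
                   stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start
```

Calling `time_filter(cmd, concatenated(readme, n))` for n = 1, 8 and 32 gives one row of timings. Note that the result includes process start-up, which dominates for small inputs.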

I made no attempt to optimize Simple_markup beyond replacing a single O(n^2) call to String.nsplit with an O(n) Str.split one in order to split the input string into lines. I'm not compiling with -unsafe or -nodynlink either.
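To illustrate how something as innocuous as splitting into lines can turn quadratic, here is a small Python sketch. It does not reproduce the internals of String.nsplit or Str.split; both functions below are hypothetical and only show the general pattern. The first copies the whole remaining input after every line, so the total work grows with the square of the input size when lines are short. The second scans the input once.

```python
def split_lines_quadratic(text):
    """Split on newlines by repeatedly re-slicing the remainder.
    Each `rest[i + 1:]` copies everything that is left, so the total
    cost is O(n^2) for an input made of many short lines."""
    lines = []
    rest = text
    while True:
        i = rest.find("\n")
        if i < 0:
            lines.append(rest)
            return lines
        lines.append(rest[:i])
        rest = rest[i + 1:]  # full copy of the remaining input


def split_lines_linear(text):
    """Split on newlines in a single left-to-right scan: O(n) overall,
    since only the line contents themselves are copied."""
    lines = []
    start = 0
    while True:
        i = text.find("\n", start)
        if i < 0:
            lines.append(text[start:])
            return lines
        lines.append(text[start:i])
        start = i + 1
```

Both return the same list as `text.split("\n")`; only the amount of copying differs.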

To add insult to injury, Bluecloth, markdown and python-markdown are ugly hacks that boil down to iterated regexp-based gsubs. I can see why they have a long history of bugs: it is easy for a gsub pass to interfere accidentally with another, and such regexp-based transformations are full of corner cases.
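As a toy illustration of one pass interfering with another (written in Python for brevity, and not taken from any of the implementations named above), consider two hypothetical substitution passes, one for code spans and one for emphasis. The emphasis pass has no idea that the asterisks it sees now sit inside a `<code>` element:

```python
import re


def code_pass(text):
    """Turn `...` into <code>...</code>."""
    return re.sub(r"`([^`]+)`", r"<code>\1</code>", text)


def emphasis_pass(text):
    """Turn *...* into <em>...</em>."""
    return re.sub(r"\*([^*]+)\*", r"<em>\1</em>", text)


def render_with_gsubs(text):
    """Apply the passes one after another, as a gsub pipeline would."""
    return emphasis_pass(code_pass(text))


print(render_with_gsubs("Use `a*b*c` here"))
# prints: Use <code>a<em>b</em>c</code> here
```

Swapping the order of the passes does not help: the emphasis markup still ends up inside the code span. Real gsub-based processors work around this with tricks such as temporarily hiding already-processed spans, which is where many of the corner cases come from.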

A much cleaner approach is to parse the markup into a parse tree, and then generate the (X)HTML in a separate pass. This is what Pandoc, discount and Simple_markup do.
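The two-pass idea can be sketched in a few lines of Python. This is a deliberately tiny, hypothetical example that handles only inline code spans and single-asterisk emphasis. It is not how Pandoc, discount or Simple_markup are written, and the node types `Text`, `Code` and `Emph` are invented for the illustration. The point is that once a code span has become a `Code` node, nothing later can reinterpret its contents.

```python
import html
from dataclasses import dataclass


@dataclass
class Text:
    value: str


@dataclass
class Code:
    value: str


@dataclass
class Emph:
    children: list


def parse_inline(src):
    """Pass 1: parse a string into a list of Text / Code / Emph nodes."""
    nodes = []
    buf = []  # pending plain-text characters

    def flush():
        if buf:
            nodes.append(Text("".join(buf)))
            buf.clear()

    i = 0
    while i < len(src):
        ch = src[i]
        if ch in "`*":
            close = src.find(ch, i + 1)
            if close > i + 1:  # found a non-empty delimited span
                flush()
                inner = src[i + 1:close]
                # Code content stays literal; emphasis content is parsed again.
                nodes.append(Code(inner) if ch == "`" else Emph(parse_inline(inner)))
                i = close + 1
                continue
        buf.append(ch)  # ordinary character or unmatched delimiter
        i += 1
    flush()
    return nodes


def render_nodes(nodes):
    """Pass 2: generate HTML from the tree, escaping text exactly once."""
    out = []
    for node in nodes:
        if isinstance(node, Text):
            out.append(html.escape(node.value))
        elif isinstance(node, Code):
            out.append("<code>" + html.escape(node.value) + "</code>")
        else:
            out.append("<em>" + render_nodes(node.children) + "</em>")
    return "".join(out)


print(render_nodes(parse_inline("Use `a*b*c` and *this*")))
# prints: Use <code>a*b*c</code> and <em>this</em>
```

Because escaping and tag generation live entirely in the second pass, the parser never sees HTML and the generator never sees markup, so the two cannot trip over each other.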

