I ran across this page in my aggregator before lunch, and flagged it for followup. That time has come - it must be testing day on this blog. Here's what jumped at me:
I give up on libxml for the time being, and think instead of Chris Petrilli’s comment that ruby (and python) performance is “not quite in the league of Smalltalk (or Lisp, likely), which have extremely mature VMs with on-the-fly compilation and optimization”. Is Smalltalk then much faster than python or ruby, or comparable with C, for the task of parsing moderately large XML files?
No. Time to load and parse my iTunes library file, an 11mb Apple plist, on a 1 GHz G4 Powerbook with VisualWorks Non-Commercial 7.3.1: about three minutes.
That didn't seem right - I use the XML code in VW extensively, so I'm pretty familiar with it. I grabbed my iTunes file (only 2.7 MB) and parsed that - took 5.5 seconds. Well, the two caveats are, that's a smaller file, and my hardware isn't his hardware. With that in mind, I went ahead and created a large XML file. I grabbed the default feed file for BottomFeeder, and saved it as an XML feed list instead of as a binary dump - like this:
file := Tools.XMLConfigFileSupport.XMLConfigFile
filename: 'g:\vw74\image\feeds.xml'.
file saveObject: RSSFeedManager default subscribedFeedsFolder.
file saveConfiguration
That just dumps the 80 sample feeds into a (pretty verbose) XML format - I ended up with a 13 MB file. That seemed large enough, so I tried the parse on that:
content := 'feeds.xml' asFilename contentsOfEntireFile.
parser := XMLParser new.
parser validate: false.
Time millisecondsToRun: [parser parse: content readStream]
That last line times the execution - it ran in 17.9 seconds. Not a couple of seconds, but not 3 minutes, either. There was some GC going on during that, so I'm sure that things could be improved by simply configuring VW with a larger bite of old space up front - in dealing with large amounts of data, a fair bit of time is going to be chewed up either in allocating more memory, or GC'ng if we hit the current limits (as per the memory policy in place).
For this kind of parse to take 3 minutes, either the hardware would have to be very slow, or memory limits would have to be set badly for dealing with larger files. I'm not entirely sure what was going on.