This post originated from an RSS feed registered with Ruby Buzz
by Christian Neukirchen.
Original Post: Pulling from Streams
Feed Title: chris blogs: Ruby stuff
Feed URL: http://chneukirchen.org/blog/category/ruby.atom
Feed Description: a weblog by christian neukirchen - Ruby stuff
It is interesting that in the “real world”, most things go easier into
one direction than in the other. For example, it is easier to break a
coffee cup than to glue it together again. I think this has something
to do with entropy.
Computer programs are not unaffected by this fact, take for example
the two styles of on-the-fly XML
parsing, Push vs. Pull.
Translating a Pull API into a Pull API is done trivially, such code
usually looks like that:
while parser.pull
case parser.state
when :open
receiver.open parser.text
when :text
receiver.text parser.text
when :close
receiver.close
end
end
Doing the reverse—converting a Push API into a Pull one—looks
impossible at first, but that is not true. It can be done (albeit
only in limited ways in Ruby, see below) using our beloved
continuations.
I have written a small class, Stream2Pull that does this crazy task:
class Stream2Pull
attr_reader :data
def initialize(&block)
@data = nil
@continuation = lambda { instance_eval &block }
end
def method_missing(*arguments)
@data = arguments
@again = true
switch
end
def pull
@again = false
switch
@again
end
def switch
save = @continuation
callcc { |@continuation| save.call }
save = nil
GC.start
end
end
Stream2Pull takes a block it will call on itself on the first
execution of pull. This code is expected to send undefined messages
to the instance. Then the “magic” happens: With the call of switch,
the context of the program changes from method_missing to the call of
pull, which then returns. Now the caller can examine the undefined
message—the event—by looking at data. Let’s wrap REXML’s
StreamParser that way:
Now, we can use our new object just like a pull parser:
while s_to_p.pull
p s_to_p.data
end
We just repeatedly pull from the parser, thereby always switching
contexts between the inner loop of the StreamParser and ours.
These are the events that are successfully emitted:
We just have converted a Push API into a Pull API!
Now, we can talk about the dark side of the medal. Ruby’s
continuations are very slow and they eat loads of memory. When I
tried parsing a 1.5 megabyte file—the largest one I could find on my
disk, it is actually the XSL specification—I didn’t yet have the
GC.start inside switch. In about 2 minutes, my Ruby gobbled 400
megabytes of memory! After enforcing garbage-collection, it took less
memory but memory usage was still raising constantly. After 15
minutes and 100 megabytes used I gave up.
These are the things I really want to see in (I hope not too far away)
future Ruby 2 implementations: Efficient tail calls, continuation
passing style, and possibly support for partial
continuations. If you
implement CPS, you can get the others almost for free. Those features
would make above code run at about the speed of a “native” pull
parser. Alternatively, please port the Ruby standard library to
Scheme. }}:-) Just kidding.
If you just want to parse XML, it’s all not too bad, actually. This
is because REXML’s StreamParser *is* actually a wrapper around a
PullParser (REXML::BaseParser)! (Of course, I knew that in advance.)
Therefore, we can easily write:
bp = REXML::Parsers::BaseParser.new(ABOVE_XML)
while bp.has_next?
p bp.pull
end
And it will generate exactly the same output as above. :-) The main
difference is, well, it uses constant memory and only needs 15 seconds
to parse the big XML document. And that’s not too bad for a pure Ruby
implementation.