This post originated from an RSS feed registered with Ruby Buzz
by Andrew Johnson.
Original Post: Oniguruma (Ruby with Demon wheels)
Feed Title: Simple things ...
Feed URL: http://www.siaris.net/index.cgi/index.rss
Feed Description: On programming, problem solving, and communication.
Oniguruma is a
regular expression C library you can use in your own projects under the BSD
license, or you can install it as Ruby’s regular expression engine
(in which case it falls under the Ruby license). Oniguruma may be
translated to English as Demon Wheel (or something along those lines).
Oniguruma is slated to become Ruby’s default regular expression
engine, and Ruby 1.9 already includes it. But you don’t have to
wait to try it out: it is easily incorporated into Ruby 1.8.x builds.
The process basically involves:
1. download and unpack the latest oniguruma sources for Ruby
2. configure oniguruma with your Ruby source directory
3. make oniguruma (which applies the patches to the Ruby sources)
4. rebuild and test your ruby (make clean; make; make test) in the Ruby directory
5. test oniguruma (make test) in the oniguruma directory
The only danger in doing this is forgetting that oniguruma is not yet
standard Ruby and shouldn’t be a dependency in released code. You
might want to build both a standard ruby and an oni-ruby (or perhaps
guru-ruby).
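If some code has to run on both builds, one option is to probe the regex engine at run time rather than assume a feature is present. Here is a minimal sketch, assuming a hypothetical helper named regexp_feature? (it is not part of Ruby or Oniguruma). It compiles the pattern from a string, so an unsupported construct shows up as a RegexpError at run time instead of stopping the file from parsing.

```ruby
# Hypothetical helper (not part of Ruby or Oniguruma): reports whether the
# running regex engine can compile the given pattern source.
def regexp_feature?(source)
  Regexp.new(source) # compile from a string so failure is a runtime error
  true
rescue RegexpError
  false
end

# Probe for two of the constructs discussed in this post.
LOOK_BEHIND_OK = regexp_feature?('(?<=a)b')
NAMED_GROUP_OK = regexp_feature?('(?<word>\w+)')

puts "look-behind supported:  #{LOOK_BEHIND_OK}"
puts "named groups supported: #{NAMED_GROUP_OK}"
```

A check like this only tells you that the syntax compiles; it says nothing about the behavioral differences between the engines.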
Oniguruma brings several features to Ruby’s regexen, notably:
- positive and negative look-behind
- possessive quantifiers (like atomic/independent subexpressions, but written as a quantifier)
- named backreferences
- callable backreferences
Look-behind and callable backreferences are probably the main reasons
you’d want to install oniguruma.
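The other two features are easy to see in a couple of lines. The following is only an illustrative sketch for an Oniguruma-enabled Ruby; the constant names and sample strings are made up for this example.

```ruby
# Possessive quantifier: a++ takes every "a" and never gives one back,
# so the trailing "a" in the pattern has nothing left to match.
GREEDY     = /a+a/
POSSESSIVE = /a++a/
ATOMIC     = /(?>a+)a/ # the equivalent atomic (independent) subexpression

p("aaa" =~ GREEDY)     # => 0
p("aaa" =~ POSSESSIVE) # => nil
p("aaa" =~ ATOMIC)     # => nil

# Named backreference: \k<quote> must match whatever <quote> captured,
# so the closing quote has to be the same character as the opening one.
QUOTED = /(?<quote>['"]).*?\k<quote>/
SAMPLE = %q{say "it's" now}

puts SAMPLE[QUOTED] # prints "it's" including the double quotes
```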
Look-Behinds
Look-ahead assertions have been around for some time, in many regular
expression flavors. Look-behind assertions are less prevalent. Oniguruma
brings positive and negative look-behind assertions
((?<=…) and (?<!…) respectively) to
Ruby. Just like look-ahead assertions, these are zero-width assertions
— they match the current position if the assertion about what follows
(look-aheads) or precedes (look-behinds) is true. They do not consume any
part of the string.
Unlike look-ahead assertions, look-behinds must contain fixed-width
patterns, which means no variable-length quantifiers such as * or +.
However, alternation is allowed at the top level of the look-behind, and
the alternatives need not all be the same width. Capturing is allowed
within positive look-behinds, but not in negative look-behinds (which
makes sense, since a negative assertion succeeds only when its pattern
fails to match).
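A short sketch of these rules on an Oniguruma-enabled Ruby follows. The sample string and constant names are invented for illustration.

```ruby
PRICES = "USD 15, $42 and 17 euros"

# Positive look-behind: digits preceded by "$". The "$" itself is not
# part of the match because the assertion is zero-width.
AFTER_DOLLAR = /(?<=\$)\d+/

# Top-level alternatives may differ in width: "USD " is four characters
# and "$" is one.
AFTER_CURRENCY = /(?<=USD |\$)\d+/

# Negative look-behind: digits preceded by neither "$" nor another digit
# (the digit check stops the match from starting in the middle of "42").
NOT_AFTER_DOLLAR = /(?<![\$\d])\d+/

p PRICES.scan(AFTER_DOLLAR)     # => ["42"]
p PRICES.scan(AFTER_CURRENCY)   # => ["15", "42"]
p PRICES.scan(NOT_AFTER_DOLLAR) # => ["15", "17"]

# Only the digits are replaced, since the "$" was never consumed.
puts PRICES.sub(AFTER_DOLLAR, "N") # prints: USD 15, $N and 17 euros
```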
Callable Backreferences
Callable backreferences give us recursively defined regular expressions,
which allow one to match/extract arbitrarily nested balanced parentheses
(or other delimiters).
# to match a group of nested unescaped parentheses:
re = %r/((?<pg>\((?:\\[()]|[^()]|\g<pg>)*\)))/
s = 'some(stri\)\((()x)(((c)d)e)\))ng'
mt = s.match re
puts mt[1]
# ==> (stri\)\((()x)(((c)d)e)\))
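The same idea carries over to other delimiters. As a rough sketch (the helper name bracket_groups and the sample strings are hypothetical, and escaped brackets are ignored to keep it short), this collects every outermost balanced square-bracket group in a string:

```ruby
# <grp> matches "[", then any mix of non-bracket characters and nested
# <grp> groups (the recursive call \g<grp>), then the closing "]".
BRACKETS = /(?<grp>\[(?:[^\[\]]|\g<grp>)*\])/

# Hypothetical helper: returns each outermost balanced group as a string.
# scan yields one array of captures per match, hence the flatten.
def bracket_groups(text)
  text.scan(BRACKETS).flatten
end

p bracket_groups("a[b[c]]d[e]") # => ["[b[c]]", "[e]"]

# The outer "[" is never closed here, so only the inner group is balanced.
p bracket_groups("a[b[c]d")     # => ["[c]"]
```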
Difference between Oniguruma and Standard Ruby Regular Expressions
The main behavioral difference I’ve noted between the two regular
expression engines involves capturing with zero-length subexpression
matches. In the following, sruby is standard Ruby, and
oruby is Ruby compiled with Oniguruma:
In my mind, with nested capturing such as this I would expect that the
contents of $2 and $3 would be substrings (even if only
empty strings) of $1, as Perl handles it. However, Ruby
isn’t alone here: Python and PCRE both handle it as Ruby does.
If this behavior doesn’t seem strange, consider this more obvious
example:
Here, Oniguruma sides with Perl instead of Ruby, and all the captured
subexpressions are the empty string. However, pcre agrees with standard
Ruby on this one, and Python won’t even compile the regular
expression.
Versions used in testing:
sruby => Ruby 1.8.4 (2006-01-21)
oruby => Ruby 1.8.4 (2006-01-21) with Oniguruma 2.5.2
Perl 5.8.7
Python 2.4
pcre 6.3