This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Closing in on Unicode with Jcode
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
Patrick Hall has a great article on using the Jcode module for Ruby, which provides more natural support for hacking Unicode strings. He has a few simple unit tests that illustrate failings in the Jcode library and leaves them right there for us to glare at.
def test_reverse
  s = "Καλημέρα κόσμε!"
  srev = s.reverse
  assert_equal(s,srev) # fails
end
def test_index
  # String#index isn't Unicode-aware, it's counting bytes
  # there are ways around this, but...
  s = "Καλημέρα κόσμε!"
  assert_equal(0, s.index('Κ')) # passes
  assert_equal(1, s.index('α')) # fails!
  assert_equal(3, s.index('α')) # passes; 3rd byte!
end
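One of the "ways around this" can be sketched by counting characters instead of bytes. This is only a rough illustration, not anything Jcode ships: char_index and char_reverse are hypothetical names, the input is assumed to be UTF-8, and String#b and String#byteslice assume a much newer Ruby (2.0 or later) than the one the tests were written for. The byte-wise search via String#b stands in for what the old String#index does.

```ruby
# Hypothetical helpers, not part of Jcode.
# Assumes UTF-8 input and Ruby 2.0+ (String#b, String#byteslice).

# Character offset of needle in haystack, or nil when it is absent
def char_index(haystack, needle)
  # Byte-wise search, mimicking a String#index that counts bytes
  byte_offset = haystack.b.index(needle.b)
  return nil if byte_offset.nil?
  # The number of characters before the match is the character offset
  haystack.byteslice(0, byte_offset).scan(/./mu).length
end

# Reverse character by character instead of byte by byte
def char_reverse(str)
  str.scan(/./mu).reverse.join
end

s = "Καλημέρα κόσμε!"
byte_offset = s.b.index("α".b)    # 2, since Κ occupies bytes 0 and 1
char_offset = char_index(s, "α")  # 1, the answer the failing test wanted
reversed    = char_reverse(s)     # whole characters, last one first
```

The m flag on the regexp matters: without it, . skips newlines and they would silently vanish from the count and from the reversed string.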
Sure, we’ll have all the answers in the future, but, for now, I’d say some patches to Jcode are in order. Or, to spirit up some Python mimicry:
class UString < String
  # Show u-prefix as in Python
  def inspect; "u#{ super }" end
  # Count multibyte characters
  def length; self.scan(/./).length end
  # Reverse the string by character; wrap the result so it stays a UString
  def reverse; UString.new(self.scan(/./).reverse.join) end
end
module Kernel
  def u( str )
    UString.new str.gsub(/U\+([0-9a-fA-F]{4,4})/u){["#$1".hex ].pack('U*')}
  end
end
str = u"Ruby-語"
str.length #=> 6
str.reverse #=> u"語-ybuR"
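The u helper also expands U+XXXX escapes, which the example above never exercises. Here is a small sketch of just that step, pulled out on its own. The name expand_codepoints is made up for illustration, and U+8A9E is simply a code point picked as sample input.

```ruby
# Hypothetical stand-alone version of the substitution inside Kernel#u
def expand_codepoints(str)
  # Each U+XXXX (exactly four hex digits) becomes the UTF-8 character it names
  str.gsub(/U\+([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

expanded   = expand_codepoints("Ruby-U+8A9E")
codepoints = expanded.unpack('U*')  # one integer per character
codepoints.length                   # 6: five ASCII characters plus U+8A9E
```

Since the pattern takes exactly four hex digits, anything above U+FFFF would be cut short, so this only covers the Basic Multilingual Plane.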
Anyway, Patrick’s blog is a great tour through easily digestible tidbits about Unicode. (Thanks, Jonas!)