This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Mucking With Unicode for 1.8
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
The idea with this little project is to enhance strings in Ruby 1.8 to support encodings: following Matz's plan, without breaking extensions, and while still allowing raw strings.
For now, you have to specify the encoding when you create the string:
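The post doesn't show the patch's actual syntax, so here is only a guess at the shape of it, sketched as a plain-Ruby wrapper — the name EncodedString and its signature are invented for illustration:

```ruby
# Hypothetical sketch -- not the patch's real API. The point is just
# that the encoding is supplied at creation time and travels with the
# raw bytes.
class EncodedString
  attr_reader :encoding

  def initialize(raw, encoding)
    @raw, @encoding = raw, encoding
  end

  def to_s
    @raw
  end
end

s = EncodedString.new("caf\xC3\xA9", "utf-8")
s.encoding  # => "utf-8"
```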
I can’t use wchar, since I’m building on the RString struct, which stores the raw bytes in RSTRING(str)->ptr. And I’ve got to hook into Ruby’s regexps; can’t ignore that. So, instead, I’ve added an indexed array of character lengths. I’m not suggesting this is the answer, but consider how little we have out there. When a string is first stored, it gets validated against the rules for its encoding, and all the character sizes get stored.
The index_s method gives a list of the byte sizes for each character. Only UTF-8 is supported at present.
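A plain-Ruby guess at what that validation pass does — walk the bytes once, check them against UTF-8's rules, and record each character's byte length. (The real patch does this in C when the string is stored; this sketch uses modern Ruby string methods for clarity, and is slightly looser than a full UTF-8 validator.)

```ruby
# Returns the byte size of each character, raising if the bytes
# aren't plausible UTF-8 -- a sketch of what index_s might report.
def utf8_char_sizes(str)
  sizes = []
  i = 0
  while i < str.bytesize
    lead = str.getbyte(i)
    size = case lead
           when 0x00..0x7F then 1   # plain ASCII
           when 0xC2..0xDF then 2
           when 0xE0..0xEF then 3
           when 0xF0..0xF4 then 4
           else raise ArgumentError, "invalid leading byte at #{i}"
           end
    (1...size).each do |k|         # continuation bytes must be 10xxxxxx
      b = str.getbyte(i + k)
      if b.nil? || (b & 0xC0) != 0x80
        raise ArgumentError, "bad continuation byte at #{i + k}"
      end
    end
    sizes << size
    i += size
  end
  sizes
end

utf8_char_sizes("a\xC3\xA9\xE2\x82\xAC")  # => [1, 2, 3]
```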
Speed is pretty good: creating, concatenating, and dup’ing strings end up generally just as fast as builtin strings. Substrings and slicing don’t compare, though. Not much additional memory is used, either: one 4-byte index entry is stored for every 16 characters, so it’s about 20-25% overhead on top of the raw string.
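The 4-bytes-per-16-characters figure suggests a sparse offset index: record the byte offset of every 16th character, then finding character n costs one array lookup plus a walk over at most 15 character sizes. The stride and layout here are guesses, sketched in Ruby:

```ruby
# Guesswork on the index layout: one 4-byte byte-offset per STRIDE
# characters, built from the per-character sizes.
STRIDE = 16

def build_offset_index(char_sizes)
  index = []
  offset = 0
  char_sizes.each_with_index do |size, n|
    index << offset if (n % STRIDE).zero?
    offset += size
  end
  index
end

# Byte offset of character n: jump to the nearest stored offset,
# then walk forward at most STRIDE - 1 character sizes.
def byte_offset_of(char_sizes, index, n)
  base = (n / STRIDE) * STRIDE
  offset = index[n / STRIDE]
  (base...n).each { |k| offset += char_sizes[k] }
  offset
end
```

For an all-ASCII string this costs 4 index bytes per 16 one-byte characters, i.e. 25%; wider characters push the ratio down, which lines up with the 20-25% figure.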
The repository is here. I could use some help finding a replacement for bitcopy, which is like a memcpy with bit offsets. The one I’m using is fast but buggy.
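The post doesn't show bitcopy's signature, so the one below is assumed (memcpy-like, MSB-first bit order, byte arrays for buffers). It's a slow but straightforward Ruby rendering of the described semantics — the sort of reference version a fast C replacement could be checked against:

```ruby
# Copy nbits bits from src (array of byte values) starting at bit
# offset src_bit into dst starting at bit offset dst_bit. Bit 0 is
# the most significant bit of byte 0. Slow, one bit at a time, but
# easy to trust.
def bitcopy(src, src_bit, dst, dst_bit, nbits)
  nbits.times do |i|
    s = src_bit + i
    bit = (src[s / 8] >> (7 - s % 8)) & 1
    d = dst_bit + i
    mask = 1 << (7 - d % 8)
    dst[d / 8] = bit == 1 ? (dst[d / 8] | mask) : (dst[d / 8] & ~mask)
  end
  dst
end

src = [0b1011_0000]
dst = [0x00]
bitcopy(src, 0, dst, 4, 4)
dst  # => [0b0000_1011]
```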