This post originated from an RSS feed registered with Ruby Buzz
by Red Handed.
Original Post: Mucking With Unicode for 1.8
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system
The idea with this little project is to enhance strings in Ruby 1.8 to support encodings: following Matz's plan, without breaking extensions, and while still allowing raw strings.
For now, you have to specify the encoding when you create the string:
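The post doesn't show the patch's actual syntax, so here is only a guess at the shape of it, sketched as a plain-Ruby wrapper — the name EncodedString and its signature are invented for illustration:

```ruby
# Hypothetical sketch -- not the patch's real API. The point is just
# that the encoding is supplied at creation time and travels with the
# raw bytes.
class EncodedString
  attr_reader :encoding

  def initialize(raw, encoding)
    @raw, @encoding = raw, encoding
  end

  def to_s
    @raw
  end
end

s = EncodedString.new("caf\xC3\xA9", "utf-8")
s.encoding  # => "utf-8"
```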
I can’t use wchar, since I’m building on the RString struct, which stores the raw bytes in RSTRING(str)->ptr. And I’ve got to hook into Ruby’s regexps; can’t ignore that. So, instead, I’ve added an indexed array of character lengths. I’m not suggesting this is the answer, but consider how little we have out there. When a string is first stored, it gets validated against the rules for its encoding, and all the character sizes get stored.
The index_s method gives a list of the byte sizes for each character. Only UTF-8 is supported at present.
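A plain-Ruby guess at what that validation pass does — walk the bytes once, check them against UTF-8's rules, and record each character's byte length. (The real patch does this in C when the string is stored; this sketch uses modern Ruby string methods for clarity, and is slightly looser than a full UTF-8 validator.)

```ruby
# Returns the byte size of each character, raising if the bytes
# aren't plausible UTF-8 -- a sketch of what index_s might report.
def utf8_char_sizes(str)
  sizes = []
  i = 0
  while i < str.bytesize
    lead = str.getbyte(i)
    size = case lead
           when 0x00..0x7F then 1   # plain ASCII
           when 0xC2..0xDF then 2
           when 0xE0..0xEF then 3
           when 0xF0..0xF4 then 4
           else raise ArgumentError, "invalid leading byte at #{i}"
           end
    (1...size).each do |k|         # continuation bytes must be 10xxxxxx
      b = str.getbyte(i + k)
      if b.nil? || (b & 0xC0) != 0x80
        raise ArgumentError, "bad continuation byte at #{i + k}"
      end
    end
    sizes << size
    i += size
  end
  sizes
end

utf8_char_sizes("a\xC3\xA9\xE2\x82\xAC")  # => [1, 2, 3]
```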
Speed is pretty good: creating, concatenating, and dup’ing strings end up generally just as fast as builtin strings. Substrings and slicing don’t compare, though. Not much additional memory is used, either: one 4-byte index entry is stored for every 16 characters, so it’s about 20-25% overhead on top of the raw string.
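The 4-bytes-per-16-characters figure suggests a sparse offset index: record the byte offset of every 16th character, then finding character n costs one array lookup plus a walk over at most 15 character sizes. The stride and layout here are guesses, sketched in Ruby:

```ruby
# Guesswork on the index layout: one 4-byte byte-offset per STRIDE
# characters, built from the per-character sizes.
STRIDE = 16

def build_offset_index(char_sizes)
  index = []
  offset = 0
  char_sizes.each_with_index do |size, n|
    index << offset if (n % STRIDE).zero?
    offset += size
  end
  index
end

# Byte offset of character n: jump to the nearest stored offset,
# then walk forward at most STRIDE - 1 character sizes.
def byte_offset_of(char_sizes, index, n)
  base = (n / STRIDE) * STRIDE
  offset = index[n / STRIDE]
  (base...n).each { |k| offset += char_sizes[k] }
  offset
end
```

For an all-ASCII string this costs 4 index bytes per 16 one-byte characters, i.e. 25%; wider characters push the ratio down, which lines up with the 20-25% figure.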
The repository is here. I could use some help finding a replacement for bitcopy, which is like a memcpy with bit offsets. The one I’m using is fast but buggy.
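The post doesn't show bitcopy's signature, so the one below is assumed (memcpy-like, MSB-first bit order, byte arrays for buffers). It's a slow but straightforward Ruby rendering of the described semantics — the sort of reference version a fast C replacement could be checked against:

```ruby
# Copy nbits bits from src (array of byte values) starting at bit
# offset src_bit into dst starting at bit offset dst_bit. Bit 0 is
# the most significant bit of byte 0. Slow, one bit at a time, but
# easy to trust.
def bitcopy(src, src_bit, dst, dst_bit, nbits)
  nbits.times do |i|
    s = src_bit + i
    bit = (src[s / 8] >> (7 - s % 8)) & 1
    d = dst_bit + i
    mask = 1 << (7 - d % 8)
    dst[d / 8] = bit == 1 ? (dst[d / 8] | mask) : (dst[d / 8] & ~mask)
  end
  dst
end

src = [0b1011_0000]
dst = [0x00]
bitcopy(src, 0, dst, 4, 4)
dst  # => [0b0000_1011]
```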