The Artima Developer Community
Ruby Buzz Forum
Mucking With Unicode for 1.8

Red Handed

Red Handed is a Ruby-focused group blog.
Mucking With Unicode for 1.8 Posted: Jul 18, 2006 12:56 AM

This post originated from an RSS feed registered with Ruby Buzz by Red Handed.
Original Post: Mucking With Unicode for 1.8
Feed Title: RedHanded
Feed URL: http://redhanded.hobix.com/index.xml
Feed Description: sneaking Ruby through the system

The idea with this little project is to enhance strings in Ruby 1.8 to support encodings, following Matz’s plan, without breaking extensions, and while still allowing raw strings.

For now, you have to specify the encoding when you create the string:

 >> str = utf8("色は匂へど 散りぬるを")
 >> str[1,2]
 => は匂
 >> str[/散(.{2})/u, 1]
 => りぬ

I can’t use wchar, since I’m adding onto the RString class, which stores the raw bytes in RSTRING(str)->ptr. And I’ve got to hook into Ruby’s regexps, can’t ignore that. So, instead, I’ve added an indexed array of character lengths. I’m not suggesting this is the answer, but consider that we have so little out there. When the string initially gets stored, it gets validated against the rules for the encoding and all the character sizes get stored.

 >> require 'wordy'
 >> utf8("ვეპხის ტყაოსანი").index_s
 => [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]

The index_s method gives a list of the byte sizes for each character. I only support UTF-8 presently.
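To make the idea concrete, here is a minimal pure-Ruby sketch of computing such a list of per-character byte sizes while validating the input. It is not the extension's C code: the names utf8_char_lengths and char_slice are made up for illustration, the validation is simplified (it checks lead and continuation bytes but not every overlong or surrogate case), and it assumes a Ruby recent enough to have String#byteslice.

```ruby
# Hypothetical helper, not part of wordy: returns the byte length of each
# UTF-8 character in str, raising ArgumentError on malformed input.
# Simplified validation: lead byte ranges and continuation bytes only.
def utf8_char_lengths(str)
  bytes = str.unpack("C*")
  lengths = []
  i = 0
  while i < bytes.size
    lead = bytes[i]
    len =
      if lead < 0x80 then 1                       # 0xxxxxxx (ASCII)
      elsif lead >= 0xC2 && lead <= 0xDF then 2   # 110xxxxx
      elsif lead >= 0xE0 && lead <= 0xEF then 3   # 1110xxxx
      elsif lead >= 0xF0 && lead <= 0xF4 then 4   # 11110xxx
      else raise ArgumentError, "invalid lead byte at offset #{i}"
      end
    # Every following byte of the character must look like 10xxxxxx.
    tail = bytes[i + 1, len - 1]
    unless tail.size == len - 1 && tail.all? { |b| (b & 0xC0) == 0x80 }
      raise ArgumentError, "truncated or invalid character at offset #{i}"
    end
    lengths << len
    i += len
  end
  lengths
end

# Hypothetical helper: character-based slice built on the length list,
# converting character positions to byte positions by summing lengths.
def char_slice(str, lengths, start, count)
  byte_start = lengths[0, start].to_a.inject(0, :+)
  byte_len   = lengths[start, count].to_a.inject(0, :+)
  str.byteslice(byte_start, byte_len)
end

lengths = utf8_char_lengths("ვეპხის ტყაოსანი")
# => [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]

jp = "色は匂へど 散りぬるを"
char_slice(jp, utf8_char_lengths(jp), 1, 2)
# => "は匂"
```

Summing lengths on every slice is linear in the start position, which gives a feel for why slicing is the expensive operation with this kind of index.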

The speed is pretty good. Creating new strings, adding strings, and dup’ing strings end up being generally just as fast as with builtin strings. Substrings and slicing are slower, though. But not much additional memory is used. One 4-byte index is used for every 16 characters. So, it’s about 20-25% over the raw string.
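One way to fit 16 character sizes into 4 bytes is to spend 2 bits per character, storing the size minus one. That layout is an assumption on my part, as are the names pack_lengths, unpack_length and index_overhead below; the post only states the 4-bytes-per-16-characters figure. A rough sketch of the packing and of the resulting overhead arithmetic:

```ruby
CHARS_PER_WORD = 16 # assumed layout: 16 characters x 2 bits = one 32-bit word

# Packs byte lengths (1..4) into 32-bit integers, 2 bits per character,
# storing (length - 1) with the first character in the lowest bits.
def pack_lengths(lengths)
  lengths.each_slice(CHARS_PER_WORD).map do |group|
    word = 0
    group.each_with_index { |len, i| word |= (len - 1) << (2 * i) }
    word
  end
end

# Reads back the byte length of the character at char_index.
def unpack_length(words, char_index)
  word = words[char_index / CHARS_PER_WORD]
  ((word >> (2 * (char_index % CHARS_PER_WORD))) & 0b11) + 1
end

# Index bytes divided by raw string bytes, as a fraction.
def index_overhead(lengths)
  index_bytes = 4 * ((lengths.size + CHARS_PER_WORD - 1) / CHARS_PER_WORD)
  index_bytes.to_f / lengths.inject(0, :+)
end

lengths = [3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3]
words = pack_lengths(lengths)
unpack_length(words, 6)     # => 1 (the space)
index_overhead([1] * 16)    # => 0.25 for sixteen single-byte characters
```

Under this assumed layout the relative overhead is highest for single-byte text and shrinks as characters get wider, since the index cost per character is fixed while the raw size grows.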

The repository is here. I could use some help finding a replacement for bitcopy, which is like a memcpy with bit offsets. The one I’m using is fast but buggy.
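For anyone testing a candidate replacement, a slow but simple reference version is handy to compare against. The sketch below is only that: a bit-at-a-time copy in Ruby over arrays of byte values, not the C routine the extension uses. The name bitcopy_reference, the argument order, and the bit numbering (bit 0 is the least significant bit of byte 0) are all assumptions, and like memcpy it does not handle overlapping regions.

```ruby
# Reference bit copy: copies nbits bits from src, starting at bit offset
# src_off, into dst starting at bit offset dst_off. Both src and dst are
# arrays of byte values (0..255). Bits are numbered LSB-first within each
# byte (an assumption). Overlapping regions are not supported.
def bitcopy_reference(dst, dst_off, src, src_off, nbits)
  nbits.times do |k|
    s = src_off + k
    d = dst_off + k
    bit = (src[s / 8] >> (s % 8)) & 1
    mask = 1 << (d % 8)
    if bit == 1
      dst[d / 8] |= mask            # set the destination bit
    else
      dst[d / 8] &= (~mask & 0xFF)  # clear the destination bit
    end
  end
  dst
end

dst = [0, 0]
bitcopy_reference(dst, 6, [0b10110100], 2, 4)
# dst is now [64, 3]: source bits 2..5 (1, 0, 1, 1) land on destination bits 6..9
```

A fast implementation can then be checked against this one over random offsets and lengths, which is usually where the off-by-one bugs in word-at-a-time versions show up.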
