#52 new

Problems with unicode strings

Reported by Martin Skinner | August 31st, 2009 @ 03:28 AM

Unicode JavaScript strings are not transferred to Ruby correctly. Here's an example: I create a JavaScript string consisting of a single Euro sign (see http://www.fileformat.info/info/unicode/char/20ac/index.htm)

irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'johnson'
=> true
irb(main):007:0> s = Johnson.evaluate("'\\u20AC'")
=> "\254"

In Ruby, we're getting a single byte with the value 254 (octal), which is 172 decimal, or 0xAC. So it looks like we're only getting the low byte of our 16-bit Unicode character. After scanning the Johnson code, I think I found the culprit: JS_GetStringBytes returns the bytes of a UTF-16 string by stripping off the high bytes. The SpiderMonkey documentation notes:

Note that for non-ASCII strings, if JS_CStringsAreUTF8 is false, these functions can return a corrupted copy of the contents of the string. Use JS_GetStringChars to access the 16-bit characters of a JavaScript string without conversions or copying.
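To illustrate the suspected truncation, here's a minimal Ruby sketch (not Johnson code): keeping only the low byte of the Euro sign's 16-bit code unit yields exactly the 0xAC byte shown in the irb session above.

```ruby
# The Euro sign is a single 16-bit code unit in UTF-16.
euro = 0x20AC

# Dropping the high byte (what JS_GetStringBytes appears to do
# when JS_CStringsAreUTF8 is false) leaves only the low byte.
low_byte = euro & 0xFF

puts format("0x%02X", low_byte)  # prints 0xAC (172 decimal, \254 octal)
```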

A similar problem probably exists in the other direction (Ruby -> JS) too.

I suggest trying JS_CStringsAreUTF8 (which may solve both problems). If this fails, then Johnson would have to extract the UTF-16 characters from SpiderMonkey and convert them to a Ruby-friendly encoding.
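The conversion itself is mechanical. Here's a Ruby sketch of turning the 16-bit code units that JS_GetStringChars would hand back into a UTF-8 string, including surrogate-pair handling for characters outside the BMP. (The real fix would live in Johnson's C layer; the function name here is illustrative.)

```ruby
# Hypothetical helper: convert an array of UTF-16 code units
# (as JS_GetStringChars would return them) to a UTF-8 string.
def utf16_units_to_utf8(units)
  codepoints = []
  i = 0
  while i < units.length
    u = units[i]
    if u.between?(0xD800, 0xDBFF) && i + 1 < units.length
      # High surrogate followed by low surrogate: combine into
      # a single code point above U+FFFF.
      lo = units[i + 1]
      codepoints << (0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
      i += 2
    else
      codepoints << u
      i += 1
    end
  end
  codepoints.pack("U*")  # "U" packs code points as UTF-8
end

puts utf16_units_to_utf8([0x20AC]).bytes.inspect  # => [226, 130, 172], i.e. UTF-8 for the Euro sign
```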
