Problems with unicode strings
Reported by Martin Skinner | August 31st, 2009 @ 03:28 AM
Unicode JavaScript strings are not transferred to Ruby correctly. Here's an example: I create a JavaScript string consisting of a single Euro sign (see http://www.fileformat.info/info/unicode/char/20ac/index.htm)
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'johnson'
=> true
irb(main):007:0> s = Johnson.evaluate("'\\u20AC'")
=> "\254"
In Ruby, we're getting a single byte with the value 254 (octal), which is 172 decimal, or 0xAC. So it looks like we're only getting the low byte of our 16-bit Unicode character. After scanning the Johnson code, I think I found the culprit: JS_GetStringBytes returns the bytes of a UTF-16 string by stripping off the high bytes. The SpiderMonkey documentation warns:
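The low-byte stripping is easy to verify with plain Ruby arithmetic; this is just an illustration of the arithmetic described above, not Johnson code:

```ruby
# U+20AC (Euro sign) as a 16-bit code unit
codepoint = 0x20AC

# Keeping only the low byte reproduces the corrupted result
low_byte = codepoint & 0xFF

low_byte == 0xAC  # => true
low_byte          # => 172 decimal, i.e. "\254" in Ruby's octal escape
```

This matches the `"\254"` returned by the irb session above exactly.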
> Note that for non-ASCII strings, if JS_CStringsAreUTF8 is false, these functions can return a corrupted copy of the contents of the string. Use JS_GetStringChars to access the 16-bit characters of a JavaScript string without conversions or copying.
A similar problem probably exists in the other direction (Ruby -> JS) too. I suggest trying JS_CStringsAreUTF8 (which may solve both problems). If this fails, then Johnson would have to extract the UTF-16 chars from SpiderMonkey and convert them to a Ruby-friendly encoding.
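The fallback conversion is straightforward on the Ruby side. A minimal sketch, assuming the 16-bit code units have already been pulled out of SpiderMonkey via JS_GetStringChars (and ignoring surrogate pairs for simplicity):

```ruby
# Hypothetical: the 16-bit code units returned by JS_GetStringChars
# for a string containing a single Euro sign
jschars = [0x20AC]

# Array#pack with the "U" directive treats each number as a Unicode
# codepoint and emits its UTF-8 encoding
utf8 = jschars.pack("U*")

utf8.bytes.to_a  # => [0xE2, 0x82, 0xAC] -- the correct 3-byte UTF-8 Euro sign
```

Characters outside the Basic Multilingual Plane would additionally need surrogate-pair decoding before packing, since SpiderMonkey stores them as two 16-bit units.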