Ruby Forum IronRuby > Bytes or Characters?

Posted by Charles Oliver Nutter (Guest)
on 08.08.2008 00:30
(Received via mailing list)
Hey, I'm curious how IronRuby is handling the bytes versus characters
issue for Ruby strings. JRuby currently only has byte[]-based strings, a
decision we made mostly for Ruby performance. But it has obvious
implications for calling Java code, since we need to decode and encode
the byte[] to char[] on the way in and out. Ultimately the decision to
use byte[]-based strings was the right one, since so much of Ruby
expects byte counts and uses String as a generic byte bucket. But more
and more we've started to consider ways to hybridize String so it's
characters when we want it to be and bytes otherwise.

So, what does IronRuby do?

- Charlie
Posted by Tomas Matousek (Guest)
on 08.08.2008 02:49
(Received via mailing list)
We have a hybrid representation that converts content lazily as needed.
The code that's currently checked in is a first implementation I coded
in a day before RailsConf, so it is pretty basic, is not tested
thoroughly and has a bunch of bugs I already know about. I'm working on
some improvements right now.

Here's the checkin comment that explains briefly how it works. Note that 
some details are subject to change:

A new implementation for Ruby MutableString and Ruby regular expression 
wrappers.
This is just the first pass, w/o optimizations and w/o encodings 
(Default system encoding is used for all strings).
Many improvements and adjustments will come in the future, and some
hacks will be removed.

Basic architecture:
MutableString holds a Content and an Encoding. Content is an abstract
class that has three subclasses:
1)      StringContent
-       Holds an instance of System.String - an immutable .NET
string. This is the default representation for strings coming from CLR
methods and for Ruby string literals.
-       A textual write operation on the mutable string that has this 
content representation will cause implicit conversion of the 
representation to StringBuilderContent.
-       A binary read/write operation triggers a transition to 
BinaryContent using the Encoding stored on the owning MutableString.

2)      StringBuilderContent
-       Holds an instance of System.Text.StringBuilder - a mutable
Unicode string.
-       A binary read/write operation transforms the content to 
BinaryContent representation.
-       StringBuilder is not optimal for some operations (it requires
unnecessary copying), so we may consider replacing it with a resizable
char[].

3)      BinaryContent
-       A textual read/write operation transforms the content to 
StringBuilderContent representation.
-       List<byte> is currently used, but it doesn't fit many operations
very well. We should replace it with a resizable byte[].

The content representation is changed based upon the operations that are
performed on the mutable string. There is currently no limit on the
number of content type switches, so if one alternates binary and textual
operations a conversion will take place for each one of them. Although
this shouldn't be a common case, we may consider adding some counters and
keeping the representation binary or textual based upon their values.
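
As a rough illustration of the lazy switching described above, here is a minimal Ruby sketch. It is not IronRuby's code: HybridString, its method names and the switches counter are all hypothetical, and a plain Ruby String stands in for both StringBuilder and List<byte>. It only shows how alternating textual and binary operations forces a conversion each time.

```ruby
# Hypothetical sketch, not IronRuby's implementation: a wrapper that keeps
# either a textual or a binary representation and converts only when an
# operation of the other kind is requested.
class HybridString
  attr_reader :switches

  def initialize(text, encoding = Encoding::UTF_8)
    @encoding = encoding                          # storage encoding, remembered
    @content = text.encode(Encoding::UTF_8).dup   # textual representation
    @binary = false
    @switches = 0                                 # number of conversions so far
  end

  def binary?
    @binary
  end

  # Textual read: needs characters.
  def char_at(index)
    to_text!
    @content[index]
  end

  # Textual write.
  def append_text(str)
    to_text!
    @content << str.encode(Encoding::UTF_8)
    self
  end

  # Binary reads: need bytes in the remembered storage encoding.
  def byte_at(index)
    to_binary!
    @content.getbyte(index)
  end

  def byte_length
    to_binary!
    @content.bytesize
  end

  private

  def to_binary!
    return if @binary
    @content = @content.encode(@encoding).force_encoding(Encoding::BINARY)
    @binary = true
    @switches += 1
  end

  def to_text!
    return unless @binary
    @content = @content.force_encoding(@encoding).encode(Encoding::UTF_8)
    @binary = false
    @switches += 1
  end
end

s = HybridString.new("h\u00e9llo", Encoding::ISO_8859_1)
s.byte_length   # binary op: converts text -> bytes
s.char_at(1)    # textual op: converts bytes -> text
s.byte_at(1)    # binary op again: a third conversion
puts s.switches
```

A counter like switches is the kind of value one could consult to pin a string to one representation once it has flipped too often, which is the heuristic mentioned above.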

The design assumes that the operations implemented by library methods
are of two kinds: textual and binary, and that data once treated as
text are not usually treated as raw binary data later. Any text in the
IronRuby runtime is represented as a sequence of 16-bit Unicode
characters (UTF-16 code units, the standard .NET representation). Any
binary data treated as text is converted to this representation,
regardless of the encoding used for storage in the file. The encoding
is remembered in the MutableString instance, so the original
representation can always be recreated. Not all Unicode characters fit
into 16 bits, therefore some exotic ones are represented by multiple
characters (surrogates). If there is such a character in the string,
some operations (e.g. indexing) might not be precise anymore - the n-th
item in the char[] isn't the n-th Unicode character in the string. We
believe this imprecision is not a real-world issue and is worth the
performance gain and implementation simplicity.
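
To make the surrogate caveat concrete, here is a small sketch in plain Ruby (1.9 or later, not IronRuby). It encodes a string containing one character outside the 16-bit range as UTF-16 and compares indexing by code point with indexing by 16-bit unit, which is what a .NET char[] would hold. The sample string is just an assumption chosen for illustration.

```ruby
# 'a', U+1F600 (outside the 16-bit range), 'b'
text = "a\u{1F600}b"

# The same text as 16-bit UTF-16 code units, as a .NET char[] would store it.
code_units = text.encode(Encoding::UTF_16LE).unpack("v*")

puts text.length                 # number of Unicode characters
puts code_units.length           # number of 16-bit units
puts text[2]                     # third character of the string
puts format("0x%04X", code_units[2])  # third 16-bit unit: half of a surrogate pair
```

The string has three characters but four 16-bit units, so index 2 names 'b' when counting characters and the low surrogate 0xDE00 when counting units.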

Tomas
Posted by Charles Oliver Nutter (Guest)
on 08.08.2008 23:09
(Received via mailing list)
Tomas Matousek wrote:
> The content representation is changed based upon operations that are performed on the mutable string. There is currently no limit on number of content type switches, so if one alternates binary and textual operations the conversion will take place for each one of them. Although this shouldn't be a common case we may consider to add some counters and keep the representation binary/textual based upon their values.

Ok, so what constitutes a binary operation and what constitutes a textual
operation? It seems like the potential for ping-ponging between the two
representations would be a serious risk. And largely that's why we ended
up going with a single representation, since so many APIs did pass
Strings around, manipulate them, index specific characters, write them
through some stream to somewhere else, and repeat.

Of course if the ping-pong isn't bad there could probably be some
formalized list of rules. Such a set of "binary" operations and
"textual" operations would be useful to JRuby and MacRuby, in addition
to IronRuby.

Here's an example we ran into, however: regexp matching against binary
content. I know of at least one library that uses regexp to parse out a
binary file header. How would this work under IronRuby? Also, there's
the concern about conversion from binary to text at inopportune moments,
which could for example corrupt binary content that could not be decoded
into valid UTF-16 characters. In our case, long ago, we represented all
such binary content as "plain-encoded" UTF-16 with only the low byte
set, but that obviously wasn't a whole lot better than just using bytes,
and it was additionally way slower.
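
To make both concerns concrete, here is a sketch in plain Ruby (2.1 or later for String#scrub; this is not JRuby or IronRuby internals). The PNG-style header and the byte values are made-up sample data, not taken from the library mentioned above. The first part matches a regexp directly against bytes; the second shows that bytes which are not valid text cannot be decoded and re-encoded without loss.

```ruby
# 1) Regexp matching against binary content: parse a length field out of
#    a PNG-like header (sample bytes, assumed for illustration).
header = "\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR".b
match = /\A\x89PNG\r\n\x1A\n(.{4})IHDR/mn.match(header)  # /n: byte-oriented regexp
chunk_length = match[1].unpack("N").first                # big-endian 32-bit integer
puts chunk_length

# 2) Decoding arbitrary bytes as text can be lossy.
bytes = "\xFF\xFE\x00A".b
as_text = bytes.dup.force_encoding(Encoding::UTF_8)
puts as_text.valid_encoding?          # these bytes are not valid UTF-8

# Replacing the undecodable bytes yields text, but the original bytes are gone.
lossy = as_text.scrub("\uFFFD")
puts lossy.bytes == bytes.bytes
```

With a byte-based string the regexp simply runs over the bytes. With a text-based string, step 2 is what an implicit conversion would have to avoid.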

I imagine this would also impact copy-on-write capabilities too, yes?
Since there would be operations that could completely change the backing
store of a string.

> The design assumes that the nature of operations implemented by library methods is of two kinds: textual and binary. And that data that are once treated as text are not usually treated as raw binary data later. Any text in the IronRuby runtime is represented as a sequence of 16bit Unicode characters (standard .NET representation). Each binary data treated as text is converted to this representation, regardless of the encoding used for storage representation in the file. The encoding is remembered in the MutableString instance and the original representation could be always recreated. Not all Unicode characters fit into 16 bits, therefore some exotic ones are represented by multiple characters (surrogates). If there is such a character in the string, some operations (e.g. indexing) might not be precise anymore - the n-th item in the char[] isn't the n-th Unicode character in the string. We believe this impreciseness is not a real world issue and is worth performance gain and implementation simplicity.

I guess one obvious question here would be supporting multiple
encodings, as in Ruby 1.9. With a byte[]-based string and JOni
(Oniguruma port) it shouldn't be too difficult to add 1.9 string logic
into JRuby. But it seems like it would be harder if we put in place the
same rules you have for converting text into the platform's preferred
format under certain circumstances.

- Charlie