rs_encode_msg has inconsistent behavior depending on input types and `c_exp` value #86

christakahashi · 2025-05-05T20:23:19Z

Version: >=2.1.x, python native.

issue

rs_encode_msg can have inconsistent behavior.

when c_exp<=8 it uses bytearray's interpretation of the data to convert it into bytes. This seems to do the "right thing" in most cases as bytes are the smallest unit data can generally be divided into. It does however fail when giving a string as there is no encoding passed.

when c_exp>8, array.array() is used and some type checks are used to decide what to do:

strings are converted to bytes with the "latin-1" encoding.
bytes objects are broken up byte by byte even if c_exp>=16 and symbols can hold multiple bytes.
bytearrays get the default behavior of array.array('i', obj) and are therefor packed in groups of 4 bytes per symbol, which causes an error unless c_exp=32. (note 'i' on the anaconda distribution of windows 64bit python 3.12 is 4 bytes, but may be 2 bytes on other platforms/compilers)

comments/opinions

I think the string encoding of "latin-1" is not a good choice for python 3 users. The code comments imply that UTF8 "mangles" data but I assume that's a vestigial opinion from python 2 where binary data was often held in strings. These days with python 3 i think UTF8 is the correct default behavior for encoding actual string, especially given the growth of non-english text in computing in the last two decades.

i was alarmed to find bytes and bytearrays treated differently as inputs when c_exp>8. Thankfully, use of bytearrays almost always crashes (when c_exp>8 so it doesn't cause runtime errors.

Solutions

The way I see it, there's a few options

Do nothing. the api is the way it is and keeping strict compatibility with 1.x versions is a priority (I haven't confirmed 1.7 has the same behavior). Add documentation clarifying behavior.
Since version 2 hasn't seen a full release yet it seems acceptable to update the API (especially since this is undocumented behavior). I'd propose only accepting bytes, and arrays/lists of integral types (including bytearray, numpy arrays) and treating each element as a symbol. IMO Users should be forced to explicitly interpret stings, which is consistent with how python works.
something in between. make both strings and bytearray work, and do the same thing for for all values of c_exp.

I'm happy to implement any of these changes and make a pull request. But I want to leave API decisions to @lrq3000 as he is the maintainer.

The text was updated successfully, but these errors were encountered:

lrq3000 · 2025-05-05T23:14:37Z

Thank you very much for your detailed report on this issue.

V2 is exactly meant to change the API to a more consistent behavior. It already implements some breaking changes. And I did not release v2 stable yet because I wanted to implement more breaking changes.

So please go ahead with the PR, please implement what would produce the most consistent behavior while keeping the most functionality (backward compatibility should not be counted as a functionality).

More precisely, reedsolo was first made with Python 2, so this is why it mixes strings and bytearrays. In fact, I implemented bytearrays later, as a way to optimize speed and produce a more consistent output.

However, this library is still meant to the general public, including developers who have no background in information theory nor communication channels and they just want to protect their strings of characters.

So I would support option 2, but with a facade interface added in the RSCodec object to allow using strings as before. This way, we would move this behavior out of the core logic and into a method of RSCodec that is meant specifically for this.

What do you think about this small extension of your 2nd solution ?

lrq3000 · 2025-05-05T23:20:25Z

Also I think the current representation of the symbols is a mess. It was the best we had at the time I think but now it's possible to do much better. But I'm not up to date at all with modern handling of this type of data, so if you have a clear idea on how to do so, please feel free to revamp this all, i mean you are free to do very breaking api changes as long as the unit tests pass (or they can be tweaked to pass). I am all for more consistent and reproducible behavior.

lrq3000 · 2025-05-05T23:22:44Z

Ah and also this may help a lot with handling c_exp smaller than 256 or greater. The lib has always been optimized for this case but theri are several odher issues about supporting other cases and I tried myself but it seems the logic conditions are not always consistent as you point out so this may be part of why it doesn't work at expected.

christakahashi · 2025-05-06T01:29:07Z

Sounds good to me. :)
Do you think the RScodec() user should be required to pass in an encoding when passing in a string? I think they should get an error if they don't provide an encoding but that's just my opinion. my second preference is to keep 'latin-1' as the default since after looking into it I see that python errors out if it encounters codepoints >255. It's a bit western-centric but seems to preserve the case where binary ends up in a string somehow, and will error out if it encounters unicode>255.

Are you okay removing python 2 compatibility code in areas I touch? or should I keep it theoretically compatible just in case?

christakahashi · 2025-05-06T01:45:55Z

re: representing symbols, in some of my work I'm working in "funny" finite fields (i.e. not 2^c). since any GF is isomorphic to F_{p^n} I think keeping symbols as integral types is optimal. As for packing them, you can get tricky to optimize space efficiency, but given the design of modern computers keeping them byte-aligned is usually fastest, and the memory wasted by leaving a few bits zero in every word is not so bad. I suspect using numpy array types would be faster and much easier to vectorize but it wouldn't be consistent with keeping things independent of libraries, which I think is a nice feature that allows reedsolo to be used in many contexts.

lrq3000 · 2025-05-06T13:17:56Z

Yes, explicit encoding should be required, and an exception raised otherwise, because I am aware as I'm sure you are that there is no way to assume a correct encoding for all cases ;-)

Latin-1 was chosen initially because the library originally supported only 2^8 galois fields, so that each symbol could not go beyond 256 values, hence latin-1 was encoding all possible symbols. Now that the module supports other fields bigger than latin-1, other encodings may be better suited.

Also IIRC this still worked for utf-8 strings because reedsolo splits the string in bytes, so they cannot technically have a bigger value than 256 (under gf 2^8). So this still allowed to protect utf-8 strings, at the cost sometimes of having a utf-8 character split between two different chunks (because one byte could be at the end of one chunk and the next byte at the start of the next chunk). This was a simplification that fitted 80% of needs imho, but I'm not sure this is still pertinent nowadays. Anyway we can specify in the readme or even the exception message to use latin-1 to restore this old behavior.

In general, I favor improvements to consistency and generalizability. API breaking is not a consideration for major branch version change such as v1 -> v2. The only constraints are 1. Remain pure python, 2. Remain as self-contained as possible, but external pure python libraries may be ok if very useful or totally optional, 3. Try to avoid changes that drastically reduce speed, and speeding tricks are very welcome as long as they don't reduce generalizability (ie it's a universal reed solomon codec).

The ultimate goals behind these "rules" is that I envisioned this module to be as future proof as possible and to work on any device as possible, including embedded ones. The goal is really to provide a rs module that works everywhere, for as long as possible in the future, and without being too slow.

So this excludes numpy unfortunately although we could have an optional numpy based module, just like we have a Cython one. In this case, I would suggest to use Galois, a numpy based extension that is more efficient for galois fields : https://github.com/mhostetter/galois -- a contributor did an implementation a few years ago but I never managed to find the time to merge it in.

About funny galois fields, I can't remember but I think I tried to implement them too, so 3^9 is supported IIRC for example, but I can't remember if they work as expected, I think that I only found documentation about binary galois fields so my implementation for "funny" galois fields was out of my own mathematical intuitions. I would be very grateful if you could check if this works as expected since you are experienced with them !

lrq3000 · 2025-05-06T13:22:33Z

About the implementation using the galois extension over numpy, here it is, and it also implements a sort of parallel decoding, apparently much faster than the current reedsolo pure python or cython implementations :

https://github.com/raeudigerRaeffi/reedsolomon/tree/preedsolo

If this interests you, please feel free to have a look. It seems to me you understand both the codebase and the theoretical underpinnings very well, so I trust your feedback. I will do my best to follow up in a timely basis to clean up and merge any of your contributions.

lrq3000 mentioned this issue May 6, 2025

Bytes vs. Symbols #80

Open

lrq3000 assigned lrq3000 and christakahashi May 6, 2025

lrq3000 added bug enhancement labels May 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

rs_encode_msg has inconsistent behavior depending on input types and `c_exp` value #86

rs_encode_msg has inconsistent behavior depending on input types and `c_exp` value #86

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rs_encode_msg has inconsistent behavior depending on input types and c_exp value #86

rs_encode_msg has inconsistent behavior depending on input types and c_exp value #86

Comments

Uh oh!

issue

comments/opinions

Solutions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rs_encode_msg has inconsistent behavior depending on input types and `c_exp` value #86

rs_encode_msg has inconsistent behavior depending on input types and `c_exp` value #86