-
Notifications
You must be signed in to change notification settings - Fork 90
rs_encode_msg has inconsistent behavior depending on input types and c_exp
value
#86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thank you very much for your detailed report on this issue. V2 is exactly meant to change the API to a more consistent behavior. It already implements some breaking changes. And I did not release v2 stable yet because I wanted to implement more breaking changes. So please go ahead with the PR, please implement what would produce the most consistent behavior while keeping the most functionality (backward compatibility should not be counted as a functionality). More precisely, reedsolo was first made with Python 2, so this is why it mixes strings and bytearrays. In fact, I implemented bytearrays later, as a way to optimize speed and produce a more consistent output. However, this library is still meant to the general public, including developers who have no background in information theory nor communication channels and they just want to protect their strings of characters. So I would support option 2, but with a facade interface added in the RSCodec object to allow using strings as before. This way, we would move this behavior out of the core logic and into a method of RSCodec that is meant specifically for this. What do you think about this small extension of your 2nd solution ? |
Also I think the current representation of the symbols is a mess. It was the best we had at the time I think but now it's possible to do much better. But I'm not up to date at all with modern handling of this type of data, so if you have a clear idea on how to do so, please feel free to revamp this all, i mean you are free to do very breaking api changes as long as the unit tests pass (or they can be tweaked to pass). I am all for more consistent and reproducible behavior. |
Ah and also this may help a lot with handling c_exp smaller than 256 or greater. The lib has always been optimized for this case but theri are several odher issues about supporting other cases and I tried myself but it seems the logic conditions are not always consistent as you point out so this may be part of why it doesn't work at expected. |
Sounds good to me. :) Are you okay removing python 2 compatibility code in areas I touch? or should I keep it theoretically compatible just in case? |
re: representing symbols, in some of my work I'm working in "funny" finite fields (i.e. not 2^c). since any GF is isomorphic to F_{p^n} I think keeping symbols as integral types is optimal. As for packing them, you can get tricky to optimize space efficiency, but given the design of modern computers keeping them byte-aligned is usually fastest, and the memory wasted by leaving a few bits zero in every word is not so bad. I suspect using numpy array types would be faster and much easier to vectorize but it wouldn't be consistent with keeping things independent of libraries, which I think is a nice feature that allows reedsolo to be used in many contexts. |
Yes, explicit encoding should be required, and an exception raised otherwise, because I am aware as I'm sure you are that there is no way to assume a correct encoding for all cases ;-) Latin-1 was chosen initially because the library originally supported only 2^8 galois fields, so that each symbol could not go beyond 256 values, hence latin-1 was encoding all possible symbols. Now that the module supports other fields bigger than latin-1, other encodings may be better suited. Also IIRC this still worked for utf-8 strings because reedsolo splits the string in bytes, so they cannot technically have a bigger value than 256 (under gf 2^8). So this still allowed to protect utf-8 strings, at the cost sometimes of having a utf-8 character split between two different chunks (because one byte could be at the end of one chunk and the next byte at the start of the next chunk). This was a simplification that fitted 80% of needs imho, but I'm not sure this is still pertinent nowadays. Anyway we can specify in the readme or even the exception message to use latin-1 to restore this old behavior. In general, I favor improvements to consistency and generalizability. API breaking is not a consideration for major branch version change such as v1 -> v2. The only constraints are 1. Remain pure python, 2. Remain as self-contained as possible, but external pure python libraries may be ok if very useful or totally optional, 3. Try to avoid changes that drastically reduce speed, and speeding tricks are very welcome as long as they don't reduce generalizability (ie it's a universal reed solomon codec). The ultimate goals behind these "rules" is that I envisioned this module to be as future proof as possible and to work on any device as possible, including embedded ones. The goal is really to provide a rs module that works everywhere, for as long as possible in the future, and without being too slow. So this excludes numpy unfortunately although we could have an optional numpy based module, just like we have a Cython one. In this case, I would suggest to use Galois, a numpy based extension that is more efficient for galois fields : https://github.com/mhostetter/galois -- a contributor did an implementation a few years ago but I never managed to find the time to merge it in. About funny galois fields, I can't remember but I think I tried to implement them too, so 3^9 is supported IIRC for example, but I can't remember if they work as expected, I think that I only found documentation about binary galois fields so my implementation for "funny" galois fields was out of my own mathematical intuitions. I would be very grateful if you could check if this works as expected since you are experienced with them ! |
About the implementation using the galois extension over numpy, here it is, and it also implements a sort of parallel decoding, apparently much faster than the current reedsolo pure python or cython implementations : https://github.com/raeudigerRaeffi/reedsolomon/tree/preedsolo If this interests you, please feel free to have a look. It seems to me you understand both the codebase and the theoretical underpinnings very well, so I trust your feedback. I will do my best to follow up in a timely basis to clean up and merge any of your contributions. |
Uh oh!
There was an error while loading. Please reload this page.
Version: >=2.1.x, python native.
issue
rs_encode_msg can have inconsistent behavior.
when
c_exp<=8
it usesbytearray
's interpretation of the data to convert it into bytes. This seems to do the "right thing" in most cases as bytes are the smallest unit data can generally be divided into. It does however fail when giving a string as there is no encoding passed.when
c_exp>8
,array.array()
is used and some type checks are used to decide what to do:"latin-1"
encoding.bytes
objects are broken up byte by byte even ifc_exp>=16
and symbols can hold multiple bytes.bytearray
s get the default behavior ofarray.array('i', obj)
and are therefor packed in groups of 4 bytes per symbol, which causes an error unlessc_exp=32
. (note 'i' on the anaconda distribution of windows 64bit python 3.12 is 4 bytes, but may be 2 bytes on other platforms/compilers)comments/opinions
I think the string encoding of "latin-1" is not a good choice for python 3 users. The code comments imply that UTF8 "mangles" data but I assume that's a vestigial opinion from python 2 where binary data was often held in strings. These days with python 3 i think UTF8 is the correct default behavior for encoding actual string, especially given the growth of non-english text in computing in the last two decades.
i was alarmed to find
bytes
andbytearray
s treated differently as inputs whenc_exp>8
. Thankfully, use ofbytearray
s almost always crashes (whenc_exp>8
so it doesn't cause runtime errors.Solutions
The way I see it, there's a few options
bytes
, and arrays/lists of integral types (includingbytearray
, numpy arrays) and treating each element as a symbol. IMO Users should be forced to explicitly interpret stings, which is consistent with how python works.bytearray
work, and do the same thing for for all values ofc_exp
.I'm happy to implement any of these changes and make a pull request. But I want to leave API decisions to @lrq3000 as he is the maintainer.
The text was updated successfully, but these errors were encountered: