-
Notifications
You must be signed in to change notification settings - Fork 35
Endian-neutrality #328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Having looked over the code, the binary formats are simply not portable, even across architectures of the same endianness. |
Is the only additional issue besides endianness datatype size? If so, I would be prepared to write a portable mode for the optimized-lookup conversion process and ospell reading. |
4 things that I've found:
One can solve endianness, precise width, and signed values together by using ZigZag + Varint encoding, similar to what Google Protocolbuffers does. Doing so prevents memory mapping, though compression also does so I doubt that's a concern. Floating point does not have a binary portable format, so it must be decomposed/recomposed as two integers with e.g. But, since this will break the binary formats everywhere anyway, Flammie had a wishlist of features to add to the format, and I'm sure others have desired fixes as well. |
Started https://github.com/hfst/hfst/tree/portable branch to deal with this. Goal is to read and write a new portable 4.0 format, while retaining reading of old 3.0 format, but dropping 2.0 format. This is a major change, because even the very first header length is written in a non-portable fashion, so it all snowballs from that point on. Changing to using 64bit varint with zigzag encoding everywhere, and decomposing floating point as needed. |
About memory-mapping: that feature is crucial for effective use of hfst-ospell-based spellers on mobile phones, especially since fst size can grow beyond os-wide memory use limits in e.g. iOS. There is a fork of hfst-ospell with memory-mapping functionality at https://github.com/bbqsrc/hfst-ospell. |
I have a somewhat simpler proposal, at least for ospell's needs. The optimized-lookup format doesn't need signed integers, and I think we could quite simply detect endianness and flip byte order on reading and writing accordingly. So in effect standardize on
I have the changes ready for the optimized-lookup part, we'd only need to support this in the HFST3 header length field. What do you think? |
In effect I mean a function like bool is_big_endian(void) and writing/reading functions that are used depending on is_big_endian() and look like void write_uint16_flipping_endianness(uint16_t val, std::ostream& os) void write_uint32_flipping_endianness(uint32_t val, std::ostream& os) void write_bool_as_uint32(bool val, std::ostream& os) void write_float_flipping_endianness(float val, std::ostream& os) |
Also: I think it would be reasonable to go always-weighted. |
That is exactly what I am doing, but absolutely not using full 32 bit for a bool. 8 bit is enough. Also not detecting endianness at runtime - we have autoconf But, floating point is very dangerous to serialize as-is. Will have to check all the platforms to see if their in-memory representation is directly usable. If not, just multiply by 10000 and store as uint64_t. |
The nice thing about using 32 bits for bools is that it preserves compatibility with existing binary transducers. There are 9 bools in the header, so it's only a waste of 216 bits. I think is_big_endian() will probably be a compile-time constant, but ok. I was under the impression that in modern processors floating point serialization like that should not be a problem, but I may be wrong.. |
But do you still think we need a new 4.0 format rather than just interoperate with 3.0? After all, optimized-lookup is the only binary format under our own control.. |
There is a desire to make the header more flexible and to make it generally easier mmap'able, plus that all existing big endian files will be thrown under the bus regardless of how minimal the changes are. As for float formats, these are the platforms we need to support at minimum: https://www.debian.org/ports/#portlist-released |
Okay. When you say "the header", are you talking about the HFST3 header, the optimized-lookup header or both? Are you planning on getting rid of the separate optimized-lookup header? |
The HFST3 header. The very first length is written in a dependent way, and Flammie had issues with it being limited to 16 bit. |
Okay. And for the optimized-lookup header, is the only change you're planning to make bools 8-bit instead of 32-bit? Perhaps we should encode them all in two bytes if we're trying to save space? |
If that makes sense, sure. I didn't even look at where the bools were coming from when I made my first pass - I just saw them being written as 32bit and changed it. |
I'm just still not completely sold on the need for varints + zigzag. The only thing in the HFST3 header with binary data is the header length field, and that's unsigned. If it needs to be bigger, why not just make it bigger? Other numbers can be text-encoded in the header properties, right? It's already quite flexible in that sense. Current HFST3 files look like:
Then come implementation-specific headers, which for optimized-lookup look like
Then comes the alphabet, which is utf-8 strings separated by null bytes, and then the tables. Do you have a proposal for what the HFST4 header will look like? And do you see the optimized-lookup part of staying as it is apart from the bools? |
Ah, part of this was discussed on IRC. varint is out, because that's not mmap-compatible. So yes, it will be a mostly minimal change that still uses uint16_t and uint32_t. The HFST4 header would look the same, just be HFST4 + portable uint32_t length to distinguish it from native HFST3 header. That's pretty much the only way to allow loading old big endian files, since ungetc() can only push back 1 byte. The OL header doesn't need to change, I guess, but uint16_t for symbols looks like a limitation given that Unicode is 21 bits. |
You don't even need that. https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html uint32_t read_beu32(const uint8_t *buf)
{
return buf[3] | (buf[2]<<8) | (buf[1]<<16) | (buf[0]<<24);
} into e.g. an amd64 |
HFST's binary formats are not endian-neutral, which is a quite big issue. I'm looking into how to best fix this, which may result in a breaking change of the binary formats. I've got a PowerPC chroot that can reproduce the problem.
https://bugs.debian.org/827199
https://buildd.debian.org/status/logs.php?pkg=hfst
The text was updated successfully, but these errors were encountered: