8000 Endian-neutrality · Issue #328 · hfst/hfst · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Endian-neutrality #328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
TinoDidriksen opened this issue Jun 15, 2016 · 19 comments
Open

Endian-neutrality #328

TinoDidriksen opened this issue Jun 15, 2016 · 19 comments
Assignees

Comments

@TinoDidriksen
Copy link
Member

HFST's binary formats are not endian-neutral, which is a quite big issue. I'm looking into how to best fix this, which may result in a breaking change of the binary formats. I've got a PowerPC chroot that can reproduce the problem.

https://bugs.debian.org/827199
https://buildd.debian.org/status/logs.php?pkg=hfst

@TinoDidriksen
Copy link
Member Author

Having looked over the code, the binary formats are simply not portable, even across architectures of the same endianness.

@Traubert
Copy link
Contributor
Traubert commented Jul 1, 2016

Is the only additional issue besides endianness datatype size? If so, I would be prepared to write a portable mode for the optimized-lookup conversion process and ospell reading.

@TinoDidriksen
Copy link
Member Author

4 things that I've found:

  • Endianness
  • Precise width
  • Signed values
  • Floating point

One can solve endianness, precise width, and signed values together by using ZigZag + Varint encoding, similar to what Google Protocolbuffers does. Doing so prevents memory mapping, though compression also does so I doubt that's a concern.

Floating point does not have a binary portable format, so it must be decomposed/recomposed as two integers with e.g. frexp() and ldexp().

But, since this will break the binary formats everywhere anyway, Flammie had a wishlist of features to add to the format, and I'm sure others have desired fixes as well.

@TinoDidriksen
Copy link
Member Author

Started https://github.com/hfst/hfst/tree/portable branch to deal with this. Goal is to read and write a new portable 4.0 format, while retaining reading of old 3.0 format, but dropping 2.0 format.

This is a major change, because even the very first header length is written in a non-portable fashion, so it all snowballs from that point on. Changing to using 64bit varint with zigzag encoding everywhere, and decomposing floating point as needed.

@snomos
Copy link
Member
snomos commented Aug 15, 2016

About memory-mapping: that feature is crucial for effective use of hfst-ospell-based spellers on mobile phones, especially since fst size can grow beyond os-wide memory use limits in e.g. iOS. There is a fork of hfst-ospell with memory-mapping functionality at https://github.com/bbqsrc/hfst-ospell.

@TinoDidriksen TinoDidriksen self-assigned this Aug 15, 2016
@Traubert
Copy link
Contributor

I have a somewhat simpler proposal, at least for ospell's needs.

The optimized-lookup format doesn't need signed integers, and I think we could quite simply detect endianness and flip byte order on reading and writing accordingly. So in effect standardize on

  1. little-endian everything
  2. uint16 for symbols and uint32 for indices (whic 8000 h has been the case in all binary transduces so far)

I have the changes ready for the optimized-lookup part, we'd only need to support this in the HFST3 header length field. What do you think?

@Traubert
Copy link
Contributor

In effect I mean a function like

bool is_big_endian(void)
{
union {
uint32_t i;
int8_t c[4];
} to_check = {0x01000000};
return to_check.c[0] == 1;
}

and writing/reading functions that are used depending on is_big_endian() and look like

void write_uint16_flipping_endianness(uint16_t val, std::ostream& os)
{
union {
uint16_t uint;
int8_t c[2];
} to_reverse;
to_reverse.uint = val;
os.put(to_reverse.c[1]);
os.put(to_reverse.c[0]);
}

void write_uint32_flipping_endianness(uint32_t val, std::ostream& os)
{
union {
uint32_t uint;
int8_t c[4];
} to_reverse;
to_reverse.uint = val;
os.put(to_reverse.c[3]);
os.put(to_reverse.c[2]);
os.put(to_reverse.c[1]);
os.put(to_reverse.c[0]);
}

void write_bool_as_uint32(bool val, std::ostream& os)
{
os.put(val ? 1: 0);
os.put(0);
os.put(0);
os.put(0);
}

void write_float_flipping_endianness(float val, std::ostream& os)
{
union {
float f;
int8_t c[4];
} to_reverse;
to_reverse.f = val;
os.put(to_reverse.c[3]);
os.put(to_reverse.c[2]);
os.put(to_reverse.c[1]);
os.put(to_reverse.c[0]);
}

@Traubert
Copy link
Contributor

Also: I think it would be reasonable to go always-weighted.

@TinoDidriksen
Copy link
Member Author

That is exactly what I am doing, but absolutely not using full 32 bit for a bool. 8 bit is enough. Also not detecting endianness at runtime - we have autoconf AC_C_BIGENDIAN for that.

But, floating point is very dangerous to serialize as-is. Will have to check all the platforms to see if their in-memory representation is directly usable. If not, just multiply by 10000 and store as uint64_t.

@Traubert
Copy link
Contributor

The nice thing about using 32 bits for bools is that it preserves compatibility with existing binary transducers. There are 9 bools in the header, so it's only a waste of 216 bits.

I think is_big_endian() will probably be a compile-time constant, but ok.

I was under the impression that in modern processors floating point serialization like that should not be a problem, but I may be wrong..

@Traubert
Copy link
Contributor

But do you still think we need a new 4.0 format rather than just interoperate with 3.0? After all, optimized-lookup is the only binary format under our own control..

@TinoDidriksen
Copy link
Member Author

There is a desire to make the header more flexible and to make it generally easier mmap'able, plus that all existing big endian files will be thrown under the bus regardless of how minimal the changes are.
So to preserve reading of old big endian files, a new format is needed, which might as well incorporate all the other features and fixes.

As for float formats, these are the platforms we need to support at minimum: https://www.debian.org/ports/#portlist-released

@Traubert
Copy link
Contributor

Okay. When you say "the header", are you talking about the HFST3 header, the optimized-lookup header or both? Are you planning on getting rid of the separate optimized-lookup header?

@TinoDidriksen
Copy link
Member Author

The HFST3 header. The very first length is written in a dependent way, and Flammie had issues with it being limited to 16 bit.

@Traubert
Copy link
Contributor

Okay. And for the optimized-lookup header, is the only change you're planning to make bools 8-bit instead of 32-bit? Perhaps we should encode them all in two bytes if we're trying to save space?

@TinoDidriksen
Copy link
Member Author

If that makes sense, sure. I didn't even look at where the bools were coming from when I made my first pass - I just saw them being written as 32bit and changed it.

@Traubert
Copy link
Contributor

I'm just still not completely sold on the need for varints + zigzag. The only thing in the HFST3 header with binary data is the header length field, and that's unsigned. If it needs to be bigger, why not just make it bigger? Other numbers can be text-encoded in the header properties, right? It's already quite flexible in that sense.

Current HFST3 files look like:

  1. The string "HFST" (including a null byte at the end)
  2. a uint_16 for the length field
  3. another null byte
  4. an even number of utf-8 strings signifying key-value pairs, each separated by a null byte. Usually at least "version" is present, signifying I think the HFST version.

Then come implementation-specific headers, which for optimized-lookup look like

  1. a uint_16 for number of input symbols
  2. a uint_16 for number of all symbols
  3. a uint_32 for size of index table
  4. a uint_32 for size of transition target table
  5. a uint_32 for number of states (not really used)
  6. a uint_32 for number of transitions (not really used)
    11-19) boolean flags, a uint_32 each

Then comes the alphabet, which is utf-8 strings separated by null bytes, and then the tables.

Do you have a proposal for what the HFST4 header will look like? And do you see the optimized-lookup part of staying as it is apart from the bools?

@TinoDidriksen
Copy link
Member Author

Ah, part of this was discussed on IRC. varint is out, because that's not mmap-compatible. So yes, it will be a mostly minimal change that still uses uint16_t and uint32_t.

The HFST4 header would look the same, just be HFST4 + portable uint32_t length to distinguish it from native HFST3 header. That's pretty much the only way to allow loading old big endian files, since ungetc() can only push back 1 byte.

The OL header doesn't need to change, I guess, but uint16_t for symbols looks like a limitation given that Unicode is 21 bits.

@jengelh
Copy link
Contributor
jengelh commented Feb 25, 2025

Also not detecting endianness at runtime - we have autoconf AC_C_BIGENDIAN for that.

You don't even need that. https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html
e.g. current gcc+clang are smart enough to turn

uint32_t read_beu32(const uint8_t *buf)
{
        return buf[3] | (buf[2]<<8) | (buf[1]<<16) | (buf[0]<<24);
}

into e.g. an amd64 bswap instruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
0