[DRAFT] UCS-2 needs to be UTF-16 now #68

msdemlei · 2025-06-11T11:42:08Z

UCS-2 as such is basically not implemented anywhere any more. It's all UTF-16, and I say we need to acknowledge that.

Regrettably, the variable-length encoding of UTF-16 won't work for us because we need fixed lengths for the strings in VOTable BINARY2. That's why I have a TODO in here.

We could require parsers to read the UTF-16 strings and identify surrogate pairs, but that would be terrible in all ways.
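To illustrate what that would mean for a parser, here is a minimal sketch that counts code points by walking the 16-bit code units and pairing up surrogates. The function name `count_code_points_utf16be` is hypothetical, and big-endian byte order is an assumption on my part.

```python
import struct


def count_code_points_utf16be(data: bytes) -> int:
    """Count Unicode code points in UTF-16 data (big-endian assumed).

    Hypothetical helper, for illustration only.
    """
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Unpack the bytes into 16-bit code units.
    units = struct.unpack(f">{len(data) // 2}H", data)
    count = 0
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at code unit {i}")
            i += 2  # the pair encodes a single code point
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at code unit {i}")
        else:
            i += 1  # BMP character: one code unit, one code point
        count += 1
    return count
```

The point is that the number of bytes a field occupies can only be known after scanning it like this, which defeats the purpose of a fixed-length field.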

To get out of this fix, we could say that arraysize represents the encoded length rather than the number of Unicode code points. I think I'd consider that reasonable.
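A rough sketch of what "arraysize is the encoded length" could look like, assuming the length is counted in 16-bit code units, big-endian byte order, and NUL padding for short values. The helper names `utf16_units` and `encode_fixed` are hypothetical; none of this is settled text.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed for text."""
    return len(text.encode("utf-16-be")) // 2


def encode_fixed(text: str, arraysize: int) -> bytes:
    """Encode text into exactly 2 * arraysize bytes (hypothetical helper).

    arraysize counts UTF-16 code units, not code points.
    """
    encoded = text.encode("utf-16-be")
    if len(encoded) > 2 * arraysize:
        # Refuse rather than truncate: cutting could split a surrogate pair.
        raise ValueError(
            f"value needs {len(encoded) // 2} code units, arraysize is {arraysize}"
        )
    # Pad with NUL bytes (an assumption) up to the fixed field width.
    return encoded.ljust(2 * arraysize, b"\x00")
```

With this reading, `"a💩b"` is three code points but needs an arraysize of at least 4, and the field width in bytes is always `2 * arraysize`.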

Alternatively, we say "you can't have non-BMP characters in unicodeChar and hence no surrogate pairs. VOTable parsers must fail when they are asked to encode anything outside of the BMP or containing surrogate characters". Hm 💩. For clarity, let me stress that basically all emojis are outside of the BMP.
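For comparison, the BMP-only rule could be enforced with a check along these lines. This is only a sketch; the function name `check_bmp_only` and the choice of `ValueError` are my assumptions.

```python
def check_bmp_only(text: str) -> None:
    """Raise ValueError if text cannot be stored as one code unit per character.

    Hypothetical helper, for illustration only.
    """
    for index, char in enumerate(text):
        code_point = ord(char)
        if code_point > 0xFFFF:
            # Outside the BMP: would need a surrogate pair in UTF-16.
            raise ValueError(
                f"character U+{code_point:04X} at index {index} is outside the BMP"
            )
        if 0xD800 <= code_point <= 0xDFFF:
            # Lone surrogate code points are not valid characters.
            raise ValueError(
                f"surrogate U+{code_point:04X} at index {index} is not allowed"
            )
```

Under this rule `check_bmp_only("für")` passes silently, while `check_bmp_only("💩")` raises.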

See also https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf and bug #69.

msdemlei added 2 commits June 11, 2025 13:13

Preparing for WD-1.6

578f97f

Defining UCS-2 as UTF-16 now.

b93f84b

But that won't work easily as we can no longer reliably compute the length of such fields, at least not without parsing them. So, there's a TODO in here. See also https://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/unicode-notes.pdf

msdemlei mentioned this pull request Jun 11, 2025

VOTable 1.5 still mentions UCS-2 #69

Open
