Open
Description
In the ED IIIb data from Girsu, tokenization is inconsistent: the same sign sequence is split into words differently from text to text. Examples include:
- udu nita (P221436) vs. udu-nita (P010556)
- ugula ki-siki-ka (P221485) vs. ugula ki siki-ka (P221319)
- ziz2-bala-bi (P020272) vs. ziz2 bala-bi (P355602)
- lu2 esz2 gid2 (P247610) vs. lu2 esz2-gid2 (P221317) vs. lu2-esz2-gid2 (P217545)
- bar-bi gal2-me (P221708) vs. bar-bi-gal2-me (P221331)
- lu2 a kum2 (P221716) vs. lu2-a-kum2 (P221333) vs. lu2 a-kum2 (P221451)
- lu2 e2-sza3-ga-me (P020184) vs. lu2-e2-sza-ga-me (P227557)
- ki-siki-ka me (P221316) vs. ki-siki-ka-me (P221317) vs. ki siki-ka-me (P221319)
A shell script could probably enumerate more examples.
Is there a principled way to decide which tokenizations are correct and harmonize all of the spellings?
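
To make the enumeration idea concrete, here is a minimal Python sketch (Python rather than shell, purely for readability). It assumes each line of transliteration is available as a `(text_id, line)` pair; loading the actual corpus is left out. The function names `sign_sequence` and `find_variants` and the `max_words` window are hypothetical, not part of any existing tool. It only lists candidate inconsistencies by grouping word n-grams that share a sign sequence but differ in spaces versus hyphens; it does not decide which tokenization is correct.

```python
from collections import defaultdict


def sign_sequence(chunk):
    """Split a transliterated chunk into signs, ignoring whether the
    separator between two signs was a space or a hyphen."""
    return tuple(chunk.replace("-", " ").split())


def find_variants(lines, max_words=4):
    """Return sign sequences that occur with more than one tokenization.

    `lines` is an iterable of (text_id, line) pairs (an assumed input
    shape). Every run of 1..max_words space-separated words is reduced
    to its sign sequence; sequences seen with two or more distinct
    surface forms are reported together with the texts they occur in.
    """
    seen = defaultdict(lambda: defaultdict(set))  # signs -> surface -> ids
    for text_id, line in lines:
        words = line.split()
        for start in range(len(words)):
            last = min(start + max_words, len(words))
            for end in range(start + 1, last + 1):
                surface = " ".join(words[start:end])
                seen[sign_sequence(surface)][surface].add(text_id)
    return {
        signs: {surface: sorted(ids) for surface, ids in surfaces.items()}
        for signs, surfaces in seen.items()
        if len(surfaces) > 1
    }


# Sample input taken from the examples listed in this issue.
sample = [
    ("P221436", "udu nita"),
    ("P010556", "udu-nita"),
    ("P247610", "lu2 esz2 gid2"),
    ("P221317", "lu2 esz2-gid2"),
    ("P217545", "lu2-esz2-gid2"),
]

for signs, surfaces in find_variants(sample).items():
    print(" ".join(signs), "->", surfaces)
```

Because every word window is compared, a longer phrase containing an inconsistent pair is reported as well, and two genuinely different words that happen to share a sign sequence would also be flagged, so the output would still need manual review.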