[Iceberg][Converter] Equality to Position Delete Converter #471
Labels: enhancement (new feature or request), iceberg (related to Apache Iceberg catalog support), P1 (resolve if not working on P0, < 2 weeks), V2 (related to DeltaCAT V2 native catalog support)
Converter that uses the PyIceberg library and Ray cluster compute to convert Iceberg equality deletes to position deletes.
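For reference, a minimal sketch of the core conversion step for a single identifier column, assuming the data file and the equality delete file are already loaded as in-memory PyArrow tables (the actual converter streams files via PyIceberg and fans out per hash bucket on a Ray cluster); function and parameter names here are illustrative, not the converter's real API:

```python
import pyarrow as pa
import pyarrow.compute as pc

def equality_to_position_deletes(
    data_table: pa.Table,        # rows of one data file, in file order
    equality_deletes: pa.Table,  # equality delete rows
    identifier_column: str,
    data_file_path: str,
) -> pa.Table:
    """Return a (file_path, pos) position delete table for one data file."""
    # Attach the row ordinal within the data file; this becomes "pos".
    data_with_pos = data_table.append_column(
        "pos", pa.array(range(data_table.num_rows), pa.int64()))
    # Rows whose identifier value appears in the equality delete set are deleted.
    mask = pc.is_in(
        data_with_pos[identifier_column],
        value_set=equality_deletes[identifier_column].combine_chunks())
    deleted = data_with_pos.filter(mask)
    # Position deletes reference the data file path and the deleted row ordinals.
    return pa.table({
        "file_path": pa.array([data_file_path] * deleted.num_rows, pa.string()),
        "pos": deleted["pos"],
    })
```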
Initial PR merged. Tracking all remaining feature-level TODOs for the converter in this issue, for reference from future PRs:
P0. Support multiple identifier columns (column concatenation) and the corresponding memory-estimation change (see the concatenation sketch after this list).
P0. Verify that position deletes written by the converter can be read by Spark, probably in a unit test using the 2.0 Docker setup.
P0. Switch to constructing equality delete tables with Spark, probably in a unit test using the 2.0 Docker setup. Updated: Spark can't write equality deletes, so PyIceberg is used to add equality deletes for testing.
P0. Any model changes we might need for the new 2.0 storage model, e.g. only converting certain partitions, reading a "delta", etc. Updated: deprioritized to P2.
P0. Daft SHA1 hash support.
P0. Verify correct deduplication based on identifier columns when multiple records share the same key, whether in the original data files or in the equality delete files (see the dedup sketch after this list).
P0. Add test cases for two partition specs with bucket transform.
P1. Handle the PyArrow limitation that a single chunked-array chunk cannot exceed 2 GB (see the sketch after this list).
P1. Currently assuming one node can fit one hash bucket; adjust how many data files are downloaded in parallel in the convert function.
P1. Investigate PyIceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is currently not working as expected; "correct" here means the REPLACE snapshot can be read with Spark. For now, reuse the OVERWRITE snapshot committing strategy from PyIceberg.
P1. Investigate REPLACE snapshot committing using a starting sequence number to avoid conflicts. Not entirely sure we need this, since a workaround may be feasible through the internal catalog implementation, so deprioritized to P1.
P1. Support merging/compacting small position delete files.
P1. Spark position delete read performance. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files, so Spark does not scan the whole partition's position deletes into memory during merge-on-read (see the bounds sketch after this list).
P2. Any model changes we might need for the new 2.0 storage model, e.g. only converting certain partitions, reading a "delta", etc. (moved here from P0).
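For the multiple-identifier-column item (P0 above), a minimal sketch of concatenating identifier columns into a single key and of estimating the extra memory the key column adds, assuming the columns are string-castable; the separator and names are illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

SEPARATOR = "\x1f"  # assumed not to appear inside real identifier values

def concatenated_key(table: pa.Table, identifier_columns: list[str]) -> pa.ChunkedArray:
    # Cast every identifier column to string and join element-wise into one key column.
    cols = [pc.cast(table[name], pa.string()) for name in identifier_columns]
    return pc.binary_join_element_wise(*cols, SEPARATOR)

def estimated_key_bytes(table: pa.Table, identifier_columns: list[str]) -> int:
    # Rough upper bound for memory estimation: identifier column buffers
    # plus per-row separators and 32-bit string offsets.
    value_bytes = sum(table[name].nbytes for name in identifier_columns)
    per_row_overhead = (len(identifier_columns) - 1) + 4
    return value_bytes + table.num_rows * per_row_overhead
```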
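For the deduplication item (P0 above), a hedged sketch of one way to check that only the latest record per identifier value survives; the "ordinal" column (monotone in commit order, e.g. sequence number then file position) and the other names are illustrative, not the converter's actual schema:

```python
import pyarrow as pa
import pyarrow.compute as pc

def older_duplicates_as_position_deletes(
    records: pa.Table, identifier_column: str
) -> pa.Table:
    """records: one row per data-file record, with the identifier value plus
    'ordinal', 'file_path', and 'pos' columns."""
    # Per identifier value, the record with the highest ordinal survives.
    survivors = records.group_by(identifier_column).aggregate([("ordinal", "max")])
    joined = records.join(survivors, keys=identifier_column)
    older = joined.filter(pc.less(joined["ordinal"], joined["ordinal_max"]))
    # Every older duplicate becomes a position delete row.
    return older.select(["file_path", "pos"])
```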
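For the PyArrow 2 GB item (P1 above): `string`/`binary` arrays use 32-bit offsets, so operations that collapse a ChunkedArray into a single chunk can fail once a chunk would exceed ~2 GiB. One common workaround, sketched here with illustrative names, is casting affected columns to the 64-bit-offset "large" types before such operations:

```python
import pyarrow as pa

def widen_string_columns(table: pa.Table) -> pa.Table:
    """Cast string/binary columns to large_string/large_binary so no single
    chunk hits the 32-bit offset (2 GiB) limit."""
    fields = []
    for field in table.schema:
        if pa.types.is_string(field.type):
            fields.append(field.with_type(pa.large_string()))
        elif pa.types.is_binary(field.type):
            fields.append(field.with_type(pa.large_binary()))
        else:
            fields.append(field)
    return table.cast(pa.schema(fields))
```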
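For the Spark read-performance item (P1 above), a sketch of the bounds payload: when a position delete file targets exactly one data file, recording identical lower and upper bounds for the reserved `file_path` column (field id 2147483546 in the Iceberg v2 spec) lets engines prune delete files per data file during merge-on-read. Only the bounds values are shown; how they are attached to the delete-file metadata depends on the PyIceberg version in use:

```python
# Reserved field id of `file_path` in Iceberg position delete files (v2 spec).
FILE_PATH_FIELD_ID = 2147483546

def position_delete_bounds(
    target_data_file_path: str,
) -> tuple[dict[int, bytes], dict[int, bytes]]:
    encoded = target_data_file_path.encode("utf-8")
    lower_bounds = {FILE_PATH_FIELD_ID: encoded}
    upper_bounds = {FILE_PATH_FIELD_ID: encoded}  # equal bounds => exact file match
    return lower_bounds, upper_bounds
```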