[Iceberg][Converter] Equality to Position Delete Converter #471
Labels: enhancement (new feature or request), iceberg (related to Apache Iceberg catalog support), P1 (resolve if not working on P0, < 2 weeks), V2 (related to DeltaCAT V2 native catalog support)
Converter that uses the PyIceberg library and Ray cluster compute to convert Iceberg equality deletes to position deletes.
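For reference, a minimal sketch of the core conversion step for a single identifier column, assuming the data file and the equality delete file are already loaded as in-memory PyArrow tables (the actual converter streams files via PyIceberg and fans out per hash bucket on a Ray cluster); function and parameter names here are illustrative, not the converter's real API:

```python
import pyarrow as pa
import pyarrow.compute as pc

def equality_to_position_deletes(
    data_table: pa.Table,        # rows of one data file, in file order
    equality_deletes: pa.Table,  # equality delete rows
    identifier_column: str,
    data_file_path: str,
) -> pa.Table:
    """Return a (file_path, pos) position delete table for one data file."""
    # Attach the row ordinal within the data file; this becomes "pos".
    data_with_pos = data_table.append_column(
        "pos", pa.array(range(data_table.num_rows), pa.int64()))
    # Rows whose identifier value appears in the equality delete set are deleted.
    mask = pc.is_in(
        data_with_pos[identifier_column],
        value_set=equality_deletes[identifier_column].combine_chunks())
    deleted = data_with_pos.filter(mask)
    # Position deletes reference the data file path and the deleted row ordinals.
    return pa.table({
        "file_path": pa.array([data_file_path] * deleted.num_rows, pa.string()),
        "pos": deleted["pos"],
    })
```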
Initial PR merged. Tracking all remaining feature-level TODOs for the converter in this issue, for reference from future PRs:
P0. Support multiple identifier columns (column concatenation) and the corresponding memory-estimation change (see the concatenation sketch after this list).
P0. Verify that position deletes written by the converter can be read by Spark, probably in a unit test using the 2.0 Docker setup.
P0. Switch to constructing equality delete tables with Spark, probably in a unit test using the 2.0 Docker setup. Updated: Spark can't write equality deletes, so PyIceberg is used to add equality deletes for testing.
P0. Any model changes we might need for the new 2.0 storage model, e.g. only converting certain partitions, reading a "delta", etc. Updated: deprioritized to P2.
P0. Daft SHA1 hash support.
P0. Verify correct deduplication based on identifier columns when multiple records share the same key, whether in the original data files or in the equality delete files (see the dedup sketch after this list).
P0. Add test cases for two partition specs with bucket transform.
P1. Handle the PyArrow limitation that a single chunked-array chunk cannot exceed 2 GB (see the sketch after this list).
P1. Currently assuming one node can fit one hash bucket; adjust how many data files are downloaded in parallel in the convert function.
P1. Investigate PyIceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is currently not working as expected; "correct" here means the REPLACE snapshot can be read with Spark. For now, reuse the OVERWRITE snapshot committing strategy from PyIceberg.
P1. Investigate REPLACE snapshot committing using a starting sequence number to avoid conflicts. Not entirely sure we need this, since a workaround may be feasible through the internal catalog implementation, so deprioritized to P1.
P1. Support merging/compacting small position delete files.
P1. Spark position delete read performance. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files, so Spark does not scan the whole partition's position deletes into memory during merge-on-read (see the bounds sketch after this list).
P2. Any model changes we might need for the new 2.0 storage model, e.g. only converting certain partitions, reading a "delta", etc. (moved here from P0).
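For the multiple-identifier-column item (P0 above), a minimal sketch of concatenating identifier columns into a single key and of estimating the extra memory the key column adds, assuming the columns are string-castable; the separator and names are illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

SEPARATOR = "\x1f"  # assumed not to appear inside real identifier values

def concatenated_key(table: pa.Table, identifier_columns: list[str]) -> pa.ChunkedArray:
    # Cast every identifier column to string and join element-wise into one key column.
    cols = [pc.cast(table[name], pa.string()) for name in identifier_columns]
    return pc.binary_join_element_wise(*cols, SEPARATOR)

def estimated_key_bytes(table: pa.Table, identifier_columns: list[str]) -> int:
    # Rough upper bound for memory estimation: identifier column buffers
    # plus per-row separators and 32-bit string offsets.
    value_bytes = sum(table[name].nbytes for name in identifier_columns)
    per_row_overhead = (len(identifier_columns) - 1) + 4
    return value_bytes + table.num_rows * per_row_overhead
```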
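For the deduplication item (P0 above), a hedged sketch of one way to check that only the latest record per identifier value survives; the "ordinal" column (monotone in commit order, e.g. sequence number then file position) and the other names are illustrative, not the converter's actual schema:

```python
import pyarrow as pa
import pyarrow.compute as pc

def older_duplicates_as_position_deletes(
    records: pa.Table, identifier_column: str
) -> pa.Table:
    """records: one row per data-file record, with the identifier value plus
    'ordinal', 'file_path', and 'pos' columns."""
    # Per identifier value, the record with the highest ordinal survives.
    survivors = records.group_by(identifier_column).aggregate([("ordinal", "max")])
    joined = records.join(survivors, keys=identifier_column)
    older = joined.filter(pc.less(joined["ordinal"], joined["ordinal_max"]))
    # Every older duplicate becomes a position delete row.
    return older.select(["file_path", "pos"])
```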
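For the PyArrow 2 GB item (P1 above): `string`/`binary` arrays use 32-bit offsets, so operations that collapse a ChunkedArray into a single chunk can fail once a chunk would exceed ~2 GiB. One common workaround, sketched here with illustrative names, is casting affected columns to the 64-bit-offset "large" types before such operations:

```python
import pyarrow as pa

def widen_string_columns(table: pa.Table) -> pa.Table:
    """Cast string/binary columns to large_string/large_binary so no single
    chunk hits the 32-bit offset (2 GiB) limit."""
    fields = []
    for field in table.schema:
        if pa.types.is_string(field.type):
            fields.append(field.with_type(pa.large_string()))
        elif pa.types.is_binary(field.type):
            fields.append(field.with_type(pa.large_binary()))
        else:
            fields.append(field)
    return table.cast(pa.schema(fields))
```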
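For the Spark read-performance item (P1 above), a sketch of the bounds payload: when a position delete file targets exactly one data file, recording identical lower and upper bounds for the reserved `file_path` column (field id 2147483546 in the Iceberg v2 spec) lets engines prune delete files per data file during merge-on-read. Only the bounds values are shown; how they are attached to the delete-file metadata depends on the PyIceberg version in use:

```python
# Reserved field id of `file_path` in Iceberg position delete files (v2 spec).
FILE_PATH_FIELD_ID = 2147483546

def position_delete_bounds(
    target_data_file_path: str,
) -> tuple[dict[int, bytes], dict[int, bytes]]:
    encoded = target_data_file_path.encode("utf-8")
    lower_bounds = {FILE_PATH_FIELD_ID: encoded}
    upper_bounds = {FILE_PATH_FIELD_ID: encoded}  # equal bounds => exact file match
    return lower_bounds, upper_bounds
```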