[Iceberg][Converter] Equality to Position Delete Converter · Issue #471 · ray-project/deltacat · GitHub

[Iceberg][Converter] Equality to Position Delete Converter #471

Open
Zyiqin-Miranda opened this issue Jan 27, 2025 · 0 comments
Labels
enhancement New feature or request iceberg This issue is related to Apache Iceberg catalog support P1 Resolve if not working on P0 (< 2 weeks) V2 Related to DeltaCAT V2 native catalog support

Comments

Zyiqin-Miranda commented Jan 27, 2025

A converter that uses the PyIceberg library and Ray cluster compute to convert Iceberg equality deletes to position deletes.

Initial PR merged

Tracking all feature-level TODOs for the converter in this issue, for future PR reference:

P0. Support multiple identifier columns: column concatenation plus the corresponding memory-estimation changes.
P0. Verify that position deletes written out can be read by Spark; likely covered in a unit-test setup using the 2.0 Docker image.
P0. Switch to constructing equality delete tables with Spark in a unit test using the 2.0 Docker image. Updated: Spark can't write equality deletes, so PyIceberg is used to add equality deletes for testing.
P0. Any model changes needed for the new 2.0 storage model (e.g., converting only certain partitions, reading a “delta”, etc.). Updated: deprioritized to P2.
P0. Daft sha1 hash support.
P0. Verify correct deduplication based on identifier columns when multiple matching records exist in either the original data files or the equality delete files.
P0. Add test cases for two partition specs with bucket transform.

P1. Handle the PyArrow limit that a single chunk of a chunked array cannot exceed 2 GB.
P1. Currently assumes one node can fit one hash bucket; adjust the number of data files downloaded in parallel in the convert function.
P1. Investigate PyIceberg REPLACE snapshot committing. The self-implemented REPLACE snapshot commit is not working as expected; “correct” here means that Spark can read the REPLACE snapshot. For now, the OVERWRITE snapshot committing strategy from PyIceberg is reused.
P1. Investigate REPLACE snapshot committing using starting_sequence to avoid conflicts. Possibly unnecessary, since a workaround may be feasible through the internal catalog implementation; deprioritized to P1.
P1. Support merging/compacting small position delete files.
P1. Spark position-delete read performance. Position deletes can be correctly matched to their corresponding data files by setting lower_bounds == upper_bounds == file_path, even with multiple data files; Spark does not scan the whole partition's position deletes into memory on merge-on-read.

P2. Any model changes needed for the new 2.0 storage model (e.g., converting only certain partitions, reading a “delta”, etc.).

@pdames pdames added this to the Compaction for Iceberg milestone Mar 24, 2025
@pdames pdames added enhancement New feature or request P1 Resolve if not working on P0 (< 2 weeks) iceberg This issue is related to Apache Iceberg catalog support V2 Related to DeltaCAT V2 native catalog support labels Mar 24, 2025