Sequential disk writer by sam-herman · Pull Request #475 · datastax/jvector · GitHub

Sequential disk writer #475

Merged: 11 commits, Jun 12, 2025

Conversation

sam-herman
Collaborator

Description

The main motivation for this change is to introduce a disk writer that preserves immutability and is purely sequential.
This is essential for integration with frameworks such as Lucene and OpenSearch.

Changes

Additional changes in this PR

  • Introduce OnDiskSequentialGraphIndexWriter and abstract the GraphWriter interface
  • Fix a serialization issue where previously written nodeIds were included when calling getNodes
  • Reuse the cache for in-memory loaded layers, removing unnecessary disk seeks and the bugs caused by branching offset-calculation logic
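
As a rough illustration of what "purely sequential" can mean for a writer, here is a minimal sketch. It is not the code from this PR: the class `SequentialWriterSketch`, its methods, and the footer layout are all hypothetical. It assumes that metadata which only becomes known after the data is written goes into a footer, so bytes already written are never revisited.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical sketch, not jvector code: an append-only writer that never seeks. */
final class SequentialWriterSketch {

    /** Serializes neighbor lists front to back, then appends a footer. */
    static byte[] toBytes(int[][] neighborLists) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            for (int[] neighbors : neighborLists) {
                out.writeInt(neighbors.length);
                for (int n : neighbors) {
                    out.writeInt(n);
                }
            }
            // Assumed layout: values only known after the data is written go
            // into a footer, so nothing already written has to be patched.
            long footerOffset = out.size(); // int counter; a real writer would track a long
            out.writeInt(neighborLists.length);
            out.writeLong(footerOffset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads the footer's start offset from the last 8 bytes (big-endian). */
    static long footerOffset(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong(bytes.length - Long.BYTES);
    }
}
```

A reader of this hypothetical layout would call `footerOffset` first and then parse the record count stored at that position. Because the writer only appends, it can target any plain `OutputStream`.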

Tests

Introduce tests in TestOnDiskSequentialGraphIndexWriter and add proper log4j configuration for debug logs in tests.

Signed-off-by: Samuel Herman <sherman8915@gmail.com>
Collaborator
@marianotepper marianotepper left a comment

Looks really good! A few relatively minor comments here and there.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.zip.CRC32;

Collaborator
@marianotepper marianotepper Jun 12, 2025

I believe that this import can be removed? It seems that we do not use CRC32.

Collaborator Author

done

}
}

// write sparse levels

Collaborator

Can we merge the writing of the sparse levels and the separated features from OnDiskSequentialGraphIndexWriter and OnDiskGraphIndexWriter into functions in AbstractGraphIndexWriter? They seem to be exactly the same. Trying to avoid code repetition.

Collaborator Author

done
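
For readers following this thread, the refactoring suggested above can be sketched roughly as follows. This is not the merged code: `WriterSketch`, `HeaderFirstSketch`, `FooterLastSketch`, and `SharedWriteDemo` are hypothetical names, and the byte layout is invented for illustration. The sketch assumes the two writers differ only in where they place metadata, while the sparse-level serialization is identical and can live in the base class.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical base class holding the serialization step both writers share. */
abstract class WriterSketch {
    protected final DataOutputStream out;

    WriterSketch(OutputStream sink) {
        this.out = new DataOutputStream(sink);
    }

    /** Shared step: written once here instead of being duplicated per subclass. */
    protected final void writeSparseLevels(int[][] levels) throws IOException {
        out.writeInt(levels.length);
        for (int[] level : levels) {
            out.writeInt(level.length);
            for (int node : level) {
                out.writeInt(node);
            }
        }
    }

    /** Subclass-specific step: decides where the metadata goes. */
    abstract void write(int version, int[][] levels) throws IOException;
}

/** Writes metadata first, then the shared payload. */
final class HeaderFirstSketch extends WriterSketch {
    HeaderFirstSketch(OutputStream sink) { super(sink); }

    @Override
    void write(int version, int[][] levels) throws IOException {
        out.writeInt(version);
        writeSparseLevels(levels);
    }
}

/** Writes the shared payload first, then metadata as a footer. */
final class FooterLastSketch extends WriterSketch {
    FooterLastSketch(OutputStream sink) { super(sink); }

    @Override
    void write(int version, int[][] levels) throws IOException {
        writeSparseLevels(levels);
        out.writeInt(version);
    }
}

/** Small driver so the two variants can be compared byte for byte. */
final class SharedWriteDemo {
    static byte[] encode(boolean metadataLast, int version, int[][] levels) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WriterSketch writer = metadataLast ? new FooterLastSketch(sink) : new HeaderFirstSketch(sink);
        try {
            writer.write(version, levels);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

Both variants emit the same payload bytes; only the position of the version field differs, which is the kind of duplication the review comment is asking to remove.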

}

/**
* Builder for OnDiskGraphIndexWriter, with optional features.

Collaborator

The comment should probably reference OnDiskSequentialGraphIndexWriter

Collaborator Author

done

/**
* Builder for OnDiskGraphIndexWriter, with optional features.
*/
public static class Builder {

Collaborator

Should we merge most of this code into some abstract class? Maybe AbstractGraphIndexWriter.Builder? It looks like apart from the startOffset in OnDiskGraphIndexWriter.Builder, the rest is the same. Trying to avoid code repetition.
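
One way such a shared builder could look is sketched below, using a self-typed generic base class so that fluent calls keep returning the concrete builder type. Everything here is hypothetical (`BuilderSketch`, the `version` and `features` fields, the string-valued `build()` result); only the idea that `startOffset` is specific to one builder comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shared builder; B is the concrete builder type, W the built product. */
abstract class BuilderSketch<W, B extends BuilderSketch<W, B>> {
    protected int version = 1;
    protected final List<String> features = new ArrayList<>();

    /** Returns this as the concrete type so chained calls stay fluent. */
    protected abstract B self();

    abstract W build();

    B withVersion(int version) {
        this.version = version;
        return self();
    }

    B with(String feature) {
        features.add(feature);
        return self();
    }
}

/** Adds the one option the shared base does not have. */
final class RandomAccessBuilderSketch extends BuilderSketch<String, RandomAccessBuilderSketch> {
    private long startOffset;

    RandomAccessBuilderSketch withStartOffset(long startOffset) {
        this.startOffset = startOffset;
        return this;
    }

    @Override
    protected RandomAccessBuilderSketch self() { return this; }

    @Override
    String build() {
        // A string stands in for the real writer object in this sketch.
        return "random-access v" + version + " @" + startOffset + " " + features;
    }
}

/** Needs nothing beyond the shared options. */
final class SequentialBuilderSketch extends BuilderSketch<String, SequentialBuilderSketch> {
    @Override
    protected SequentialBuilderSketch self() { return this; }

    @Override
    String build() {
        return "sequential v" + version + " " + features;
    }
}
```

With this shape, `new RandomAccessBuilderSketch().withVersion(4).with("featureA").withStartOffset(128)` compiles because the inherited methods return the subclass type rather than the base type.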

* - Base layer max degree
* - ID upper bound
* - Number of layers
* - Layer info (size and degree for each layer)

Collaborator

Should we do something about the name of this variable in CommonHeader?
private static final int V4_MAX_LAYERS

Collaborator Author

Thinking about this one a little bit, I'm not really sure, because we haven't changed anything with max layers, so might be best to keep as is and perhaps put a comment next to it?
Open for suggestions on this one.

Collaborator
@marianotepper marianotepper Jun 12, 2025

Sounds good. We can leave it as is.

@marianotepper marianotepper self-requested a review June 12, 2025 17:45
Collaborator
@marianotepper marianotepper left a comment

LGTM. Thanks for the contribution!

@marianotepper marianotepper merged commit d0ccb32 into datastax:main Jun 12, 2025
8 checks passed
@sam-herman sam-herman deleted the sequential-disk-writer branch June 12, 2025 17:50