8000 Adding implementations for vector stores by AmoghTantradi · Pull Request #79 · lotus-data/lotus · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Adding implementations for vector stores #79

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 58 commits into
base: main
Choose a base branch
from

Conversation

AmoghTantradi
Copy link
Collaborator
@AmoghTantradi AmoghTantradi commented Jan 14, 2025

Verified tests pass locally (for those that can be run locally at least)

cur_min = len(df_idxs)
if isinstance(vs, PineconeVS):
cur_min = min(cur_min, 10000)
K = min(K, cur_min)

search_K = K
while True:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should no longer be manually post filtering for the non-faiss vector stores -- instead we should perform a filtered vector search using the VS API (most of them let you filter on an exact match predicate)

@@ -32,12 +32,17 @@ def __call__(self, col_name: str, index_dir: str) -> pd.DataFrame:
Returns:
pd.DataFrame: The DataFrame with the index directory saved.
"""
if lotus.settings.rm is None:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be good to display a warning here, telling the user to not reset the df index, getting embedding ids requires the df index is invariant so that get_vectors_from_index returns correctly

metadatas: list[Mapping[str, Union[str, int, float, bool]]] = [{"doc_id": int(i)} for i in range(len(docs_list))]

# Add documents in batches
batch_size = 100
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add a vs_upsert_batch_size setting that can be configured in our settings module, rather than hard-coding it

@abstractmethod
def search(self,
queries: pd.Series | str | Image.Image | list | NDArray[np.float64],
def __call__(self,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make sense to add an ids parameter here, so that sem ops can make filtered search calls easily rather than re-implementing vs-specific logic each time

})

# Upsert in batches of 100
batch_size = 100
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move batch-size to a settings configuration (same comment for all vs implementations)

rm_output: RMOutput = rm(queries, K)
distances = rm_output.distances
indices = rm_output.indices
vs_output: RMOutput = vs(query_vectors, K)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to add a parameter for ids in vs() and set it to the index of the df we're searching over -- otherwise we won't catch the case when the df being searched over has been filtered by the user

# VS Only Tests
################################################################################


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an important test to add is doing filtered vector search -- ie the program starts with some df, embeds/indexes the column, does any filter op (can be a structured filter), then calls a sem op that uses search over the indexed column

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0