-
Notifications
You must be signed in to change notification settings - Fork 102
Adding implementations for vector stores #79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b53d8a2
to
ccd9e48
Compare
d125eda
to
23bafa5
Compare
…mentation Also note that we need to now include deleting previous collections and restarting new ones / error handling for the dimension-mismatch case
66f47e3
to
4bafdb7
Compare
cur_min = len(df_idxs) | ||
if isinstance(vs, PineconeVS): | ||
cur_min = min(cur_min, 10000) | ||
K = min(K, cur_min) | ||
|
||
search_K = K | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should no longer be manually post filtering for the non-faiss vector stores -- instead we should perform a filtered vector search using the VS API (most of them let you filter on an exact match predicate)
@@ -32,12 +32,17 @@ def __call__(self, col_name: str, index_dir: str) -> pd.DataFrame: | |||
Returns: | |||
pd.DataFrame: The DataFrame with the index directory saved. | |||
""" | |||
if lotus.settings.rm is None: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be good to display a warning here, telling the user to not reset the df index, getting embedding ids requires the df index is invariant so that get_vectors_from_index returns correctly
metadatas: list[Mapping[str, Union[str, int, float, bool]]] = [{"doc_id": int(i)} for i in range(len(docs_list))] | ||
|
||
# Add documents in batches | ||
batch_size = 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add a vs_upsert_batch_size setting that can be configured in our settings module, rather than hard-coding it
@abstractmethod | ||
def search(self, | ||
queries: pd.Series | str | Image.Image | list | NDArray[np.float64], | ||
def __call__(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might make sense to add an ids parameter here, so that sem ops can make filtered search calls easily rather than re-implementing vs-specific logic each time
}) | ||
|
||
# Upsert in batches of 100 | ||
batch_size = 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move batch-size to a settings configuration (same comment for all vs implementations)
rm_output: RMOutput = rm(queries, K) | ||
distances = rm_output.distances | ||
indices = rm_output.indices | ||
vs_output: RMOutput = vs(query_vectors, K) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to add a parameter for ids in vs() and set it to the index of the df we're searching over -- otherwise we won't catch the case when the df being searched over has been filtered by the user
# VS Only Tests | ||
################################################################################ | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an important test to add is doing filtered vector search -- ie the program starts with some df, embeds/indexes the column, does any filter op (can be a structured filter), then calls a sem op that uses search over the indexed column
Verified tests pass locally (for those that can be run locally at least)