fix(response): clean up chunk text by rachlllg · Pull Request #16 · freelawproject/inception · GitHub


Merged · 1 commit merged into main from clean-up-chunk-text on Mar 25, 2025

Conversation

rachlllg (Contributor)

Clean up the chunk text in the response to remove the prefix for easier downstream processing.

Per discussion: freelawproject/courtlistener#5298 (comment)
We should clean up the chunk text in the response to remove the prefix, both for easier downstream processing and to keep the opinion text clean.
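
For context, a rough sketch of the flow this PR targets (the chunk texts and the model call below are hypothetical; only the "search_document: " prefix and the strip step come from the PR):

# Hypothetical illustration of the prefixing flow. Nomic-style embedding
# models expect a task prefix on each passage, e.g. "search_document: ",
# but the response should return the clean opinion text.
all_chunks = ["search_document: First passage.", "search_document: Second passage."]

# embeddings = model.encode(all_chunks)  # prefixed text goes to the embedding model

clean_chunks = [
    chunk.replace("search_document: ", "") for chunk in all_chunks
]
assert clean_chunks == ["First passage.", "Second passage."]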

rachlllg requested a review from mlissner on March 24, 2025 at 22:17
Comment on lines +167 to +169
clean_chunks = [
chunk.replace("search_document: ", "") for chunk in all_chunks
]
mlissner (Member)

Hm, will this use a bunch more memory where a different solution could remove search_document further upstream, so that a full copy (and manipulation) of the response wouldn't be necessary?

rachlllg (Contributor, Author)

The line right above this is where we pass all_chunks to the model to generate the embeddings, so we cannot do this upstream unless we take one of three options:

  1. We carry two copies of the chunks, one with the prefix and one without, pass the prefixed one to the model, and return the clean one as the text.
  2. We create the chunks without the prefix and add the prefix just before passing the chunks to the model, which is equivalent to this but in reverse (see the sketch below).
  3. Or we simply do not use the prefix. The nomic_ai model is designed for query-context retrieval with these prefixes, so dropping them would impact performance slightly.

I don't think either of the first two options would be any better; this list comprehension consumes CPU memory, it does not impact GPU memory.
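
For comparison, a minimal sketch of option 2 under hypothetical names (model.encode and the chunk texts are made up; this is not the inception code):

# Option 2 sketch: keep chunks clean from the start and prefix them only
# at encode time. A transient prefixed copy still exists in CPU memory,
# so this mirrors the merged approach rather than improving on it.
clean_chunks = ["First passage.", "Second passage."]  # hypothetical clean text

prefixed_chunks = [f"search_document: {chunk}" for chunk in clean_chunks]
# embeddings = model.encode(prefixed_chunks)  # hypothetical model call

# The response can return clean_chunks directly, with no strip step needed.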

mlissner (Member)
Cool. Got it. Let's stick with what you've got. Thanks.

mlissner merged commit e6908c8 into main on Mar 25, 2025
5 checks passed
mlissner deleted the clean-up-chunk-text branch on March 25, 2025 at 21:11