Benchmark time and cost of each pipeline element #691
Absolutely! This recently popped up for kwaak as well, so costs can be transparent for the user. Do you have any ideas on how to make the data available to users? I reckon there are multiple options (via tracing with llmetry being one). Especially for token usage it would be nice if that data is also actionable. For embedding models, do you mean to try out a few, record their performance, and compare?
As an aside, if you're playing locally, the default model in fastembed (bge-small) is very fast and works quite alright for smallish codebases (<20k, haven't tested with more). I also recently added jina; it's a bit bigger, but smaller than the other mainstream general-purpose models.
Possibly lifecycle hooks on the streams? Emit a tracing event with the details at the end of processing each stream, as well as for each element of the stream. Alternatively, provide a hook into the metrics crate or autometrics. Lastly, you could have a manual context-type object that collects notifications / progress info across streams. I'm unsure which approach would be best.
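For illustration, a minimal sketch of what the per-element tracing event could look like (the wrapper function, field names, and message are hypothetical, not an existing Swiftide API):

```rust
use std::time::Instant;
use tracing::info;

// Hypothetical per-element wrapper; `name` and the field names are
// illustrative only.
async fn timed<F, Fut, T>(name: &str, step: F) -> T
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let start = Instant::now();
    let out = step().await;
    // One event per processed element; a stream-level hook could emit a
    // summary event with aggregated totals in the same way.
    info!(
        element = name,
        elapsed_ms = start.elapsed().as_millis() as u64,
        "pipeline element finished"
    );
    out
}
```

Hooking something like this into the stream lifecycle would let a tracing subscriber (or an OTEL exporter) aggregate per-element timings without changing pipeline behaviour.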
Yeah. Given a fairly consistently sized set of data to work on, being able to say "My source code is about 0.2x the size of swiftide's source code, so based on how long I want this to run I should try llama3.1:7b rather than 14b", or "If I try loading my source using GPT-4o, I'll be broke by Tuesday" are both fairly useful things to know.
I haven't checked any of the embedding models yet other than FastEmbed, but at a high level I also want to be able to look at how much time is spent doing embedding vs other actions. I'm mostly playing with Ollama for now, as I have a fairly beastly M2 Max to play with. I will likely also explore the free tiers of online models (and I've used a variety of OpenAI models in the past).
Same here, guess we'll let it cook. I'll be picking up token estimation soon, as I need to handle context limits anyway. It's similar; maybe it will give some insights.
I'd argue that the quality of the returned documents is more important than the performance. RAGAS is fairly popular here. Doing exactly that for kwaak right now.
That's 100% true for production loads, but when you're experimenting with pipelines, execution time governs how many experiments I can do per hour, and hence how fast I can learn how swiftide works. Knowing that I can trade result quality for time while I'm still working out which transforms to use is pretty useful. It's also useful to see which transforms benefit most from these changes (e.g. perhaps the metadata transforms are costly, but swapping in a fast, lower-quality model doesn't massively change my query results in a given pipeline, so I can pick a small model for that task and spend my time budget on the more useful parts elsewhere).
That's interesting. Currently I do those kinds of things with feature flags in a Python notebook. That works, sort of. More data like you suggest, and easier access to it, would be very useful.
Also relates to #156
My rationale is that I want one (Rust-based) tool for discovering how I want a tool to work. Mostly I'm in exploration mode for this sort of pipeline-based thing right now, so iteration speed is valuable.
Note for my later self, or someone who is eager to contribute: looking into the metrics crate, it looks really nice. I'd like naming to be consistent with tracing where it applies, and it would also be nice if we can emit (or record) some of the same metrics in the span as well, and use the same consistent logic for otel metrics (if that ever gets a consistent API). Non-exhaustive, but I think a good start would be to measure something like the following:

- indexing:
- chatcompletion / simple prompt / embed (important that these are relatable to where they were called and also aggregate-able):
- query:
- agents:
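A rough sketch of what recording a couple of those could look like against a recent version of the metrics crate; the metric and label names below are made up for illustration, not a settled Swiftide naming scheme:

```rust
use metrics::{counter, histogram};

// Illustrative only: ideally these names would mirror the tracing span names.
fn record_chat_completion(model: &str, prompt_tokens: u64, completion_tokens: u64, seconds: f64) {
    let model = model.to_string();
    counter!("swiftide_chat_completion_prompt_tokens_total", "model" => model.clone())
        .increment(prompt_tokens);
    counter!("swiftide_chat_completion_completion_tokens_total", "model" => model.clone())
        .increment(completion_tokens);
    histogram!("swiftide_chat_completion_duration_seconds", "model" => model)
        .record(seconds);
}
```

With a recorder installed (e.g. metrics-exporter-prometheus), the same names could then be aggregated per call site and across a whole indexing run.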
Is your feature request related to a problem? Please describe.
When making a decision on which parts of a pipeline to use, time and cost are two significant key metrics that matter.
Describe the solution you'd like
Measure the time and cost (and perhaps token count) for each pipeline item - perhaps use the swiftide code base as the target?
E.g. how much does running MetadataQAText cost, how fast are the various embedding libs, etc.
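As a back-of-the-envelope illustration of the cost side, the sketch below estimates spend from token counts; the prices and token counts are placeholders, not real numbers for any model or transformer:

```rust
// Placeholder arithmetic: cost = tokens / 1M * price-per-1M-tokens, summed
// over prompt and completion tokens.
fn estimated_cost_usd(
    prompt_tokens: u64,
    completion_tokens: u64,
    usd_per_1m_prompt: f64,
    usd_per_1m_completion: f64,
) -> f64 {
    (prompt_tokens as f64 / 1_000_000.0) * usd_per_1m_prompt
        + (completion_tokens as f64 / 1_000_000.0) * usd_per_1m_completion
}

// e.g. a metadata-style transform over 5_000 chunks at ~700 prompt and ~150
// completion tokens each, with placeholder prices of $2.50 / $10 per 1M tokens:
// estimated_cost_usd(5_000 * 700, 5_000 * 150, 2.50, 10.00) ≈ $16.25
```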
Describe alternatives you've considered
Additional context
While it makes sense to experiment with models and techniques in any case, having a baseline for how much these are going to impact a pipeline before starting would be helpful. Even knowing that an Ollama-based embedding takes 10x the time of an OpenAI one might make it easier to choose.