Benchmark time and cost of each pipeline element #691
Absolutely! This recently popped up for kwaak as well, so costs can be transparent for the user. Do you have any ideas on how to make the data available to users? I reckon there are multiple options (via tracing with llmetry being one). Especially for token usage it would be nice if that data is also actionable. For embedding models, do you mean to try out a few, record their performance, and compare?
As an aside, if you're playing locally, the default model in fastembed (bge-small) is very fast and works quite alright for smallish codebases (<20k, haven't tested with more). I also recently added jina; it's a bit bigger, but smaller than the other mainstream general-purpose models.
Possibly lifecycle hooks on the streams? Emit a tracing event with the details at the end of processing each stream, as well as for each element of the stream. Alternatively, provide a hook into the metrics crate or autometrics. Lastly, you could have a manual context-type object that collects notifications / progress info across streams. I'm unsure which approach would be best.
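For illustration, a minimal sketch of what the per-element tracing event could look like (the wrapper function, field names, and message are hypothetical, not an existing Swiftide API):

```rust
use std::time::Instant;
use tracing::info;

// Hypothetical per-element wrapper; `name` and the field names are
// illustrative only.
async fn timed<F, Fut, T>(name: &str, step: F) -> T
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    let start = Instant::now();
    let out = step().await;
    // One event per processed element; a stream-level hook could emit a
    // summary event with aggregated totals in the same way.
    info!(
        element = name,
        elapsed_ms = start.elapsed().as_millis() as u64,
        "pipeline element finished"
    );
    out
}
```

Hooking something like this into the stream lifecycle would let a tracing subscriber (or an OTEL exporter) aggregate per-element timings without changing pipeline behaviour.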
Yeah. Given a fairly consistently sized set of data to work on, being able to say "My source code is about 0.2x the size of swiftide's source code, so based on how long I want this to run I should try llama3.1:7b rather than 14b", or "If I try loading my source using GPT-4o, I'll be broke by Tuesday" are both fairly useful things to know.
I haven't checked any of the embedding models yet other than FastEmbed, but at a high level I also want to be able to look at how much time is spent doing embedding vs other actions. I'm mostly playing with Ollama for now, as I have a fairly beastly M2 Max to play with. I will likely also explore the free tiers of online models (and I've used a variety of OpenAI models in the past).
Same here, guess we'll let it cook. I'll be picking up token estimation soon, as I need to handle context limits anyway. It's similar; maybe it will give some insights.
I'd argue that the quality of the returned documents is more important than the performance. RAGAS is fairly popular here. Doing exactly that for kwaak right now.
That's 100% true for production loads, but when you're experimenting with pipelines, execution time governs how many experiments I can do per hour, and hence how fast I can learn how swiftide works. Knowing that I can trade result quality for time while I'm still working out which transforms to use is pretty useful. It's also useful to see which transforms benefit most from these changes (e.g. perhaps the metadata transforms are costly, but swapping in a fast, lower-quality model doesn't massively change my query results in a given pipeline, so I can pick a small model for that task and spend my time budget on the more useful parts elsewhere).
That's interesting. Currently I do those kinds of things with feature flags in a Python notebook. That works, sort of. More data like you suggest, and easier access to it, would be very useful.
Also relates to #156
My rationale is that I want one (Rust-based) tool for discovering how I want a tool to work. Mostly I'm in exploration mode for this sort of pipeline-based thing right now, so iteration speed is valuable.
Note for my later self, or someone who is eager to contribute: looking into the metrics crate, it looks really nice. I'd like naming to be consistent with tracing where it applies, and it would also be nice if we can emit (or record) some of the same metrics in the span as well, and use the same consistent logic for otel metrics (if that ever gets a consistent API). Non-exhaustive, but I think a good start would be to measure something like the following:

- indexing:
- chatcompletion / simple prompt / embed (important that these are relatable to where they were called and also aggregate-able):
- query:
- agents:
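A rough sketch of what recording a couple of those could look like against a recent version of the metrics crate; the metric and label names below are made up for illustration, not a settled Swiftide naming scheme:

```rust
use metrics::{counter, histogram};

// Illustrative only: ideally these names would mirror the tracing span names.
fn record_chat_completion(model: &str, prompt_tokens: u64, completion_tokens: u64, seconds: f64) {
    let model = model.to_string();
    counter!("swiftide_chat_completion_prompt_tokens_total", "model" => model.clone())
        .increment(prompt_tokens);
    counter!("swiftide_chat_completion_completion_tokens_total", "model" => model.clone())
        .increment(completion_tokens);
    histogram!("swiftide_chat_completion_duration_seconds", "model" => model)
        .record(seconds);
}
```

With a recorder installed (e.g. metrics-exporter-prometheus), the same names could then be aggregated per call site and across a whole indexing run.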
Is your feature request related to a problem? Please describe.
When making a decision on which parts of a pipeline to use, time and cost are two significant key metrics that matter.
Describe the solution you'd like
Measure the time and cost (and perhaps token count) for each pipeline item - perhaps use the swiftide code base as the target?
E.g. how much does running MetadataQAText cost, how fast are the various embedding libs, etc.
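As a back-of-the-envelope illustration of the cost side, the sketch below estimates spend from token counts; the prices and token counts are placeholders, not real numbers for any model or transformer:

```rust
// Placeholder arithmetic: cost = tokens / 1M * price-per-1M-tokens, summed
// over prompt and completion tokens.
fn estimated_cost_usd(
    prompt_tokens: u64,
    completion_tokens: u64,
    usd_per_1m_prompt: f64,
    usd_per_1m_completion: f64,
) -> f64 {
    (prompt_tokens as f64 / 1_000_000.0) * usd_per_1m_prompt
        + (completion_tokens as f64 / 1_000_000.0) * usd_per_1m_completion
}

// e.g. a metadata-style transform over 5_000 chunks at ~700 prompt and ~150
// completion tokens each, with placeholder prices of $2.50 / $10 per 1M tokens:
// estimated_cost_usd(5_000 * 700, 5_000 * 150, 2.50, 10.00) ≈ $16.25
```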
Describe alternatives you've considered
Additional context
While it makes sense to experiment with models and techniques in any case, having a baseline for how much these are going to impact a pipeline before starting would be helpful. Even knowing that an Ollama-based embedding takes 10x the time of an OpenAI one might make it easier to choose.