Optimize completion query, benchmarks, JobInsertFullMany by bgentry · Pull Request #904 · riverqueue/river · GitHub

Optimize completion query, benchmarks, JobInsertFullMany #904


Merged: 6 commits merged into master from bg-optimize-completion-query on May 18, 2025

Conversation

@bgentry (Contributor) commented May 17, 2025

As part of a recent experiment, I was paying a lot of attention to the job completion query. I did some analysis of it and figured out that this refactoring can speed it up by about 15%. It also appears to increase throughput significantly.

In the process of figuring this out I decided to start a driver benchmark suite similar to our test suites. I also needed access to JobInsertFullMany to insert many jobs in a variety of states, so I pulled that over from Pro.

Query stats

Collected stats for benchmark runs, resetting between each.

master

       total_min       |       avg_ms        | calls | query
-----------------------+---------------------+-------+-----------------------------------------
   0.24928899236666663 |  57.528229007692325 |   260 | -- name: JobSetStateIfRunningMany :many
   0.23806803375000005 |  53.699556484962415 |   266 | -- name: JobSetStateIfRunningMany :many

With this change

       total_min       |        avg_ms        | calls | query
-----------------------+----------------------+-------+-----------------------------------------
   0.20998258465000008 |    46.31968779044117 |   272 | -- name: JobSetStateIfRunningMany :many
    0.2078785631666666 |   47.066844490566034 |   265 | -- name: JobSetStateIfRunningMany :many

Benchmark

These benchmarks are completing 2000 jobs using a variety of job states to hit all clauses in the query.

On master

$ go test -benchmem -run=^$ -bench ^BenchmarkDriverRiverPgxV5$ github.com/riverqueue/river -benchtime 30s
goos: darwin
goarch: arm64
pkg: github.com/riverqueue/river
cpu: Apple M4 Pro
BenchmarkDriverRiverPgxV5/JobSetStateIfRunningMany-14        33   31044995 ns/op   4605439 B/op   53534 allocs/op
BenchmarkDriverRiverPgxV5/JobSetStateIfRunningMany-14        36   34336087 ns/op   4600520 B/op   53532 allocs/op
BenchmarkDriverRiverPgxV5/JobSetStateIfRunningMany-14       987   34195653 ns/op   4596863 B/op   53533 allocs/op

With this change

goos: darwin
goarch: arm64
pkg: github.com/riverqueue/river
cpu: Apple M4 Pro
BenchmarkDriverRiverPgxV5/JobSetStateIfRunningMany-14        36   30853128 ns/op   4595252 B/op   53530 allocs/op
BenchmarkDriverRiverPgxV5/JobSetStateIfRunningMany-14      1131   32302842 ns/op   4595874 B/op   53532 allocs/op

riverbench

The absolute numbers vary as I run this several times (as always), but the pattern is consistent: this change bumps throughput by around 8-10k jobs/sec.

On master

➜  river git:(bg-producer-completion) ✗ go run ./cmd/river bench --database-url "postgres://localhost/river_test" --num-total-jobs 1_000_000
bench: jobs worked [          0 ], inserted [    1000000 ], job/sec [        0.0 ] [0s]
May 17 14:27:43.144 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:27:44.291 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     106645 ], inserted [          0 ], job/sec [    53322.5 ] [2s]
May 17 14:27:45.349 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     112089 ], inserted [          0 ], job/sec [    56044.5 ] [2s]
May 17 14:27:46.631 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:27:47.895 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     112316 ], inserted [          0 ], job/sec [    56158.0 ] [2s]
May 17 14:27:48.977 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     102054 ], inserted [          0 ], job/sec [    51027.0 ] [2s]
May 17 14:27:50.469 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:27:51.667 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     111327 ], inserted [          0 ], job/sec [    55663.5 ] [2s]
May 17 14:27:52.696 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:27:54.175 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     114242 ], inserted [          0 ], job/sec [    57121.0 ] [2s]
May 17 14:27:55.626 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     122061 ], inserted [          0 ], job/sec [    61030.5 ] [2s]
May 17 14:27:56.911 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     108867 ], inserted [          0 ], job/sec [    54433.5 ] [2s]
May 17 14:27:58.618 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     110399 ], inserted [          0 ], job/sec [    55199.5 ] [2s]
bench: total jobs worked [    1000000 ], total jobs inserted [    1000000 ], overall job/sec [    55751.7 ], running 17.936661625s

With this change:

➜  river git:(bg-optimize-completion-query) ✗ go run ./cmd/river bench --database-url "postgres://localhost/river_test" --num-total-jobs 1_000_000
bench: jobs worked [          0 ], inserted [    1000000 ], job/sec [        0.0 ] [0s]
May 17 14:29:58.613 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:29:59.806 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     125322 ], inserted [          0 ], job/sec [    62661.0 ] [2s]
May 17 14:30:00.894 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     128115 ], inserted [          0 ], job/sec [    64057.5 ] [2s]
May 17 14:30:02.264 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
May 17 14:30:03.531 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     134544 ], inserted [          0 ], job/sec [    67272.0 ] [2s]
May 17 14:30:04.930 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     128138 ], inserted [          0 ], job/sec [    64069.0 ] [2s]
May 17 14:30:06.822 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     139629 ], inserted [          0 ], job/sec [    69814.5 ] [2s]
May 17 14:30:09.124 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [     132242 ], inserted [          0 ], job/sec [    66121.0 ] [2s]
bench: jobs worked [     138071 ], inserted [          0 ], job/sec [    69035.5 ] [2s]
May 17 14:30:12.062 WRN jobcompleter.BatchCompleter: Hit maximum backlog; completions will wait until below threshold max_backlog=20000
bench: jobs worked [      73939 ], inserted [          0 ], job/sec [    36969.5 ] [2s]
bench: total jobs worked [    1000000 ], total jobs inserted [    1000000 ], overall job/sec [    66571.6 ], running 15.021423208s

@bgentry requested a review from brandur on May 17, 2025 21:06
@bgentry force-pushed the bg-optimize-completion-query branch from 41e32e3 to 673bd71 on May 17, 2025 21:09
Comment on lines 502 to 504
func (e *Executor) JobInsertFullMany(ctx context.Context, params *riverdriver.JobInsertFullManyParams) ([]*rivertype.JobRow, error) {
// TODO: Implement this
return nil, nil
@bgentry (Contributor, Author):

I wasn't totally sure what to do here and didn't want to rabbithole on it while getting the rest of this working. Looks like I just need to loop over single inserts as is done with JobInsertFastMany?

(Contributor):

Yeah, there's no way to do batch insertions right now, so the right answer for this one is just to iterate in the driver and call into the normal JobInsertFull a bunch of times.
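The suggested shape can be sketched in a few lines. This is a simplified, self-contained stand-in, not the actual driver code: the real `riverdriver` params and `rivertype.JobRow` carry many more fields, and the real single-job insert runs SQL rather than the stub below.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the driver's param and row types.
type jobInsertFullParams struct {
	Kind  string
	State string
}

type jobRow struct {
	ID    int64
	Kind  string
	State string
}

// jobInsertFull inserts a single job. Stubbed here with a counter; the real
// version executes the normal JobInsertFull query against the database.
func jobInsertFull(nextID *int64, params *jobInsertFullParams) (*jobRow, error) {
	*nextID++
	return &jobRow{ID: *nextID, Kind: params.Kind, State: params.State}, nil
}

// jobInsertFullMany mirrors the suggested pattern: with no batch insert
// available, iterate in the driver and call the single-job insert per row.
func jobInsertFullMany(nextID *int64, many []*jobInsertFullParams) ([]*jobRow, error) {
	rows := make([]*jobRow, 0, len(many))
	for _, params := range many {
		row, err := jobInsertFull(nextID, params)
		if err != nil {
			return nil, err
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	var nextID int64
	rows, err := jobInsertFullMany(&nextID, []*jobInsertFullParams{
		{Kind: "email", State: "available"},
		{Kind: "report", State: "completed"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[0].ID, rows[1].ID)
}
```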

@bgentry force-pushed the bg-optimize-completion-query branch 2 times, most recently from f62cb06 to dbbedfe on May 18, 2025 00:09
@bgentry force-pushed the bg-optimize-completion-query branch from dbbedfe to e8c9f77 on May 18, 2025 02:13
@bgentry (Contributor, Author) commented May 18, 2025

@brandur down to one failing sqlite test that seems to be an issue with timezone conversions. You might have a better idea of what's going on there?

@brandur (Contributor) commented May 18, 2025

> @brandur down to one failing sqlite test that seems to be an issue with timezone conversions. You might have a better idea of what's going on there?

It looks like you got it right?

Usually this stems from use of time.Now() in a driver test instead of time.Now().UTC(). All the row -> rivertype helpers in drivers look like this, using UTC() all over the place:

	return &rivertype.JobRow{
		ID:           internal.ID,
		Attempt:      max(int(internal.Attempt), 0),
		AttemptedAt:  attemptedAt,
		AttemptedBy:  internal.AttemptedBy,
		CreatedAt:    internal.CreatedAt.UTC(),
		EncodedArgs:  internal.Args,
		Errors:       errors,
		FinalizedAt:  finalizedAt,
		Kind:         internal.Kind,
		MaxAttempts:  max(int(internal.MaxAttempts), 0),
		Metadata:     internal.Metadata,
		Priority:     max(int(internal.Priority), 0),
		Queue:        internal.Queue,
		ScheduledAt:  internal.ScheduledAt.UTC(),
		State:        rivertype.JobState(internal.State),
		Tags:         internal.Tags,
		UniqueKey:    internal.UniqueKey,
		UniqueStates: uniquestates.UniqueBitmaskToStates(uniqueStatesByte),
	}, nil

Postgres has timezones in its timestamps, so it can tolerate a time being put in with one timezone, keep track of that timezone, and then compare it to another time in a different zone, but SQLite is much less tolerant of all of that.

@brandur (Contributor) left a review:

Good stuff man. The new query seems easier to read anyway; fewer CTEs is always better.

@bgentry (Contributor, Author) commented May 18, 2025

> It looks like you got it right?

No, actually, it just passes in CI because CI runs on UTC :) I fixed it now, locally and in CI.

> Usually this stems from use of time.Now() in a driver test instead of time.Now().UTC(). All the row -> rivertype helpers in drivers look like this, using UTC() all over the place:

I found a handful of places in the driver that were sending timestamps to the DB without converting them to UTC, including the helpers timeString and timeStringNullable. I updated these to consistently use UTC as that appears to be what's required, though I'm wondering if there might be some missing test coverage on some of these or reasons certain ones don't need to be converted to UTC first.

@bgentry (Contributor, Author) commented May 18, 2025

@brandur I don't think there's any risk to merging here with the sqlite changes I made, but you should definitely check them out and see if I did something dumb or unnecessary, or if we're missing test coverage and I fixed real issues. I for sure had to make the change for JobInsertFullMany to get the insert to use the right timestamp on attempted_at, and then another couple tests broke in response to the timeString change, which I fixed with .UTC() additions. And then a few I added just after skimming the driver file despite no breakage on either side of the change.

@bgentry bgentry merged commit 6cd3ca9 into master May 18, 2025
10 checks passed
@bgentry bgentry deleted the bg-optimize-completion-query branch May 18, 2025 22:06
@@ -196,7 +196,7 @@ func (e *Executor) JobCancel(ctx context.Context, params *riverdriver.JobCancelP
 	return dbutil.WithTxV(ctx, e, func(ctx context.Context, execTx riverdriver.ExecutorTx) (*rivertype.JobRow, error) {
 		dbtx := templateReplaceWrapper{dbtx: e.driver.UnwrapTx(execTx), replacer: &e.driver.replacer}
 
-		cancelledAt, err := params.CancelAttemptedAt.MarshalJSON()
+		cancelledAt, err := params.CancelAttemptedAt.UTC().MarshalJSON()
(Contributor):

Hmm, the only thing is that although Postgres has a timestamptz type, that doesn't extend to jsonb fields, so if we're UTCing everything in SQLite here, then we should probably be doing the same in Postgres too I think right?

That said, I think we should've avoided UTCing everything in the first place. Most people already operate in UTC, but in case they don't, we shouldn't force them to use it.

It might be too late for Postgres in that it's possibly breaking, but given SQLite is brand new, there might still be time. I'll take a pass to see if I can vet it.

(Contributor, Author):

To recap, Postgres handles timestamptz in text mode by converting everything to UTC prior to storing it in the DB; it does this based on session-level settings. In binary mode, the client must do this conversion prior to sending the values over the wire.

In sqlite, however, there is no proper datetime type like timestamptz, though there are some helpers for comparing and manipulating appropriately-formatted strings. It looks like these do actually support timezones.

What I was seeing prior to 7129534 is a test that was failing because the time was being provided in my local UTC-5 zone, and then when returned after JobInsertFullMany it was the same time value but with a 0 offset for UTC, i.e. off by 5 hours. So somewhere along the line we were losing the timezone info and not converting correctly. If you revert that commit locally, you should see similar failures. I left it in a separate commit in this PR specifically to make that easy if you want to play around with it.

(Contributor):

Oh yeah, all that makes sense. As written, moving everything to UTC was the right move given the current framework.

I was commenting on this line in particular though, which stores a timestamp to a JSON blob (metadata -> cancel_attempted_at); in that case you just have a normal JSON string in both databases, i.e. timestamptz does not activate in Postgres.

(Contributor, Author):

Ahh, gotcha. Yeah, I think in this specific case, since that's an internal system-level timestamp, it's probably fine to keep it all UTC.

bgentry added a commit that referenced this pull request May 19, 2025
I meant to do this as part of #904 because the `riverdrivertest.go` file
is already super long, and the last thing we need is to add another
growing group of benchmarks to it.