8000 UpdateTaskQueueConfig api changes by sivagirish81 · Pull Request #604 · temporalio/api · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

UpdateTaskQueueConfig api changes #604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

sivagirish81
Copy link
@sivagirish81 sivagirish81 commented Jun 20, 2025

READ BEFORE MERGING: All PRs require approval by both Server AND SDK teams before merging! This is why the number of required approvals is "2" and not "1"--two reviewers from the same team is NOT sufficient. If your PR is not approved by someone in BOTH teams, it may be summarily reverted.

What changed?

  • Boolean field added to DescribeTaskQueue request to control sending task queue config in the response.
  • New messages for RateLimit and TaskQueueConfig
  • UpdateTaskQueueConfig api implementation: Created a UpdateTaskQueueConfigRequest and UpdateTaskQueueConfigResponse.

Why?

  • Server needs an API endpoint for update TaskQueue RateLimits.

Breaking changes

  • No.

Server PR

SDK teams before merging! This is why the number of required approvals
is "2" and not "1"--two reviewers from the same team is NOT sufficient.
If your PR is not approved by someone in BOTH teams, it may be summarily
reverted._

<!-- Describe what has changed in this PR -->
**What changed?**
+ Boolean field added to DescribeTaskQueue request to control sending task queue config in the response.
+ New messages for RateLimit and TaskQueueConfig
+ UpdateTaskQueueConfig api implementation: Created a UpdateTaskQueueConfigRequest and UpdateTaskQueueConfigResponse.

<!-- Tell your future self why have you made these changes -->
**Why?**
Server needs an API endpoint for update TaskQueue RateLimits.

<!-- Are there any breaking changes on binary or code level? -->
**Breaking changes**
Yes. New Api's are breaking changes

<!-- If this breaks the Server, please provide the Server PR to merge
right after this PR was merged. -->
**Server PR**
None at the moment.
@sivagirish81 sivagirish81 requested review from a team as code owners June 20, 2025 16:11
@CLAassistant
Copy link
CLAassistant commented Jun 20, 2025

CLA assistant check
All committers have signed the CLA.

@@ -1086,6 +1086,9 @@ message DescribeTaskQueueRequest {
// Report task reachability for the requested versions and all task types (task reachability is not reported
// per task type).
bool report_task_reachability = 10 [deprecated = true];

// Show Task Queue Config
bool show_task_queue_config = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm that task queue config is expensive to fetch therefore justifying it having to be opt-in?

Copy link
Member

Choose a reason for hiding this comment

The reason will 10000 be displayed to describe this comment to others. Learn more.

It's not expensive to fetch (it should be kept in memory always), but it could be a little large.. do we want selective returning in that case? I'm okay with always returning it to simplify things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, I have no strong opinion. I wish describe were infrequent, but with stats on it I suspect it may be a bit frequent. I am less concerned with data size than performance. Up to y'all, just wanted to toss this out there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we translating these opt-in bools in the sdk/cli so far? Is it annoying to add more? (for us or the user)

Size/performance: in some sense size affects performance since we have to serialize and send it. But we're not going to do an extra database read or rpc to get this data, fwiw

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we translating these opt-in bools in the sdk/cli so far? Is it annoying to add more?

We haven't really for any of this newer stuff. Unfortunately these APIs are being designed without cross-SDK design, but it should be just trivial booleans there too. So Go does have an old form at https://pkg.go.dev/go.temporal.io/sdk/internal#DescribeTaskQueueEnhancedOptions but I think I heard "enhanced" is being deprecated before it ever stabilized.

Is it annoying to add more?

Not really. Just opt-in bools that affect whether some data is in the response I think.

string identity = 2;

// Selects the task queue to update.
string task_queue_name = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DescribeTaskQueue we accept a full TaskQueue type, would there be value in doing so here? Or maybe it was a mistake to do so there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were a lot of historical mistakes... the main one was not to put type in TaskQueue. Kind is a mostly useless value and all these config/versioning/deployment apis don't work on sticky queues so there's not really any reason to use TaskQueue there (except on polls).

There is kind of a reason to take TaskQueue on Describe though: Describe will cause the tq to be loaded if it's not loaded, and when we load it we should have kind available to switch some behavior. And you can Describe a sticky queue. But you can't configure it, so the plain string here makes sense, even if it feels inconsistent.

(Arguably we shouldn't have had a kind enum either and gotten the sticky bit from somewhere higher level)

Comment on lines 350 to 352
// Unless modified, this is the system-defined rate limit.
RateLimitConfig queue_rate_limit = 3;
RateLimitConfig fairness_keys_rate_limit_default = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm, these two fields are never null?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll have queue_rate_limit returned with some value no matter what, but the fairness key limit can be missing.

Comment on lines 319 to 326
message RateLimit {

// Zero is a valid rate limit.
float requests_per_second = 1;

}

message ConfigMetadata {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like both of these should just be inlined into RateLimitConfig, esp. since they relate to each other.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s a use case where omitting the rate_limit field signals that we want to reset to the worker defaults. By not inlining metadata within RateLimit, we retain the flexibility to provide context while intentionally unsetting the limit. This avoids ambiguity—since 0 is a valid value for requests_per_second.

Comment on lines 331 to 332
// Can be "system", "worker" or "api".
string source = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be an enum if it has known fixed values

@@ -1113,6 +1116,8 @@ message DescribeTaskQueueResponse {
// Only returned in ENHANCED mode.
// This map contains Task Queue information for each Build ID. Empty string as key value means unversioned.
map<string, temporal.api.taskqueue.v1.TaskQueueVersionInfo> versions_info = 3 [deprecated = true];

temporal.api.taskqueue.v1.TaskQueueConfig configs = 6;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to oth 6DAF ers. Learn more.

Plural would imply repeated here, and it seems like it should be? Otherwise how are we getting different configs for different task_queue_types?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the DescribeTaskQueue we do accept a TaskQueueType and if nothing is provided we default to the TASK_QUEUE_TYPE_WORKFLOW. So at any point of time we will only be returning a config corresponding to a TaskQueueType so it will be singular itself. I have fixed this. We have depricated the enhanced version of the api so these config changes will not be affecting that flow.

@@ -1194,4 +1194,17 @@ service WorkflowService {
}
};
}

// Update taskqueue config and uncouple taskqueue settings from the worker lifecycle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment reads oddly. "...and uncouple" implies that the uncoupling happens as part of the update. Is that true?

IE: Does calling this mean that if, previously, workers had been setting a ratelimit, any future attempts by workers to do so will no longer be respected? If so, that should be spelled out in more detail here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a rate limit set by this api overrides the worker-set rate limit. If it's unset, then the worker-set limit takes effect again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yeah, comment needs way more detail about that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment.

Comment on lines 2426 to 2437
message UpdateQueueRateLimit {
// Rate limit to apply to task queue.
// It must be less than or equal to the system-level setting at the time
// of the request. If null, removes the existing rate limit.
RateLimitUpdate rate_limit = 1;
}

message UpdateFairnessKeyRateLimitDefault {
// Default rate limit to apply for all fairness keys.
// If null, removes the existing rate limit.
RateLimitUpdate rate_limit = 1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These also seem like they could be inlined unless we expect more things to go in here besides the rate limit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch is that there needs to be a way to unset the rate limit (either one or both). With this schema you can do that by providing an empty UpdateQueueRateLimit or UpdateFairnessKeyRateLimitDefault. If they're inlined you need an extra bool unset in RateLimitUpdate. Which is a valid way to do it if you prefer that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, that makes sense

@dnr
Copy link
Member
dnr commented Jun 20, 2025

Breaking changes

  • Yes. New Api's are breaking changes

New apis and messages are not breaking changes, let's not be unnecessarily scary here

@@ -2407,3 +2412,48 @@ message ListWorkersResponse {
// Next page token
bytes next_page_token = 2;
}

message UpdateTaskQueueConfigRequest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some kind of conflict-token approach to ensure updates are not clashing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good question. My initial inclination is no/add it later: for versioning/deployment stuff, the state is more complex and you might make changes that depend (implicitly) on other parts of the state. Here, we have a bag of LWW fields. It doesn't feel necessary. But if we want to add it later we can.

@sivagirish81 sivagirish81 requested a review from dnr June 25, 2025 16:54
@sivagirish81 sivagirish81 marked this pull request as draft June 25, 2025 19:34
sivagirish81 and others added 2 commits June 25, 2025 13:47
+ Only the config data set from the api is persisted the default configs w.r.t to workers are in memory.
+ Source need not be persisted in this case.
@sivagirish81 sivagirish81 requested a review from dnr June 26, 2025 00:58
@sivagirish81 sivagirish81 marked this pull request as ready for review June 26, 2025 16:22
Copy link
Member
@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing blocking from my POV

// Only populated if show_task_queue_config is set to true.
temporal.api.taskqueue.v1.TaskQueueConfig config = 6;

// Source of the update
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "the update"?

The source is a property of the effective rate limit, and there's no field here that has the effective rate limit. If you want to add one now, maybe we should add a submessage of DescribeTaskQueueResponse that has a taskqueue.RateLimit and a source?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new EffectiveRateLimit message and introduce a float to store the effective rate limit and the source of the update.

// If the overall queue rate limit is unset, the worker-set rate limit takes effect.
rpc UpdateTaskQueueConfig (UpdateTaskQueueConfigRequest) returns (UpdateTaskQueueConfigResponse) {
option (google.api.http) = {
post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, just looking at this now: how do we get the type in this url? it has just the name. Same question for DescribeTaskQueue, actually.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't get the type in this url. In case of describeTaskQueue we pass the task queue type as a query parameter and in case if the UpdateTaskQueueConfig api we will get the TaskQueueType as part of the request body. Is there any reason why we would want it as part of the url?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the url and the additional binding to add the {task_queue_type} as well in the url to be indicative of the exact resource that is being updated as part of the POST request.

@sivagirish81 sivagirish81 requested a review from dnr June 30, 2025 23:25
Comment on lines +1114 to +1125
message EffectiveRateLimit {
// The effective rate limit for the task queue.
float requests_per_second = 1;

// Source of the RateLimit Configuration,which can be one of the following values:
// - SOURCE_API: The rate limit that is set via the TaskQueueConfig api.
// - SOURCE_WORKER: The rate limit is the value set using the workerOptions in TaskQueueActivitiesPerSecond.
// - SOURCE_SYSTEM: The rate limit is the default value set by the system
temporal.api.enums.v1.RateLimitSource rate_limit_source = 2;
}

EffectiveRateLimit effective_rate_limit = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this wouldn't be considered stats or config?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For historical reasons it is kind of a different thing:

I want "config" to be a more resource-oriented value: you set/unset it, you get back what you set/unset.

But because we previously had a rate limit passed from the workers, there can be a rate limit that is set, but not part of the "config". We want to report that for informative purposes and to avoid confusion, but not in the config.

I also don't think it's "stats", it has nothing to do with the actual flow of tasks, historical or present. It's just a sort of runtime piece of data that comes from one of those three places and one of them is in effect at the moment.

Even if it was sorta "stats", it shouldn't go in temporal.api.taskqueue.v1.TaskQueueStats because that's a per-physical-queue thing (which can be aggregated) and we are/will be using that message to report per-priority-level stats, but the rate limit applies to the whole queue, above priority levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just came off a bit inconsistent because we have these other two messages that you have to opt in to viewing, are effectively-single-use messages in the taskqueue package, etc. We just a bit strange. Also, I assume that unlike the config which you opt in to, this we have decided we will always return no matter what because it won't be large and config will?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's no way to do it that's not awkward somewhere: either config has something that's not a config (or at least has a very different data flow from the rest of config), or stats has something that's not a stat, or we have this unusual "effective rate limit" thing. And yeah, config may be large, stats needs to be aggregated across partitions, but effective rate limit is a single value we can just pull from memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 We will probably be able to make it look prettier in SDKs if/when we make high level forms of this. Can you confirm these new fields are set regardless of "mode"?

Copy link
Member
@dnr dnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this now, maybe just that one rename

Comment on lines +343 to +344
RateLimit rate_limit = 1;
ConfigMetadata metadata = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is all subjective and can be case-by-case. Personally I'm much more in favor of message reuse than we do. I think a RequestMetadata message would be great! There's lots of benefits from that.

And here, I do see some benefits from a common ConfigMetadata. But of course the SDK API can inline everything if that's how you want to do it in SDKs. Why does ConfigMetadata show up in intellisense when working with SDKs? It's a proto message in a different package/module.

Comment on lines +1114 to +1125
message EffectiveRateLimit {
// The effective rate limit for the task queue.
float requests_per_second = 1;

// Source of the RateLimit Configuration,which can be one of the following values:
// - SOURCE_API: The rate limit that is set via the TaskQueueConfig api.
// - SOURCE_WORKER: The rate limit is the value set using the workerOptions in TaskQueueActivitiesPerSecond.
// - SOURCE_SYSTEM: The rate limit is the default value set by the system
temporal.api.enums.v1.RateLimitSource rate_limit_source = 2;
}

EffectiveRateLimit effective_rate_limit = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For historical reasons it is kind of a different thing:

I want "config" to be a more resource-oriented value: you set/unset it, you get back what you set/unset.

But because we previously had a rate limit passed from the workers, there can be a rate limit that is set, but not part of the "config". We want to report that for informative purposes and to avoid confusion, but not in the config.

I also don't think it's "stats", it has nothing to do with the actual flow of tasks, historical or present. It's just a sort of runtime piece of data that comes from one of those three places and one of them is in effect at the moment.

Even if it was sorta "stats", it shouldn't go in temporal.api.taskqueue.v1.TaskQueueStats because that's a per-physical-queue thing (which can be aggregated) and we are/will be using that message to report per-priority-level stats, but the rate limit applies to the whole queue, above priority levels.

// If the overall queue rate limit is unset, the worker-set rate limit takes effect.
rpc UpdateTaskQueueConfig (UpdateTaskQueueConfigRequest) returns (UpdateTaskQueueConfigResponse) {
option (google.api.http) = {
post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config/{task_queue_type}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw @cretz did you have an opinion on task queue type in URLs? In my mind it should clearly be in the URL since it's identifying the resource being modified. But why didn't we do it for DescribeTaskQueue?

Copy link
Member
@cretz cretz Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this from the URL (sorry I didn't see this before).

Suggested change
post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config/{task_queue_type}"
post: "/namespaces/{namespace}/task-queues/{task_queue_name}/update-config"

Traditionally the URLs have just been named qualifiers and if there are other qualifiers they are via body/query, even required ones (though in describe's case it defaults to workflow IIUC). Basically the URL alone isn't all things needed to qualify what resource your affecting, it's just a name for the API call on the resource (same for run ID on workflow calls), recognizing the URL may not represent a single resource like it would if we were more RESTful. If we had a single string that uniquely identified a task queue and type, that'd be worth using instead, but it's awkward to pack tuples into URLs.

GET https://temporal-api/namespaces/my-namespace/task-queues/my-task-queue?task_queue_type=TASK_QUEUE_TYPE_ACTIVITY works for describe, and here you'll have to use POST https://temporal-api/namespaces/my-namespace/task-queues/my-task-queue/update-config with the full body including type and other things. This is why it makes sense for this to be POST /namespaces/{namespace}/task-queues/{task_queue_name}/update-config instead of POST /namespaces/{namespace}/task-queues/{task_queue_name}, because we are not truly RESTful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really make sense to me to have the namespace name and task queue name in one place and the type in another. You need all three to identify the task queue. But I also don't care, so I'm fine with removing the type here.

Copy link
Member
@cretz cretz Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our URLs are not about uniquely identifying something because they can't do so well with our non-single-key resource design (same for workflows, activities, workers, etc). This is especially true with operations that sometimes have optional items to make up the key (such as some task queue and workflow calls). We just need a common base for operations on the general thing from an HTTP API simplicity POV.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the {task_queue_type} from the url as suggested.

Copy link
Member
@cretz cretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only slight request to remove the enumerate from the task queue URL for consistency, otherwise LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0