crosscluster: monitor lagging spans by msbutler · Pull Request #134090 · cockroachdb/cockroach · GitHub

crosscluster: monitor lagging spans #134090


Merged
merged 2 commits into cockroachdb:master from butler-ldr-dest-metrics
Nov 8, 2024

Conversation

msbutler
Collaborator
@msbutler msbutler commented Nov 1, 2024

This patch teaches LDR to collect and aggregate the count of source-side ranges undergoing catchup and initial scans. In addition, this patch reports this information in the job's running status and fraction completed.

Epic: none

Release note: none

@msbutler msbutler self-assigned this Nov 1, 2024
@cockroach-teamcity
Member

This change is Reviewable

@msbutler msbutler marked this pull request as ready for review November 1, 2024 19:53
@msbutler msbutler requested review from a team as code owners November 1, 2024 19:53
@msbutler msbutler requested review from xinhaoz, azhu-crl, mw5h and dt and removed request for a team, xinhaoz, azhu-crl and mw5h November 1, 2024 19:53
@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch from 6b3077e to 0871b39 Compare November 2, 2024 21:09
@msbutler msbutler requested a review from stevendanna November 4, 2024 00:49
@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch from 0871b39 to 60bc14e Compare November 4, 2024 01:58
@msbutler
Collaborator Author
msbutler commented Nov 4, 2024

Initial scan on job page:
image

All caught up:
image

Just after a resume:
image

Catchup scans:
image

@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch 2 times, most recently from f1f4403 to 5eb1bf7 Compare November 4, 2024 02:53
@msbutler
Collaborator Author
msbutler commented Nov 4, 2024

Scanning and lagging ranges metrics. Because the source only polls once a minute, any initial or catchup scan that takes less than a minute (like this one) will look slightly off.
image

@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch 2 times, most recently from bf4b58d to a49629e Compare November 5, 2024 14:44
@msbutler
Collaborator Author
msbutler commented Nov 5, 2024

unrelated unit test flake

)

type rangeStatsByProcessorID struct {
mu syncutil.Mutex
Collaborator


Is the mutex required here? Are rows and producer metas produced in parallel?

Collaborator Author


I don't know if they can be produced in parallel, but I thought better safe than sorry, especially because these APIs aren't called too often.

@@ -329,6 +329,8 @@ message RemoteProducerMetadata {
(gogoproto.customname) = "FlowID",
(gogoproto.customtype) = "FlowID"];
optional bool drained = 9 [(gogoproto.nullable) = false];
// ProcessorID is the ID of the processor that published the metadata.
Collaborator


nit / (why isn't there a formatter for this 😢 ): this comment is indented with tabs when it should be spaces

Collaborator Author


i can't tell you how many times i've tried to coax vscode to use spaces instead of tabs.

@@ -532,13 +552,31 @@ func (rh *rowHandler) handleRow(ctx context.Context, row tree.Datums) error {
HighWater: &replicatedTime,
}
}
- progress.RunningStatus = fmt.Sprintf("logical replication running: %s", replicatedTime.GoTime())
+ progress.RunningStatus = status
if fractionCompleted > 0 {
Collaborator
@jeffswenson jeffswenson Nov 7, 2024


I think we only want to show the progress bar if 0 < fractionCompleted < 1. Currently, this is always overwriting the high watermark, so we will no longer show the high watermark when everything is caught up and advancing.

Collaborator Author


I like that idea

Collaborator Author
@msbutler msbutler Nov 7, 2024


if fraction completed is 0, the status is now "all %d ranges are caught up"

@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch from a49629e to ab29769 Compare November 7, 2024 21:13
@msbutler msbutler added the do-not-merge label Nov 7, 2024
@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch 4 times, most recently from 0be0071 to 655565a Compare November 8, 2024 15:10
jeffswenson and others added 2 commits November 8, 2024 10:11
This patch teaches LDR to collect and aggregate the count of source-side
ranges undergoing catchup and initial scans. In addition, this patch reports
this information in the job's running status and fraction completed.

Epic: none

Release note: none
Epic: none

Release note: this patch adds the following LDR metrics:
 - logical_replication.catchup_ranges: the number of source side ranges
   conducting catchup scans.
 - logical_replication.scanning_ranges: the number of source side ranges
   conducting initial scans.
Note that in the DB Console, these metrics are not accurate if multiple LDR jobs
are running, though equivalent labeled metrics exist for a user to
consume via Prometheus.
@msbutler msbutler force-pushed the butler-ldr-dest-metrics branch from 655565a to 5e35c20 Compare November 8, 2024 15:11
@msbutler msbutler removed the do-not-merge label Nov 8, 2024
Collaborator
@jeffswenson jeffswenson left a comment


LGTM

@msbutler
Collaborator Author
msbutler commented Nov 8, 2024

TFTR!

bors r=dt

@msbutler msbutler added the backport-24.3.x label Nov 8, 2024
@craig craig bot merged commit 87bbfe6 into cockroachdb:master Nov 8, 2024
23 checks passed
Labels
backport-24.3.x Flags PRs that need to be backported to 24.3

4 participants