Expose more cgroup memory metrics #37328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Force-pushed 743ae18 to 9584700
Regression Detector

Regression Detector Results

Metrics dashboard
Baseline: d080da0
Optimization Goals: ✅ No significant changes detected

perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links
---|---|---|---|---|---|---
➖ | docker_containers_cpu | % cpu utilization | +3.70 | [+0.66, +6.73] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | +2.93 | [+0.14, +5.72] | 1 | Logs bounds checks dashboard |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +2.83 | [+1.96, +3.70] | 1 | Logs |
➖ | docker_containers_memory | memory utilization | +1.36 | [+1.28, +1.44] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | +0.75 | [+0.69, +0.81] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | +0.67 | [+0.53, +0.81] | 1 | Logs bounds checks dashboard |
➖ | file_tree | memory utilization | +0.52 | [+0.35, +0.68] | 1 | Logs |
➖ | quality_gate_idle | memory utilization | +0.32 | [+0.25, +0.39] | 1 | Logs bounds checks dashboard |
➖ | otlp_ingest_metrics | memory utilization | +0.28 | [+0.12, +0.44] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | +0.11 | [-0.51, +0.73] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.09 | [-0.51, +0.68] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | +0.09 | [-0.47, +0.64] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | +0.03 | [-0.52, +0.59] | 1 | Logs |
➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.02 | [-0.05, +0.09] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | +0.01 | [-0.22, +0.24] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | +0.01 | [-0.61, +0.63] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.28, +0.26] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.01 | [-0.61, +0.58] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.08 | [-0.67, +0.50] | 1 | Logs |
➖ | ddot_logs | memory utilization | -0.19 | [-0.29, -0.10] | 1 | Logs |
➖ | ddot_metrics | memory utilization | -0.44 | [-0.55, -0.32] | 1 | Logs |
➖ | otlp_ingest_logs | memory utilization | -0.55 | [-0.68, -0.43] | 1 | Logs |
Bounds Checks: ❌ Failed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---
❌ | quality_gate_idle_all_features | memory_usage | 9/10 | bounds checks dashboard |
✅ | docker_containers_cpu | simple_check_run | 10/10 | |
✅ | docker_containers_memory | memory_usage | 10/10 | |
✅ | docker_containers_memory | simple_check_run | 10/10 | |
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | intake_connections | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | memory_usage | 10/10 | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide that a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
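The decision rule above can be sketched as a small predicate. This is an illustrative reconstruction, not the detector's actual code; the `Experiment` struct and field names are assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the estimated performance delta for one A/B experiment.
// Field names are illustrative, not the detector's actual schema.
type Experiment struct {
	DeltaMeanPct  float64 // estimated Δ mean %
	CILow, CIHigh float64 // 90% confidence interval bounds on Δ mean %
	Erratic       bool    // configuration marks this experiment as erratic
}

// IsRegression applies the three criteria: effect size at least the
// tolerance, confidence interval excluding zero, and not marked erratic.
func IsRegression(e Experiment, tolerancePct float64) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= tolerancePct
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// docker_containers_cpu from the table above: +3.70 [+0.66, +6.73].
	e := Experiment{DeltaMeanPct: 3.70, CILow: 0.66, CIHigh: 6.73}
	fmt.Println(IsRegression(e, 5.0)) // prints false: CI excludes zero, but |Δ| < 5%
}
```

This matches the ➖ verdict in the table: the interval excludes zero, but the effect size is below the 5.00% tolerance.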
CI Pass/Fail Decision
❌ Failed. Some Quality Gates were violated.
- quality_gate_idle_all_features, bounds check memory_usage: 9/10 replicates passed. 1 replicate failed, which is > 0. Gate FAILED.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicates passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicates passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicates passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicates passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicates passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicates passed. Gate passed.
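The gate verdicts above follow a zero-tolerance rule: a gate fails as soon as any replicate fails. A minimal sketch, with an assumed function name:

```go
package main

import "fmt"

// gatePassed reports whether a quality-gate bounds check passed.
// The gate fails if the number of failed replicates is greater than zero.
func gatePassed(replicatesPassed, replicatesTotal int) bool {
	failed := replicatesTotal - replicatesPassed
	return failed == 0
}

func main() {
	fmt.Println(gatePassed(9, 10))  // prints false (quality_gate_idle_all_features memory_usage)
	fmt.Println(gatePassed(10, 10)) // prints true
}
```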
All set from docs
Static quality checks: ✅
Please find below the results from static quality gates.
Successful checks
Code itself LGTM, but I think there's perhaps more work needed on the naming and curation.
@@ -166,6 +166,18 @@ func (p *Processor) processContainer(sender sender.Sender, tags []string, contai
	p.sendMetric(sender.Rate, "container.memory.partial_stall", containerStats.Memory.PartialStallTime, tags)
	p.sendMetric(sender.MonotonicCount, "container.memory.page_faults", containerStats.Memory.Pgfault, tags)
	p.sendMetric(sender.MonotonicCount, "container.memory.major_page_faults", containerStats.Memory.Pgmajfault, tags)
	p.sendMetric(sender.Gauge, "container.memory.active_anon", containerStats.Memory.ActiveAnon, tags)
Highlighting that we don't usually just push the names from the underlying layer; there's work to do to understand whether these names make sense in other contexts or conflict with other OSes.
Typically, container.memory.working_set_refault_anon is, I think, not coherent with the definition of working_set that already exists.
I've renamed working_set_refault_* to refault_anon/refault_file. I think the working_set prefix mainly comes from the fact that refault detection is managed by the workingset detection.
For the lruvec metrics (active/inactive anon/file + unevictable), I think it's Linux specific. At least, I don't think there is an equivalent in Windows.
Looking at Windows memory metrics, the nearest equivalent would be windows_memory_modified_page_list_bytes for file_dirty. Though the doc says it's pages not actively used by the system cache, so I don't think they would be strictly equivalent.
Outside of that, I haven't found a clear equivalent among Windows metrics.
Force-pushed 9584700 to 14251c4
Force-pushed 14251c4 to 35b4422
This PR adds additional metrics reported by cgroup:
- shmem: amount of memory used by tmpfs or shm_open
- lruvec stats: active/inactive anon/file + unevictable
- cache filesystem mapped/dirty/writeback
- refault anon/file
- page table size (cgroupv2 only)

We currently only extract high-level metrics from cgroup memory metrics like RSS and cache usage. This makes it difficult to understand some OOMs, as shared memory and anonymous mmap are not covered. Plus, we don't have a lot of information regarding the state of the page cache. Those metrics will give better insight into what's going on: how much active memory is used and whether the page cache is dirty or being flushed.
Force-pushed 35b4422 to 93c5687
Hi @bonnefoa, we keep track of metrics emitted by integrations in a file called metadata.csv; in this case it would be https://github.com/DataDog/integrations-core/blob/master/container/metadata.csv . Since this PR adds new metrics, that file should be updated as well, but it's located in another repo, so a follow-up PR will be needed. Let me know if you have any questions about metadata.csv!
Approving this on behalf of Agent Integrations, but wanted to note that #container-integrations is the owner of the datadog-agent part of this integration, so they are better equipped to review this PR. There's another part in integrations-core -- added a comment about that!
Follow up of DataDog/datadog-agent#37328 which adds additional memory metrics from cgroup.
What does this PR do?
This PR adds additional metrics reported by cgroup:
- shmem: amount of memory used by tmpfs or shm_open
- lruvec stats: active/inactive anon/file + unevictable
- cache filesystem mapped/dirty/writeback
- refault anon/file
- page table size (cgroupv2 only)
Motivation
We currently only extract high-level metrics from cgroup memory metrics like RSS and cache usage. This makes it difficult to understand some OOMs, as shared memory and anonymous mmap are not covered.
Plus, we don't have a lot of information regarding the state of the page cache. Those metrics will give a better insight on what's going on: how much active memory is used and whether the page cache is dirty or being flushed.
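The counters listed above come from the cgroup memory stat file (memory.stat on cgroupv2), which is a simple `key value` text format. A minimal sketch of extracting selected fields, assuming the kernel's key names; this is illustrative, not the agent's actual parser:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryStat extracts selected counters from cgroupv2 memory.stat
// content. Each line is "key value"; lines whose key is not in wanted,
// or that don't parse, are skipped.
func parseMemoryStat(content string, wanted map[string]bool) map[string]uint64 {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 || !wanted[fields[0]] {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	// Abridged sample of a cgroupv2 memory.stat file.
	sample := "anon 1048576\nshmem 4096\nactive_anon 524288\nfile_dirty 8192\nworkingset_refault_anon 12\n"
	wanted := map[string]bool{
		"shmem": true, "active_anon": true,
		"file_dirty": true, "workingset_refault_anon": true,
	}
	stats := parseMemoryStat(sample, wanted)
	fmt.Println(stats["shmem"], stats["active_anon"]) // prints 4096 524288
}
```

In practice the agent maps such keys to metric names like container.memory.active_anon, as the diff in this PR shows.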
Describe how you validated your changes
Unit tests were updated to cover the new metrics. It was also manually verified that cgroupv1, cgroupv2, and containerd report the new metrics.

Possible Drawbacks / Trade-offs
Additional Notes