[XDR] Update metrics, add DC specific metrics by Xaelias · Pull Request #41 · alicebob/asprom · GitHub
This repository was archived by the owner on Jan 10, 2023. It is now read-only.

[XDR] Update metrics, add DC specific metrics #41

Merged: alicebob merged 2 commits into alicebob:master on May 12, 2020

Conversation

Xaelias (Contributor) commented May 11, 2020

Yo!

We had a couple of XDR issues over the weekend, and it made me realize that some of the metric names have changed, some of the types are wrong, and there are a bunch of new metrics we just don't have.
So I went over the list of XDR-related metrics, and here they are.

The Aerospike Summit is happening this week too. I'll try to find out if they have an ETA on native Prometheus metrics. Depending on the answer, I'll take the time to go over the rest of the metrics to see what's missing and what needs fixing.

Alexis

DC specific metrics

❯ curl -s localhost:9146/metrics | grep xdr_dc
# HELP aerospike_xdr_dc_as_open_conn Number of open connection to the Aerospike DC.
# TYPE aerospike_xdr_dc_as_open_conn gauge
aerospike_xdr_dc_as_open_conn{dc="<DC>"} 512
# HELP aerospike_xdr_dc_as_size The cluster size of the destination Aerospike DC.
# TYPE aerospike_xdr_dc_as_size gauge
aerospike_xdr_dc_as_size{dc="<DC>"} 8
# HELP aerospike_xdr_dc_ship_attempt Number of records that have been attempted to be shipped.
# TYPE aerospike_xdr_dc_ship_attempt counter
aerospike_xdr_dc_ship_attempt{dc="<DC>"} 2.148600371e+09
# HELP aerospike_xdr_dc_ship_bytes Number of bytes shipped for this DC.
# TYPE aerospike_xdr_dc_ship_bytes counter
aerospike_xdr_dc_ship_bytes{dc="<DC>"} 3.704474200994e+12
# HELP aerospike_xdr_dc_ship_delete_success Number of delete transactions that have been successfully shipped.
# TYPE aerospike_xdr_dc_ship_delete_success counter
aerospike_xdr_dc_ship_delete_success{dc="<DC>"} 1.27875185e+08
# HELP aerospike_xdr_dc_ship_destination_error Number of errors from the remote cluster(s) while shipping records for this DC.
# TYPE aerospike_xdr_dc_ship_destination_error counter
aerospike_xdr_dc_ship_destination_error{dc="<DC>"} 7649
# HELP aerospike_xdr_dc_ship_idle_avg Average number of ms of sleep for each record being shipped.
# TYPE aerospike_xdr_dc_ship_idle_avg gauge
aerospike_xdr_dc_ship_idle_avg{dc="<DC>"} 0.039
# HELP aerospike_xdr_dc_ship_idle_avg_pct Representation in percent of total time spent for dc_ship_idle_avg.
# TYPE aerospike_xdr_dc_ship_idle_avg_pct gauge
aerospike_xdr_dc_ship_idle_avg_pct{dc="<DC>"} 0
# HELP aerospike_xdr_dc_ship_inflight_objects Number of records that are inflight.
# TYPE aerospike_xdr_dc_ship_inflight_objects gauge
aerospike_xdr_dc_ship_inflight_objects{dc="<DC>"} 0
# HELP aerospike_xdr_dc_ship_latency_avg Moving average of shipping latency for the specific DC.
# TYPE aerospike_xdr_dc_ship_latency_avg gauge
aerospike_xdr_dc_ship_latency_avg{dc="<DC>"} 96
# HELP aerospike_xdr_dc_ship_source_error Number of client layer errors while shipping records for this DC.
# TYPE aerospike_xdr_dc_ship_source_error counter
aerospike_xdr_dc_ship_source_error{dc="<DC>"} 0
# HELP aerospike_xdr_dc_ship_success Number of records that have been successfully shipped.
# TYPE aerospike_xdr_dc_ship_success counter
aerospike_xdr_dc_ship_success{dc="<DC>"} 2.148592722e+09
# HELP aerospike_xdr_dc_timelag Time lag for this specific DC.
# TYPE aerospike_xdr_dc_timelag gauge
aerospike_xdr_dc_timelag{dc="<DC>"} 0
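
For anyone who wants to sanity-check these per-DC numbers outside of Prometheus, here is a minimal Python sketch (not part of asprom, which is written in Go) that parses sample lines like the ones above and subtracts `ship_success` from `ship_attempt` per `dc` label. The helper names `parse_samples` and `unshipped_by_dc` are hypothetical, and the parser assumes simple lines without timestamps or escaped quotes in label values.

```python
import re

# One sample line of the Prometheus text format, e.g.
#   aerospike_xdr_dc_timelag{dc="<DC>"} 0
# Assumption: no trailing timestamp and no escaped quotes inside label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping # HELP / # TYPE lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line (e.g. a shell prompt)
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def unshipped_by_dc(samples):
    """Hypothetical helper: ship_attempt minus ship_success for each dc label."""
    attempts, successes = {}, {}
    for name, labels, value in samples:
        dc = labels.get("dc")
        if name == "aerospike_xdr_dc_ship_attempt":
            attempts[dc] = value
        elif name == "aerospike_xdr_dc_ship_success":
            successes[dc] = value
    return {dc: attempts[dc] - successes[dc] for dc in attempts if dc in successes}


# Two samples copied from the output above.
SAMPLE_TEXT = '''\
# TYPE aerospike_xdr_dc_ship_attempt counter
aerospike_xdr_dc_ship_attempt{dc="<DC>"} 2.148600371e+09
# TYPE aerospike_xdr_dc_ship_success counter
aerospike_xdr_dc_ship_success{dc="<DC>"} 2.148592722e+09
'''

print(unshipped_by_dc(parse_samples(SAMPLE_TEXT)))  # {'<DC>': 7649.0}
```

In this particular sample the difference, 7649, is the same value reported by `aerospike_xdr_dc_ship_destination_error`; that is an observation about this one scrape, not a claim about how the server defines these counters.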

Other XDR metrics

❯ curl -s localhost:9146/metrics | grep xdr_ | grep -v xdr_dc_
# HELP aerospike_node_xdr_active_failed_node_sessions Number of active failed node sessions pending.
# TYPE aerospike_node_xdr_active_failed_node_sessions gauge
aerospike_node_xdr_active_failed_node_sessions 0
# HELP aerospike_node_xdr_active_link_down_sessions Number of active link down sessions pending.
# TYPE aerospike_node_xdr_active_link_down_sessions gauge
aerospike_node_xdr_active_link_down_sessions 0
# HELP aerospike_node_xdr_global_lastshiptime The minimum last ship time in millisecond (epoch) for XDR for across the cluster.
# TYPE aerospike_node_xdr_global_lastshiptime gauge
aerospike_node_xdr_global_lastshiptime 1.589236823599e+12
# HELP aerospike_node_xdr_hotkey_fetch xdr hotkey fetch
# TYPE aerospike_node_xdr_hotkey_fetch counter
aerospike_node_xdr_hotkey_fetch 1.2358708e+08
# HELP aerospike_node_xdr_hotkey_skip xdr hotkey skip
# TYPE aerospike_node_xdr_hotkey_skip counter
aerospike_node_xdr_hotkey_skip 1.67894531e+08
# HELP aerospike_node_xdr_queue_overflow_error xdr queue overflow error
# TYPE aerospike_node_xdr_queue_overflow_error counter
aerospike_node_xdr_queue_overflow_error 0
# HELP aerospike_node_xdr_read_active_avg_pct xdr read active avg pct
# TYPE aerospike_node_xdr_read_active_avg_pct gauge
aerospike_node_xdr_read_active_avg_pct 0
# HELP aerospike_node_xdr_read_error xdr read error
# TYPE aerospike_node_xdr_read_error counter
aerospike_node_xdr_read_error 1.3251562e+08
# HELP aerospike_node_xdr_read_idle_avg_pct xdr read idle avg pct
# TYPE aerospike_node_xdr_read_idle_avg_pct gauge
aerospike_node_xdr_read_idle_avg_pct 100
# HELP aerospike_node_xdr_read_latency_avg xdr read latency avg
# TYPE aerospike_node_xdr_read_latency_avg gauge
aerospike_node_xdr_read_latency_avg 0
# HELP aerospike_node_xdr_read_notfound xdr read notfound
# TYPE aerospike_node_xdr_read_notfound counter
aerospike_node_xdr_read_notfound 1.481835e+06
# HELP aerospike_node_xdr_read_reqq_used xdr read reqq used
# TYPE aerospike_node_xdr_read_reqq_used gauge
aerospike_node_xdr_read_reqq_used 0
# HELP aerospike_node_xdr_read_reqq_used_pct xdr read reqq used pct
# TYPE aerospike_node_xdr_read_reqq_used_pct gauge
aerospike_node_xdr_read_reqq_used_pct 0
# HELP aerospike_node_xdr_read_respq_used xdr read respq used
# TYPE aerospike_node_xdr_read_respq_used gauge
aerospike_node_xdr_read_respq_used 0
# HELP aerospike_node_xdr_read_success xdr read success
# TYPE aerospike_node_xdr_read_success counter
aerospike_node_xdr_read_success 2.021240944e+09
# HELP aerospike_node_xdr_read_txnq_used xdr read txnq used
# TYPE aerospike_node_xdr_read_txnq_used gauge
aerospike_node_xdr_read_txnq_used 0
# HELP aerospike_node_xdr_read_txnq_used_pct xdr read txnq used pct
# TYPE aerospike_node_xdr_read_txnq_used_pct gauge
aerospike_node_xdr_read_txnq_used_pct 0
# HELP aerospike_node_xdr_relogged_incoming Number of records relogged into this node's digest log by another node.
# TYPE aerospike_node_xdr_relogged_incoming counter
aerospike_node_xdr_relogged_incoming 1.53725338e+08
# HELP aerospike_node_xdr_relogged_outgoing Number of records relogged to another node's digest log.
# TYPE aerospike_node_xdr_relogged_outgoing counter
aerospike_node_xdr_relogged_outgoing 2.65063434e+08
# HELP aerospike_node_xdr_ship_bytes xdr ship bytes
# TYPE aerospike_node_xdr_ship_bytes counter
aerospike_node_xdr_ship_bytes 3.705335323714e+12
# HELP aerospike_node_xdr_ship_compression_avg_pct xdr ship compression avg pct
# TYPE aerospike_node_xdr_ship_compression_avg_pct gauge
aerospike_node_xdr_ship_compression_avg_pct -26.57
# HELP aerospike_node_xdr_ship_delete_success xdr ship delete success
# TYPE aerospike_node_xdr_ship_delete_success counter
aerospike_node_xdr_ship_delete_success 1.2787765e+08
# HELP aerospike_node_xdr_ship_destination_error xdr ship destination error
# TYPE aerospike_node_xdr_ship_destination_error counter
aerospike_node_xdr_ship_destination_error 7649
# HELP aerospike_node_xdr_ship_destination_permanent_error xdr ship destination permanent error
# TYPE aerospike_node_xdr_ship_destination_permanent_error counter
aerospike_node_xdr_ship_destination_permanent_error 0
# HELP aerospike_node_xdr_ship_fullrecord Number of records that did not take advantage of bin level shipping.
# TYPE aerospike_node_xdr_ship_fullrecord gauge
aerospike_node_xdr_ship_fullrecord 2.021187917e+09
# HELP aerospike_node_xdr_ship_inflight_objects xdr ship inflight objects
# TYPE aerospike_node_xdr_ship_inflight_objects gauge
aerospike_node_xdr_ship_inflight_objects 0
# HELP aerospike_node_xdr_ship_latency_avg xdr ship latency avg
# TYPE aerospike_node_xdr_ship_latency_avg gauge
aerospike_node_xdr_ship_latency_avg 95
# HELP aerospike_node_xdr_ship_outstanding_objects xdr ship outstanding objects
# TYPE aerospike_node_xdr_ship_outstanding_objects gauge
aerospike_node_xdr_ship_outstanding_objects 937
# HELP aerospike_node_xdr_ship_source_error xdr ship source error
# TYPE aerospike_node_xdr_ship_source_error counter
aerospike_node_xdr_ship_source_error 0
# HELP aerospike_node_xdr_ship_success xdr ship success
# TYPE aerospike_node_xdr_ship_success counter
aerospike_node_xdr_ship_success 2.149077709e+09
# HELP aerospike_node_xdr_throughput xdr throughput
# TYPE aerospike_node_xdr_throughput gauge
aerospike_node_xdr_throughput 705
# HELP aerospike_node_xdr_timelag xdr timelag
# TYPE aerospike_node_xdr_timelag gauge
aerospike_node_xdr_timelag 0
# HELP aerospike_node_xdr_unknown_namespace_error xdr unknown namespace error
# TYPE aerospike_node_xdr_unknown_namespace_error counter
aerospike_node_xdr_unknown_namespace_error 0
# HELP aerospike_ns_fail_xdr_forbidden fail xdr forbidden
# TYPE aerospike_ns_fail_xdr_forbidden counter
aerospike_ns_fail_xdr_forbidden{namespace="<NS1>"} 0
aerospike_ns_fail_xdr_forbidden{namespace="<NS2>"} 0
aerospike_ns_fail_xdr_forbidden{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_delete_error xdr client delete error
# TYPE aerospike_ns_xdr_client_delete_error counter
aerospike_ns_xdr_client_delete_error{namespace="<NS1>"} 0
aerospike_ns_xdr_client_delete_error{namespace="<NS2>"} 0
aerospike_ns_xdr_client_delete_error{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_delete_not_found xdr client delete not found
# TYPE aerospike_ns_xdr_client_delete_not_found counter
aerospike_ns_xdr_client_delete_not_found{namespace="<NS1>"} 0
aerospike_ns_xdr_client_delete_not_found{namespace="<NS2>"} 0
aerospike_ns_xdr_client_delete_not_found{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_delete_success xdr client delete success
# TYPE aerospike_ns_xdr_client_delete_success counter
aerospike_ns_xdr_client_delete_success{namespace="<NS1>"} 0
aerospike_ns_xdr_client_delete_success{namespace="<NS2>"} 0
aerospike_ns_xdr_client_delete_success{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_delete_timeout xdr client delete timeout
# TYPE aerospike_ns_xdr_client_delete_timeout counter
aerospike_ns_xdr_client_delete_timeout{namespace="<NS1>"} 0
aerospike_ns_xdr_client_delete_timeout{namespace="<NS2>"} 0
aerospike_ns_xdr_client_delete_timeout{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_write_error xdr client write error
# TYPE aerospike_ns_xdr_client_write_error counter
aerospike_ns_xdr_client_write_error{namespace="<NS1>"} 0
aerospike_ns_xdr_client_write_error{namespace="<NS2>"} 0
aerospike_ns_xdr_client_write_error{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_write_success xdr client write success
# TYPE aerospike_ns_xdr_client_write_success counter
aerospike_ns_xdr_client_write_success{namespace="<NS1>"} 0
aerospike_ns_xdr_client_write_success{namespace="<NS2>"} 0
aerospike_ns_xdr_client_write_success{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_client_write_timeout xdr client write timeout
# TYPE aerospike_ns_xdr_client_write_timeout counter
aerospike_ns_xdr_client_write_timeout{namespace="<NS1>"} 0
aerospike_ns_xdr_client_write_timeout{namespace="<NS2>"} 0
aerospike_ns_xdr_client_write_timeout{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_delete_error xdr from proxy delete error
# TYPE aerospike_ns_xdr_from_proxy_delete_error counter
aerospike_ns_xdr_from_proxy_delete_error{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_delete_error{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_delete_error{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_delete_not_found xdr from proxy delete not found
# TYPE aerospike_ns_xdr_from_proxy_delete_not_found counter
aerospike_ns_xdr_from_proxy_delete_not_found{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_delete_not_found{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_delete_not_found{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_delete_success xdr from proxy delete success
# TYPE aerospike_ns_xdr_from_proxy_delete_success counter
aerospike_ns_xdr_from_proxy_delete_success{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_delete_success{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_delete_success{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_delete_timeout xdr from proxy delete timeout
# TYPE aerospike_ns_xdr_from_proxy_delete_timeout counter
aerospike_ns_xdr_from_proxy_delete_timeout{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_delete_timeout{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_delete_timeout{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_write_error xdr from proxy write error
# TYPE aerospike_ns_xdr_from_proxy_write_error counter
aerospike_ns_xdr_from_proxy_write_error{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_write_error{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_write_error{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_write_success xdr from proxy write success
# TYPE aerospike_ns_xdr_from_proxy_write_success counter
aerospike_ns_xdr_from_proxy_write_success{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_write_success{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_write_success{namespace="<NS3>"} 0
# HELP aerospike_ns_xdr_from_proxy_write_timeout xdr from proxy write timeout
# TYPE aerospike_ns_xdr_from_proxy_write_timeout counter
aerospike_ns_xdr_from_proxy_write_timeout{namespace="<NS1>"} 0
aerospike_ns_xdr_from_proxy_write_timeout{namespace="<NS2>"} 0
aerospike_ns_xdr_from_proxy_write_timeout{namespace="<NS3>"} 0

❯ curl -s localhost:9146/metrics | grep dlog
# HELP aerospike_node_dlog_free_pct dlog free pct
# TYPE aerospike_node_dlog_free_pct gauge
aerospike_node_dlog_free_pct 100
# HELP aerospike_node_dlog_logged dlog logged
# TYPE aerospike_node_dlog_logged counter
aerospike_node_dlog_logged 6.836956637e+09
# HELP aerospike_node_dlog_overwritten_error dlog overwritten error
# TYPE aerospike_node_dlog_overwritten_error counter
aerospike_node_dlog_overwritten_error 0
# HELP aerospike_node_dlog_processed_link_down dlog processed link down
# TYPE aerospike_node_dlog_processed_link_down counter
aerospike_node_dlog_processed_link_down 0
# HELP aerospike_node_dlog_processed_main dlog processed main
# TYPE aerospike_node_dlog_processed_main counter
aerospike_node_dlog_processed_main 6.8369564e+09
# HELP aerospike_node_dlog_processed_replica dlog processed replica
# TYPE aerospike_node_dlog_processed_replica counter
aerospike_node_dlog_processed_replica 2.691862e+06
# HELP aerospike_node_dlog_relogged dlog relogged
# TYPE aerospike_node_dlog_relogged counter
aerospike_node_dlog_relogged 1.31656707e+08
# HELP aerospike_node_dlog_used_objects dlog used objects
# TYPE aerospike_node_dlog_used_objects gauge
aerospike_node_dlog_used_objects 713737
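
The three `curl | grep` pipelines above can also be reproduced in a few lines of Python, which may be handy in a test or a script. This is only a sketch: `grep`, `xdr_non_dc`, and `fetch_metrics` are hypothetical helper names, plain substring matching stands in for grep's regular expressions (equivalent for these literal patterns), and the default URL simply mirrors the `localhost:9146/metrics` endpoint used in the commands above.

```python
from urllib.request import urlopen


def fetch_metrics(url="http://localhost:9146/metrics"):
    """Fetch the exporter output as text (assumes the exporter is reachable at url)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def grep(lines, needle, invert=False):
    """Keep lines containing needle; with invert=True, behave like `grep -v`."""
    return [line for line in lines if (needle in line) != invert]


def xdr_non_dc(text):
    """Equivalent of: grep xdr_ | grep -v xdr_dc_"""
    return grep(grep(text.splitlines(), "xdr_"), "xdr_dc_", invert=True)


# A few lines copied from the outputs above, used instead of a live fetch.
SAMPLE_TEXT = '''\
aerospike_xdr_dc_timelag{dc="<DC>"} 0
aerospike_node_xdr_timelag 0
aerospike_ns_fail_xdr_forbidden{namespace="<NS1>"} 0
aerospike_node_dlog_free_pct 100
'''

for line in xdr_non_dc(SAMPLE_TEXT):
    print(line)
```

Against a running exporter, `xdr_non_dc(fetch_metrics())` would give the second listing, and `grep(fetch_metrics().splitlines(), "dlog")` the third.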

alicebob (Owner) left a comment

That's amazing. Thanks for the big update.

> out if they have an ETA on native prometheus metrics.
Would be nice if they managed that!

alicebob merged commit 2db4fae into alicebob:master on May 12, 2020

Xaelias (Contributor, Author) commented May 12, 2020

> Would be nice if they managed that!

Yeah no kidding :-D
It's hard to keep track of all the changes they make... Especially since at my job we don't upgrade our cluster that often.
I'll let you know when I know more :-)

Xaelias (Contributor, Author) commented May 12, 2020

Turns out it's already available in beta: https://github.com/aerospike/aerospike-monitoring
I'll give it a look.
I kinda hoped Aerospike would just expose a port with all the metrics, instead of us still having to manage a sidecar...

[EDIT] Ah, asprom was just mentioned in the session :-D Apparently the Aerospike exporter is "specifically designed to work with EE". I'm not sure what would make it not work with CE, since apparently they rely on the info commands.

[EDIT2] Seems to work fine on Aerospike CE. I also quickly talked to one of their engineers, and there shouldn't be any real blocker. The dashboards they provide might look weird in CE (lots of metrics/functionalities that are not available/relevant). But assuming their exporter works well, we could probably start pointing people to their version, provided they run a recent enough version of the server.
