Description
I recently did an audit of all places where the `@ovn_stats.timeit` decorator is used. What I found was that the only truly useful places are in:

- `WorkerNode.wait()`: After provisioning a worker node, this waits for the corresponding `Chassis` record to appear in the SBDB.
- `WorkerNode.ping_port()`: This determines how long it takes before pings are successful [1].
The rest of the timed operations essentially call `ovn-nbctl` a bunch of times. Since we use the `ovn-nbctl` daemon on the central nodes, any `ovn-nbctl` command is likely to complete near-instantly, especially write-only operations. Therefore, what we're really timing here is the round-trip time for SSH command execution, not OVN. As an example, look at the `Namespace.add_ports` graph in the comment below this one. Aside from an oddity on iteration 17, the iterations take around a tenth of a second. This is because all that's being measured here are some `ovn-nbctl` calls that add addresses to an address set and add ports to a port group.
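To make the measurement boundary concrete, here is a minimal sketch of what a `timeit`-style decorator effectively captures. This is a hypothetical simplification, not the actual `ovn_stats` implementation; the point is only that the recorded duration wraps the entire Python call, SSH round trips and all.

```python
import functools
import time


def timeit(func):
    """Hypothetical stand-in for @ovn_stats.timeit: measures wall-clock
    time around the whole decorated call, which includes SSH round trips
    and client-side Python work, not just time spent inside OVN."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            # The real code records a sample per test iteration; printing
            # is enough to show what the number actually represents.
            print(f'{func.__qualname__}: {elapsed:.3f}s')
    return wrapper
```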
The oddity at iteration 17 is interesting, but it runs afoul of a second issue with ovn-tester's timers: the timers do a poor job of illustrating the bottleneck(s). If you look at that graph, can you determine why iteration 17 took 25 seconds instead of a tenth of a second? That could be due to a network error that caused one or more SSH commands to take multiple attempts. Or it may be that the `ovn-nbctl` daemon got disconnected from the NBDB temporarily and had to reconnect, pulling down thousands of records and delaying the execution of a queued command. Or it could be something else entirely.
This problem also extends to the "useful" timers I mentioned above. When we time `WorkerNode.ping_port()`, the measurement includes the SSH connection overhead plus Python client code execution (such as multiple `datetime.now()` calls). Therefore, if we see an oddity in a graph, it's difficult to pin the blame directly on OVN. We could have just lost network connectivity between ovn-tester and the worker node, for instance.
How can we fix this? There are a few steps to take:
- Determine what measurements we actually care about. There are several things we are measuring right now that likely can just be removed.
- Ensure we have the needed tools to measure what we want. As an example, let's say we want to determine the time it takes from when a logical switch port is created to when it is installed on a node. We would probably want to take the following measurements:
  - Timestamp when ovn-tester issued the command to add the logical switch port
  - Timestamp when the logical switch port was added to the NBDB
  - Timestamp when the corresponding Port Binding was added to the SBDB
  - Timestamp when the corresponding Interface in OVS has `external_ids:ovn-installed` set
  - (Possibly) Timestamp when certain expected flows are installed

  Right now, there is nothing that can monitor the databases in real time and note the necessary timestamps; a rough sketch of one possible approach follows this list.
- Change the ovn-tester code to not have waits in the middle of tests. If we have the tools to measure times appropriately, then ovn-tester can operate by hammering OVN with commands. After the test is complete, we can then gather the statistics and log them.
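As an illustration of the kind of tooling the second bullet calls for, below is a rough, hypothetical sketch (not existing ovn-tester code) that uses the ovsdbapp library to watch the SBDB and record a timestamp whenever a Port_Binding row appears. The connection string, class names, and the idea of stashing timestamps in a dict are all assumptions made for the example; similar watchers would be needed for the NBDB and for `external_ids:ovn-installed` on OVS interfaces.

```python
import time

from ovsdbapp.backend.ovs_idl import connection
from ovsdbapp.backend.ovs_idl import event as row_event
from ovsdbapp.backend.ovs_idl import idlutils


class PortBindingCreated(row_event.RowEvent):
    """Fires whenever a Port_Binding row is inserted into the SBDB."""

    def __init__(self, timestamps):
        self.timestamps = timestamps
        super().__init__((self.ROW_CREATE,), 'Port_Binding', None)

    def run(self, event, row, old):
        # Note when the binding for this logical port showed up.
        self.timestamps[row.logical_port] = time.time()


class SbIdl(connection.OvsdbIdl):
    """SBDB IDL that forwards update notifications to registered RowEvents."""

    def __init__(self, remote, schema_helper):
        super().__init__(remote, schema_helper)
        self.notify_handler = row_event.RowEventHandler()

    def notify(self, event, row, updates=None):
        self.notify_handler.notify(event, row, updates)


def watch_port_bindings(sb_remote='tcp:127.0.0.1:6642'):
    """Start watching the SBDB; returns the connection and a dict mapping
    logical port names to the time their Port_Binding appeared."""
    timestamps = {}
    helper = idlutils.get_schema_helper(sb_remote, 'OVN_Southbound')
    helper.register_table('Port_Binding')
    idl = SbIdl(sb_remote, helper)
    idl.notify_handler.watch_event(PortBindingCreated(timestamps))
    conn = connection.Connection(idl, timeout=60)
    conn.start()  # runs the IDL main loop in a background thread
    return conn, timestamps
```

With something like this running alongside the test, ovn-tester could issue commands without waiting and later correlate the recorded timestamps against the times the commands were issued.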
[1] The usefulness of timing pings may disappear when the asyncio code is merged. The asyncio PR removes all pings that were used to determine when a port comes up. Instead, it uses `ovn-installed-ts`. The remaining use case for pings is testing ACLs, and in that case, the fact that the pings succeed is what's important, not how long it takes for the pings to start succeeding.