Darum? Ach, warum! (German wordplay: "That's why? Oh, why!")
A set of tools to detect and help diagnose brittle verification.
A quick summary follows. Other options:
- A gentler introduction in the form of a blog post.
- A walkthrough of using DARUM on a real project.
- More technical details.
Darum analyzes Dafny's verification costs to find brittleness and helps the user diagnose it.
Dafny verifies code by internally translating it into assertions, which are then verified by the Z3 solver.
For a long time, the common advice for helping Dafny code verify successfully was to add context information to the code through manual assertions, so that Z3 would have enough information to find a solution. This works well for small projects. However, the solver can eventually have too much information and get lost following unproductive rabbit holes while searching for a solution. This causes long verification times or timeouts, which bog down development. Worse, the verification of Dafny code involves a measure of randomness, which causes Z3 to follow changing paths when searching for proofs, even for semantically unchanged Dafny code. The result is variability in verification times, possibly spanning multiple orders of magnitude. This is known as "verification brittleness".
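For illustration, here is a minimal, invented sketch (not from any real codebase) of adding such a context assertion:

```dafny
method Demo(s: seq<int>)
  requires |s| >= 3
{
  var t := s[1..];  // drop the first element
  // Manual context assertions: spell out facts about t that follow from
  // the slice, so the solver doesn't have to rediscover them downstream.
  assert |t| == |s| - 1;
  assert |t| >= 2;
  var x := t[1];    // in bounds thanks to the facts above
}
```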
Recently, Dafny has been adding features to help the user control what information reaches Z3. The problem then is that one rarely knows what allowed Z3 to find a proof in the past, or what is confusing it now. Indeed, at the Dafny level, the only things we get back are the result (verification success, failure, or timeout) and how costly it was for the solver to reach that result.
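One example of such a feature is Dafny's `{:opaque}` attribute together with the `reveal` statement (a minimal sketch with invented names; exact syntax varies across Dafny versions): an opaque function's body is hidden from the solver by default and only re-exposed where a proof actually needs it, keeping the rest of the search space small.

```dafny
function {:opaque} Sum(s: seq<int>): int {
  if |s| == 0 then 0 else s[0] + Sum(s[1..])
}

lemma SumEmpty()
  ensures Sum([]) == 0
{
  reveal Sum();  // re-expose the definition for just this proof
}
```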
At Consensys, we found that the random distribution of verification costs in bigger codebases becomes heavily multimodal. Hence, the standard statistics (average, variance) typically used to report costs obscure important information: namely, that cost variation is not smooth, but can change drastically. And because of the inherent randomness, these variations happen for no reason discernible to the user.
Darum analyzes the solver costs to find the parts of the code that show the highest variability. These are the points where applying Dafny's brittleness-control features will have the biggest effect.
Darum consists of 3 related tools:

- `dafny_measure`: a wrapper around `dafny measure-complexity` for easier management of the iterative verification process. It captures the logs and augments them with context information for easier bookkeeping:
  - The timestamp and arguments used in this verification run
  - Dafny's stdout and stderr
  - The input file's contents and a hash, to avoid confusion when comparing successive versions of the code

  Additionally, `dafny_measure` warns when some Dafny toolchain bugs are detected, like when it gets stuck (#5316) or z3 processes are leaked (#5616).
- `plot_distribution`: a tool to analyze and present the logs generated by `dafny_measure`. It:
  - Runs some tests on the verification results contained in the log
  - Scores the results heuristically for their potential for improvement
  - Presents the results in summary tables
  - Plots the most interesting results for further analysis
- `compare_distribution`: a tool to compare verification runs at different Assertion Batch (AB) granularity, i.e. with and without "Isolate Assertions" (IA) mode.
Darum's tools are written in Python and available on PyPI.
Probably the easiest way to install DARUM is using `pipx`, which should be available in all common package managers, like `brew` on macOS.
$ brew install pipx
...
$ pipx install darum
This will make DARUM's tools available as common CLI commands.
Later on, you can upgrade DARUM with
$ pipx upgrade darum
You can find a full workflow walkthrough here.
In general, the workflow will be:

- Run `dafny_measure`
- Plot the logfile with `plot_distribution` (this happens automatically when running `dafny_measure`)
- If some member looks interesting/suspicious, run `dafny_measure` again in IA mode, possibly with `--filter-symbol` to focus only on that member
- Plot the new logfile with `plot_distribution`
- And/or compare both logfiles with `compare_distribution`.
$ dafny_measure myfile.dfy
...
Generated logfile XYZ.log
$ plot_distribution XYZ.log
...
$ dafny_measure myfile.dfy --isolate-assertions --extra "--filter-symbol expensiveFunction"
...
$ compare_distribution XYZ.log -i XYZ_IA.log
...
Each of the tools offers a `--help` argument that lists the available options.
For further details about how Darum works and usage strategies, please see the file Details.md.
The default is 10, which in practice seems to work well for quick tests. Bigger numbers (100 mutations or more) are useful to get more detail on what the distribution really looks like in badly behaved code: what its modes are, and how extreme it can get.
Note that a higher number of mutations can trigger bugs in Dafny and fail midway without producing a log (Dafny issue #5316). Plan accordingly. To work around this, Darum offers some functionality to analyze multiple small logfiles, which can be more reliable than trying to generate one big logfile at once. (Darum issue #1)
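A workaround sketch: run several short measurement runs, then analyze the logs together. The repeated runs below are plain `dafny_measure` usage as shown earlier; the multi-logfile invocation of `plot_distribution` is an assumption to check against `--help`, and the logfile names are illustrative.

$ dafny_measure myfile.dfy
$ dafny_measure myfile.dfy
$ dafny_measure myfile.dfy
$ plot_distribution run1.log run2.log run3.log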
Covered in the walkthrough.
ABs are scored according to their characteristics, including the fact that a non-successful AB makes subsequent ABs in the same member unreliable.
The top N ABs are plotted. In plots with failures/OoRs (Out of Resources results), these are plotted aside for attention.
The plot starts in transparent mode to make it easier to see where bars overlap. Clicking on the legend makes the corresponding plot easier to see.
Verification results that happen rarely are especially important. Hence, the Y axis is logarithmic, to better show those single, rare results.
This type of plot compares the RUs (Resource Units) of members verified in default mode (plotted with X symbols) against the RU totals of those same members verified in IA mode (plotted as spikes). The comparisons are always member-wise: a member-wide single AB vs. the sum of a member's ABs. Note the logarithmic X axis.
Things to notice in each member:

- Ideal behavior happens when a member verifies robustly and, ideally, cheaply. In the plot, this translates to Xs clustered together (meaning stability) and much farther to the left (meaning cheaper) than the spikes, which will also be clustered together (meaning that even the DAs verify stably for this member). This is the typical situation for short, simple members.
- Wide X dispersion means brittleness in default mode. If the spikes are still clustered, the DAs are still stable.
  - Maybe Z3 is finding a proof following different paths, or the search space is just too big. Reducing it to force Z3 onto the fast path would help.
  - Notably, if some of the Xs show a higher cost than the spikes, this is a clear indication that simply breaking down the AB into smaller batches would help, by trading happenstance speed for stability.
- If even the spikes are dispersed, then even the individual DAs show brittleness.
  - It's likely that some DA can't verify by itself; the member-level AB verified successfully because it discovered extra context while proving previous DAs. Hence, it'd be helpful to make the discovered context explicit, as in the sketch after this list. Plotting the IA mode log individually will show which DA is failing.
- Robust member verification (clustered Xs) with brittle IA verification (dispersed spikes) might point to brittleness waiting to happen!
  - Lucky side-effect discoveries made while proving early DAs in the AB allow subsequent DAs to be proven. If the lucky discovery stops happening, failures suddenly appear in the AB.
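As a hypothetical sketch of making discovered context explicit (the method and names are invented for illustration, not taken from Darum's docs):

```dafny
method Pick(a: array<int>, i: int) returns (v: int)
  requires a.Length > 0
{
  var j := if i < 0 then 0 else i;
  if j >= a.Length { j := a.Length - 1; }
  // Explicit context: j is now a valid index. In default mode this fact
  // may be discovered as a side effect of earlier checks; stating it
  // keeps the array access below verifiable even in IA mode.
  assert 0 <= j < a.Length;
  v := a[j];
}
```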
Default mode distribution plots contain a table analyzing the logs. For convenience, the column "diag" in the table summarizes the situation for each row with a series of icons:
- 📊 : This element was plotted because it was among the top scores.
- ⌛️ : This element finished as Out of Resources.
- ❌ : This element explicitly failed verification.
- ❗️ : This element had both successful and failed/OoR results. Note that purely successful results are not highlighted, because they're the hoped-for default; but flip-flopping results merit extra attention. According to the Dafny developers, as long as there is a successful result, the goal should be to stabilize it to avoid the failures, rather than discarding the success because of the failures.
- ❓ : This element had a single success across all iterations. Stabilizing that success should probably be a priority.
- ⛔️ : This element was excluded from the plot on request.
IA mode distribution plots contain 2 tables. The first one is equivalent to the one just described, only applied to the individual ABs. The second table shows the total costs at the member level, but still in IA mode.
If interesting/atypical situations were detected while preparing the plots, they will be listed in a Comments section at the bottom of the plot page.
As mentioned, a member can verify stably in default mode but present ABs that are brittle or even fail in IA mode. The significance of this situation is still unclear; therefore:
- Probably there's no immediate harm in leaving the member as it is. However, you might still want to keep an eye on it, in case any small change to the member triggers brittleness.
- More importantly, it's probably best to focus on fixing problems that can first be found at the member level in standard verification mode. In contrast, starting by looking for problems in IA mode might cause you to spend effort on trouble that isn't really there. Dafny's `--filter-symbol` is a great way to avoid the temptation of fishing for unnecessary trouble in IA mode.
To minimize noise, once an AB fails in IA mode, we ignore subsequent ABs in the same iteration. In such a case, the tables will show that some ABs have fewer results than earlier ones. In the extreme case of only 1 result remaining for a given AB, it is tagged with the ❗️ icon.
Sometimes interesting mutations happen very rarely. Combined with Dafny's difficulties with running long series of mutations, this can make it hard to capture a rare mutation in a log. DARUM's multifile support will eventually help. Until then, one trick is to take note of the random seed that caused the interesting mutation (reported in Dafny's stdout and logs) and launch a new verification run starting at that seed, using the `--rseed` argument to `dafny_measure`.
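For example (the seed value below is a placeholder; use the one reported in your own log):

$ dafny_measure myfile.dfy --rseed 12345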
Since version 4.4, Dafny includes a standard library that provides pre-verified code, including helper lemmas for things like bitwise manipulation and non-linear arithmetic, which are typical verification pain points. Using this library instead of implementing one's own version can save a lot of work; and if a reimplementation is needed, the library at least offers examples of how things were implemented by Dafny's own developers.
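A hedged usage sketch, assuming Dafny 4.4 or later invoked with `--standard-libraries`; the module and lemma names below are our best guess and should be checked against the standard library documentation:

```dafny
module UsesStdlib {
  import opened Std.Arithmetic.Mul

  lemma Commute(a: int, b: int)
    ensures a * b == b * a
  {
    // Assumed helper name from Std.Arithmetic.Mul: a pre-verified
    // lemma for a nonlinear-arithmetic fact.
    LemmaMulIsCommutative(a, b);
  }
}
```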
See here.
Clone the repo to your system and install it in editable mode with pipx, poetry or similar tools.
$ cd darum
$ pipx install -e .