Description
After experimenting with C-Vise and C-Reduce, I noticed that the following aspects impede reduction performance:
- Only one reduction heuristic is attempted at a time:
  - This makes the performance very sensitive to the choice and ordering of heuristics.
- A single heuristic keeps being used even when it begins finding only very small reductions:
  - While the user can intervene using a hotkey, this requires monitoring by an experienced human.
- When multiple reductions are discovered concurrently, only one is picked up:
  - This means the speed is bounded by the size of a single reduction, even though many different reductions are often discovered quickly and could typically be applied together safely.
- Some of the important heuristics rely on reformatting the input file:
  - As reformatting isn't 100% reliable, additional time has to be spent on the "sanity check", and when the reformatted input fails, the heuristic's performance suffers severely.
- Integrating new heuristics is nontrivial:
  - Tight coupling with the codebase and the need to reimplement common logic (e.g., binary search), on top of the ordering/parallelism issues mentioned above, make it almost impractical despite potential speedups for particular use cases (a new input language, lightweight grammar parsers, multi-file inputs, reducing command-line flags, etc.).
I'd like to propose the following changes to address these issues:
- Change each heuristic to only generate descriptions of reduction possibilities ("hints") instead of directly modifying the input file.
  - A hint may look like "delete bytes with indices X..Y", or "replace bytes with indices X..Y with text T", or a list of these, etc.
  - Example for the "RemoveUnusedFunction" heuristic:

    ```cpp
    // vvv hint1 vvv
    void f() { int x = 42; }
    // ^^^ hint1 ^^^
    class A {
      // vvv hint2 sub-item1 vvv
      void g();
      // ^^^ hint2 sub-item1 ^^^
    };
    // vvv hint2 sub-item2 vvv
    void A::g() { }
    // ^^^ hint2 sub-item2 ^^^
    ```
  - The actual protocol is TBD, but, for example, it could be JSON lines with items like `[{"l": 42, "r": 84}]`.
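To make this concrete, here is a sketch in Python of one possible encoding. The field layout (a hint as a list of `{"l", "r", "t"}` edits) is my assumption, extending the `l`/`r` ranges above; the real protocol is TBD.

```python
import json

# Hypothetical hint layout: a hint is a list of edits; each edit deletes
# bytes [l, r), or replaces them with the text "t" when "t" is present.
hint = [
    {"l": 42, "r": 84},                # delete bytes 42..84
    {"l": 100, "r": 120, "t": "int"},  # replace bytes 100..120 with "int"
]

# One hint per line (JSON Lines) keeps the stream trivial to produce,
# parse, and pipe between processes.
line = json.dumps(hint)
assert json.loads(line) == hint
```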
- Allow the user to specify additional hint generation tools.
  - Such a tool could be developed separately and in any language; all it needs to do is adhere to the output format described above.
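For instance, a standalone hint generator could be very small. This is a hypothetical sketch: the `{"l", "r"}` byte-range layout is an assumed protocol, and the "delete comment-only lines" heuristic is just an illustration.

```python
import json


def generate_hints(text):
    """Yield one hint per line that is purely a C++ comment, proposing to
    delete that whole line.  Each hint is a list of {"l", "r"} byte ranges
    (an assumed layout, not a fixed protocol)."""
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("//"):
            # Propose deleting bytes [offset, offset + len(line)).
            yield [{"l": offset, "r": offset + len(line)}]
        offset += len(line)


# As a standalone tool this would read the input file and print one JSON
# hint per line, e.g.:
for hint in generate_hints("// unused\nint keep;\n"):
    print(json.dumps(hint))
```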
- Implement generic logic that attempts reductions by picking subsets of hints.
  - Applying hints is a pure text-editing task (abstracted away from the reduction heuristic or language).
  - This logic should also implement the same "binary search" idea as before (but living in a single place).
  - Any additional ideas may be encapsulated here as well (e.g., adding randomness).
  - As different hints may overlap, some basic merging logic is needed (e.g., "the leftmost wins").
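A sketch of that generic logic in Python (the hint layout and function names are my assumptions): applying a hint subset is plain text editing with leftmost-wins overlap resolution, and the subset selection is a recursive halving ("binary search") over the hint list.

```python
def apply_hints(text, hints):
    """Apply a set of hints to `text`.  Each hint is a list of edits
    {"l", "r", optional "t"}.  Overlapping edits are merged by letting
    the leftmost one win (later overlapping edits are dropped)."""
    edits = sorted((e for h in hints for e in h), key=lambda e: (e["l"], e["r"]))
    pieces, pos = [], 0
    for e in edits:
        if e["l"] < pos:
            continue  # overlaps an already-applied edit: the leftmost won
        pieces.append(text[pos:e["l"]])
        pieces.append(e.get("t", ""))  # a missing "t" means plain deletion
        pos = e["r"]
    pieces.append(text[pos:])
    return "".join(pieces)


def search_hint_subset(text, hints, is_interesting):
    """Find a large subset of `hints` whose combined application keeps the
    interestingness test passing, by recursive halving of the hint list."""
    if not hints:
        return []
    if is_interesting(apply_hints(text, hints)):
        return hints  # everything can be applied at once
    if len(hints) == 1:
        return []  # this single hint breaks the test: discard it
    mid = len(hints) // 2
    left = search_hint_subset(text, hints[:mid], is_interesting)
    right = search_hint_subset(text, hints[mid:], is_interesting)
    if is_interesting(apply_hints(text, left + right)):
        return left + right
    return max(left, right, key=len)  # fall back to the better half
```

For example, with hints deleting each third of `"aaabbbccc"` and a test requiring `"bbb"` to survive, the search keeps exactly the two hints that delete `"aaa"` and `"ccc"`.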
- Whenever multiple successful reductions are discovered in parallel, attempt merging them as well (a "merge train").
  - The ideal outcome is proceeding with all of the reductions found so far.
  - As this isn't guaranteed to succeed in 100% of cases, the implementation should schedule the "merge train" job as yet another reduction attempt.
  - Any additional ideas may be encapsulated here as well (e.g., avoiding re-scheduling previously failed merge trains).
- Have an exit criterion for the merge-train growing logic.
  - I.e., once the current candidate stops growing quickly, we want to commit to it and go back to regenerating fresh hints.
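The criterion itself could be as simple as comparing the train's recent growth against a threshold. A Python sketch; the window and ratio values are arbitrary assumptions:

```python
def train_growth_stalled(train_sizes, window=3, min_growth=1.05):
    """Given the history of the best merge train's size (in applied hints
    or reduced bytes) after each scheduling round, report whether growth
    over the last `window` rounds dropped below `min_growth` -- i.e., it
    is time to commit the current train and regenerate fresh hints."""
    if len(train_sizes) <= window:
        return False  # not enough history yet
    return train_sizes[-1] < train_sizes[-1 - window] * min_growth
```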
Illustration (for visual clarity, it depicts hint generation as a preliminary step; in reality, we can start running tests on workers as soon as the first hints arrive):
```mermaid
flowchart LR;
    Input@{shape: doc} --> Heuristics
    subgraph heuristics-subgraph [ ]
        Heuristics@{shape: sm-circ}
        Hints-1@{shape: doc}
        Hints-2@{shape: doc}
        Hints-K@{shape: doc}
        Hints@{shape: docs, label: "Hints\n*(regrouped by type,\nsorted)*"}
    end
    Heuristics -->|heuristic-1| Hints-1
    Heuristics -->|heuristic-2| Hints-2
    Heuristics -->|heuristic-K| Hints-K
    Hints-1 --> Hints
    Hints-2 --> Hints
    Hints-K --> Hints
    Hints --> Scheduler@{shape: diamond, label: "**Job Scheduler**\n*(bin-srch over hints;\nform merge trains)*"}
    subgraph workers-subgraph [ ]
        Worker1@{label: "Worker-1\n*(merge&apply edits,\nrun test)*"}
        Worker2@{label: "Worker-2\n..."}
        Worker3@{label: "Worker-3\n..."}
        WorkerN@{label: "Worker-N\n..."}
    end
    Scheduler <--> Worker1
    Scheduler <--> Worker2
    Scheduler <--> Worker3
    Scheduler <--> WorkerN
    Scheduler --> |train growth slow?| Commit@{label: "Commit\n*(merge&apply biggest\ntrain's edits to Input)*"}
    Commit ----> Input
    style Scheduler stroke-width:0
```
Understandably, this sounds like a big rework of C-Vise, so I'd like to understand whether the project and community are open to this kind of change. The backup plan of starting (yet another) open-source reduction tool is always there, but I wondered whether it makes sense to avoid adding more fragmentation to this solution space and to join efforts instead.
P.S. To verify the approach and measure the impact, I implemented a quick-and-dirty prototype of these ideas. The initial numbers from my workstation (`--n=48`):
| Test-case | Size | ToT C-Vise Reduction | Hint-based C-Vise Reduction | Speed-up |
|---|---|---|---|---|
| GCC PR94937 | 8.5 MB | 210m | 30m | 7x |
| GCC PR92516 | 6.5 MB | 70m | 11m | 6x |
| Clang internal Google bug #383027690 | 4.6 MB | 150m | 6m | 25x |
| Clang internal Google bug #363816643 | 10 MB | 105m | 20m | 5x |
| Clang internal Google bug #321217557 | 97 MB | ~20000m (est.) | 220m | 90x |
| Clang internal Google bug #355835505 (C++ header modules) | 37 MB, 2638 files | n/a | 45m | n/a |
The performance improvement varies, but large inputs in particular benefit greatly from this approach. Additionally, the last row shows how the approach scales to new use cases: a multi-file input and the seamless integration of new heuristics (for reducing the number of files, reducing command-line parameters, and reducing C++ code that uses header modules).