Description
After experimenting with C-Vise and C-Reduce, I noticed that the following aspects impede reduction performance:
- Only one reduction heuristic is attempted at a time:
  - This makes the performance very sensitive to the choice and ordering of heuristics.
- A single heuristic keeps being used even when it begins finding only very small reductions:
  - While the user can intervene using a hotkey, this requires monitoring by an experienced human.
- When multiple reductions are discovered concurrently, only one is picked up:
  - This means the speed is bounded by the size of a single reduction, even though many different reductions are often discovered quickly and could typically be applied together safely.
- Some of the important heuristics rely on reformatting the input file:
  - As reformatting isn't 100% reliable, additional time has to be spent on the "sanity check", and when the reformatted input fails, the heuristic's performance suffers severely.
- Integrating new heuristics is nontrivial:
  - Tight coupling with the codebase and the need to reimplement common logic (e.g., binary search), on top of the ordering/parallelism issues mentioned above, make it almost impractical despite potential speedups for particular use cases (a new input language, lightweight grammar parsers, multi-file inputs, reducing command-line flags, etc.).
I'd like to propose the following changes to address these issues:
- Change each heuristic to only generate descriptions of reduction possibilities ("hints") instead of directly modifying the input file.
  - A hint may look like "delete bytes with indices X..Y", or "replace bytes with indices X..Y with text T", or a list of these, etc.
  - Example for the "RemoveUnusedFunction" heuristic:

    ```cpp
    // vvv hint1 vvv
    void f() { int x = 42; }
    // ^^^ hint1 ^^^
    class A {
      // vvv hint2 sub-item1 vvv
      void g();
      // ^^^ hint2 sub-item1 ^^^
    };
    // vvv hint2 sub-item2 vvv
    void A::g() { }
    // ^^^ hint2 sub-item2 ^^^
    ```
  - The actual protocol is TBD, but, for example, it could be JSON lines with items like `[{"l": 42, "r": 84}]`.
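To make this concrete, here is a sketch in Python of one possible encoding. The field layout (a hint as a list of `{"l", "r", "t"}` edits) is my assumption, extending the `l`/`r` ranges above; the real protocol is TBD.

```python
import json

# Hypothetical hint layout: a hint is a list of edits; each edit deletes
# bytes [l, r), or replaces them with the text "t" when "t" is present.
hint = [
    {"l": 42, "r": 84},                # delete bytes 42..84
    {"l": 100, "r": 120, "t": "int"},  # replace bytes 100..120 with "int"
]

# One hint per line (JSON Lines) keeps the stream trivial to produce,
# parse, and pipe between processes.
line = json.dumps(hint)
assert json.loads(line) == hint
```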
- Allow the user to specify additional hint generation tools.
  - Such a tool could be developed separately and in any language; all it needs to do is adhere to the output format described above.
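For instance, a standalone hint generator could be very small. This is a hypothetical sketch: the `{"l", "r"}` byte-range layout is an assumed protocol, and the "delete comment-only lines" heuristic is just an illustration.

```python
import json


def generate_hints(text):
    """Yield one hint per line that is purely a C++ comment, proposing to
    delete that whole line.  Each hint is a list of {"l", "r"} byte ranges
    (an assumed layout, not a fixed protocol)."""
    offset = 0
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("//"):
            # Propose deleting bytes [offset, offset + len(line)).
            yield [{"l": offset, "r": offset + len(line)}]
        offset += len(line)


# As a standalone tool this would read the input file and print one JSON
# hint per line, e.g.:
for hint in generate_hints("// unused\nint keep;\n"):
    print(json.dumps(hint))
```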
- Implement generic logic that attempts reductions by picking subsets of hints.
  - Applying hints is a pure text-editing task (abstracted away from the reduction heuristic or language).
  - This logic should also implement the same "binary search" idea as before (but living in a single place).
  - Any additional ideas may be encapsulated here as well (e.g., adding randomness).
  - As different hints may overlap, some basic merging logic is needed (e.g., "the leftmost wins").
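A sketch of that generic logic in Python (the hint layout and function names are my assumptions): applying a hint subset is plain text editing with leftmost-wins overlap resolution, and the subset selection is a recursive halving ("binary search") over the hint list.

```python
def apply_hints(text, hints):
    """Apply a set of hints to `text`.  Each hint is a list of edits
    {"l", "r", optional "t"}.  Overlapping edits are merged by letting
    the leftmost one win (later overlapping edits are dropped)."""
    edits = sorted((e for h in hints for e in h), key=lambda e: (e["l"], e["r"]))
    pieces, pos = [], 0
    for e in edits:
        if e["l"] < pos:
            continue  # overlaps an already-applied edit: the leftmost won
        pieces.append(text[pos:e["l"]])
        pieces.append(e.get("t", ""))  # a missing "t" means plain deletion
        pos = e["r"]
    pieces.append(text[pos:])
    return "".join(pieces)


def search_hint_subset(text, hints, is_interesting):
    """Find a large subset of `hints` whose combined application keeps the
    interestingness test passing, by recursive halving of the hint list."""
    if not hints:
        return []
    if is_interesting(apply_hints(text, hints)):
        return hints  # everything can be applied at once
    if len(hints) == 1:
        return []  # this single hint breaks the test: discard it
    mid = len(hints) // 2
    left = search_hint_subset(text, hints[:mid], is_interesting)
    right = search_hint_subset(text, hints[mid:], is_interesting)
    if is_interesting(apply_hints(text, left + right)):
        return left + right
    return max(left, right, key=len)  # fall back to the better half
```

For example, with hints deleting each third of `"aaabbbccc"` and a test requiring `"bbb"` to survive, the search keeps exactly the two hints that delete `"aaa"` and `"ccc"`.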
- Whenever multiple successful reductions are discovered in parallel, attempt merging them as well (a "merge train").
  - The ideal outcome is proceeding with all of the reductions found so far.
  - As this isn't guaranteed to succeed in 100% of cases, the implementation should schedule the "merge train" job as yet another reduction attempt.
  - Any additional ideas may be encapsulated here as well (e.g., avoiding re-scheduling previously failed merge trains).
- Have an exit criterion for the merge-train growing logic.
  - I.e., once the current candidate stops growing quickly, we want to commit to it and go back to regenerating fresh hints.
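The criterion itself could be as simple as comparing the train's recent growth against a threshold. A Python sketch; the window and ratio values are arbitrary assumptions:

```python
def train_growth_stalled(train_sizes, window=3, min_growth=1.05):
    """Given the history of the best merge train's size (in applied hints
    or reduced bytes) after each scheduling round, report whether growth
    over the last `window` rounds dropped below `min_growth` -- i.e., it
    is time to commit the current train and regenerate fresh hints."""
    if len(train_sizes) <= window:
        return False  # not enough history yet
    return train_sizes[-1] < train_sizes[-1 - window] * min_growth
```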
Illustration (for visual clarity, it depicts hint generation as a preliminary step; in reality, we can start running tests on workers as soon as the first hints arrive):
```mermaid
flowchart LR;
    Input@{shape: doc} --> Heuristics
    subgraph heuristics-subgraph [ ]
        Heuristics@{shape: sm-circ}
        Hints-1@{shape: doc}
        Hints-2@{shape: doc}
        Hints-K@{shape: doc}
        Hints@{shape: docs, label: "Hints\n*(regrouped by type,\nsorted)*"}
    end
    Heuristics -->|heuristic-1| Hints-1
    Heuristics -->|heuristic-2| Hints-2
    Heuristics -->|heuristic-K| Hints-K
    Hints-1 --> Hints
    Hints-2 --> Hints
    Hints-K --> Hints
    Hints --> Scheduler@{shape: diamond, label: "**Job Scheduler**\n*(bin-srch over hints;\nform merge trains)*"}
    subgraph workers-subgraph [ ]
        Worker1@{label: "Worker-1\n*(merge&apply edits,\nrun test)*"}
        Worker2@{label: "Worker-2\n..."}
        Worker3@{label: "Worker-3\n..."}
        WorkerN@{label: "Worker-N\n..."}
    end
    Scheduler <--> Worker1
    Scheduler <--> Worker2
    Scheduler <--> Worker3
    Scheduler <--> WorkerN
    Scheduler --> |train growth slow?| Commit@{label: "Commit\n*(merge&apply biggest\ntrain's edits to Input)*"}
    Commit ----> Input
    style Scheduler stroke-width:0
```
Understandably, this sounds like a big rework of C-Vise, so I'd like to understand whether the project and community are open to this kind of change. The backup plan of starting (yet another) open-source reduction tool is always there, but I wondered whether it makes sense to avoid adding more fragmentation to this solution space and to join efforts instead.
P.S. To verify the approach and measure the impact, I implemented a quick-and-dirty prototype of these ideas. The initial numbers from my workstation (`--n=48`):
| Test-case | Size | ToT C-Vise Reduction | Hint-based C-Vise Reduction | Speed-up |
|---|---|---|---|---|
| GCC PR94937 | 8.5 MB | 210m | 30m | 7x |
| GCC PR92516 | 6.5 MB | 70m | 11m | 6x |
| Clang internal Google bug #383027690 | 4.6 MB | 150m | 6m | 25x |
| Clang internal Google bug #363816643 | 10 MB | 105m | 20m | 5x |
| Clang internal Google bug #321217557 | 97 MB | ~20000m (est.) | 220m | 90x |
| Clang internal Google bug #355835505 (C++ header modules) | 37 MB, 2638 files | n/a | 45m | n/a |
The performance improvement varies, but large inputs in particular benefit greatly from this approach. Additionally, the last row shows how the approach scales to new use cases: a multi-file input and the seamless integration of new heuristics (for reducing the number of files, reducing command-line parameters, and reducing C++ code that uses header modules).