Proposal: hint-based highly parallelizable, pluggable reduction #169
Open
@emaxx-google

Description


After experimenting with C-Vise and C-Reduce, I noticed that the following aspects impede reduction performance:

  • Only one reduction heuristic is attempted at a time:
    • this makes the performance very sensitive to the choice and ordering of heuristics;
  • A single heuristic keeps being used even when it begins finding only very small reductions:
    • while the user can intervene using a hotkey, this requires monitoring by an experienced human;
  • When multiple reductions are discovered concurrently, only one is picked up:
    • this implies the speed is bounded by the size of a single reduction, even though many different reductions are often discovered quickly and could typically be applied together safely;
  • Some of the important heuristics rely on reformatting the input file:
    • as reformatting isn't 100% reliable, additional time has to be spent on the "sanity check", and in cases where the reformatted input fails it, the heuristic's performance suffers severely;
  • Integrating new heuristics is nontrivial:
    • tight coupling with the codebase and the need to reimplement common logic (binary search), in addition to the ordering/parallelism issues mentioned above, make this almost impractical despite the potential speedups for particular use cases (a new input language, lightweight grammar parsers, multi-file inputs, reducing command-line flags, etc.).

I'd like to propose the following changes in order to address these issues:

  1. Change each heuristic to only generate descriptions of reduction possibilities ("hints") instead of directly modifying the input file.

    • A hint may look like "delete bytes with indices X..Y", or "replace bytes with indices X..Y with text T", or a list of these, etc.
    • Example for the "RemoveUnusedFunction" heuristic:
      // vvv  hint1  vvv
      void f() {
        int x = 42;
      }
      // ^^^  hint1  ^^^
      
      class A {
        // vvv  hint2 sub-item1  vvv
        void g();
        // ^^^  hint2 sub-item1  ^^^
      };
      
      // vvv  hint2 sub-item2  vvv
      void A::g() {
      }
      // ^^^  hint2 sub-item2  ^^^
    • The actual protocol is TBD, but, for example, it could be JSON lines with items like [{"l": 42, "r": 84}] (a sketch of this format, and of applying hints, follows this list).
  2. Allow the user to specify additional hint generation tools.

    • Such a tool could be developed separately, in any language; all it needs to do is adhere to the output format described above (a toy example of such a generator follows this list).
  3. Implement generic logic that attempts reductions by picking subsets of hints.

    • Applying hints is a pure text-editing task (abstracted away from the reduction heuristic or language).
    • This logic should also implement the same "binary search" idea as before (but living in a single place).
    • Any additional ideas may be encapsulated here as well (e.g., adding randomness).
    • As different hints may overlap, some basic logic for merging is needed (e.g., "the leftmost wins").
  4. Whenever multiple successful reductions are discovered in parallel, attempt merging them as well ("merge train").

    • The ideal goal is proceeding with all of the reductions found so far.
    • As this isn't guaranteed to succeed in 100% of cases, the implementation should schedule this "merge train" job as yet another reduction attempt.
    • Any additional ideas may be encapsulated here (e.g., avoiding candidates that contain a previously failed merge train as a subset).
  5. Have an exit criterion for the merge-train growing logic.

    • I.e., once the current candidate stops growing quickly, we want to commit to it and go back to regenerating fresh hints (a sketch of this growth loop follows the illustration below).
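
As a rough sketch of what items 1 and 3 could look like together (the `l`/`r` field names are just the example from above; everything else, including whether the right index is inclusive, is a hypothetical choice made only for illustration):

```python
import json

# Hypothetical hint stream: one JSON object per line ("JSON lines"), each
# describing a byte range to remove and, optionally, replacement text "t".
EXAMPLE_HINTS = '{"l": 0, "r": 31}\n{"l": 45, "r": 58}\n{"l": 70, "r": 85, "t": "0"}\n'

def parse_hints(text):
    """Parses JSON-lines hints into (left, right, replacement) tuples."""
    hints = []
    for line in text.splitlines():
        if line.strip():
            obj = json.loads(line)
            hints.append((obj["l"], obj["r"], obj.get("t", "")))
    return hints

def apply_hints(data: bytes, hints) -> bytes:
    """Applies a subset of hints as pure text edits; on overlap, the leftmost hint wins.

    Here "r" is treated as an exclusive end index; that detail is part of the
    TBD protocol.
    """
    result = bytearray()
    pos = 0
    for left, right, repl in sorted(hints):
        if left < pos:
            continue            # overlaps an already-applied hint: leftmost wins
        result += data[pos:left] + repl.encode()
        pos = right
    result += data[pos:]
    return bytes(result)
```

The generic "binary search" from item 3 would then repeatedly call something like `apply_hints` with larger or smaller subsets of hints, implemented once instead of per heuristic.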
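
Similarly, to illustrate item 2, an external hint-generation tool could be as small as the following (a purely hypothetical script, assuming the JSON-lines format sketched above; it emits deletion hints for blank and //-comment lines):

```python
#!/usr/bin/env python3
# Toy external hint generator: prints one deletion hint per blank or
# //-comment line of the input file, in the assumed JSON-lines format.
import json
import sys

def main():
    data = open(sys.argv[1], "rb").read()
    offset = 0
    for line in data.splitlines(keepends=True):
        stripped = line.strip()
        if not stripped or stripped.startswith(b"//"):
            print(json.dumps({"l": offset, "r": offset + len(line)}))
        offset += len(line)

if __name__ == "__main__":
    main()
```

Such a tool would stay completely decoupled from the reducer: it never modifies the file and never runs the interestingness test itself.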

Illustration (for visual clarity, it depicts the hint generation as a preliminary step; in reality, we can start running tests on workers once we get the first hints):

```mermaid
flowchart LR;
Input@{shape: doc} --> Heuristics
subgraph heuristics-subgraph [ ]
  Heuristics@{shape: sm-circ}
  Hints-1@{shape: doc}
  Hints-2@{shape: doc}
  Hints-K@{shape: doc}
  Hints@{shape: docs, label: "Hints\n*(regrouped by type,\nsorted)*"}
end
Heuristics -->|heuristic-1| Hints-1
Heuristics -->|heuristic-2| Hints-2
Heuristics -->|heuristic-K| Hints-K
Hints-1 --> Hints@{shape: docs}
Hints-2 --> Hints
Hints-K --> Hints
Hints --> Scheduler@{shape: diamond, label: "**Job Scheduler**\n*(bin-srch over hints;\nform merge trains)*"}
subgraph workers-subgraph [ ]
  Worker1@{label: "Worker-1\n*(merge&apply edits,\nrun test)*"}
  Worker2@{label: "Worker-2\n..."}
  Worker3@{label: "Worker-3\n..."}
  WorkerN@{label: "Worker-N\n..."}
end
Scheduler <--> Worker1
Scheduler <--> Worker2
Scheduler <--> Worker3
Scheduler <--> WorkerN
Scheduler --> | train growth slow? | Commit@{label: "Commit\n*(merge&apply biggest\ntrain's edits to Input)*"}
Commit ----> Input
style Scheduler stroke-width:0
```
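
To make the scheduler's "merge train" growth (items 4 and 5) a bit more concrete, here is a minimal sketch; the function name, the callback interface, and the growth-rate threshold are all hypothetical, and a real scheduler would run these attempts asynchronously on the workers shown above:

```python
def grow_merge_train(successful_subsets, run_test, min_growth=0.1):
    """Greedily merges individually successful hint subsets into one "train".

    successful_subsets: hint subsets already known to pass the test on their own.
    run_test:           callback that applies a combined subset and reruns the
                        interestingness test, returning True on success.
    Returns the grown train once it stops growing quickly (the exit criterion
    from item 5), so the reducer can commit it and regenerate fresh hints.
    """
    train = []
    for candidate in successful_subsets:
        merged = train + [h for h in candidate if h not in train]
        if run_test(merged):
            grown_by = (len(merged) - len(train)) / max(len(merged), 1)
            train = merged
            if grown_by < min_growth:
                break  # the train stopped growing quickly: commit and move on
        # a failed merge is just another unsuccessful attempt; the train keeps
        # its previous contents and the next candidate is tried
    return train
```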

Understandably, this sounds like a big rework of C-Vise, so I'd like to understand whether the project and community are open to this kind of change. The backup plan of starting (yet another) open-source reduction tool is always there, but I was wondering whether it makes sense to avoid adding more fragmentation to this solution space and to join efforts instead.

P.S. To verify the approach and measure the impact, I implemented a quick-and-dirty prototype of these ideas. The initial numbers from my workstation (--n=48):

| Test-case | Size | ToT C-Vise Reduction | hint-based C-Vise Reduction | Speed Up |
|---|---|---|---|---|
| GCC PR94937 | 8.5 MB | 210m | 30m | 7x |
| GCC PR92516 | 6.5 MB | 70m | 11m | 6x |
| Clang internal Google bug #383027690 | 4.6 MB | 150m | 6m | 25x |
| Clang internal Google bug #363816643 | 10 MB | 105m | 20m (v1: 65m) | 5x |
| Clang internal Google bug #321217557 | 97 MB | ~20000m (est.) | 220m | 90x |
| Clang internal Google bug #355835505 (C++ header modules) | 37 MB, 2638 files | n/a | 45m | n/a |

The performance improvement varies, but especially large inputs seem to benefit a lot from this approach. Additionally, the last row shows how the approach scales to new use cases: a multi-file input and seamless integration of new heuristics (for reducing the number of files, for reducing command-line parameters, and for reducing C++ code that uses header modules).
