Possible performance gains #521
@nsiccha, can you confirm that the code below is the target benchmarking function? […]
@yebai, yes, exactly. The main work happens in the final […]. It uses this overwrite of the […]. Using […]
@nsiccha, can you try to prepare an MWE that only depends on Mooncake?
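A minimal skeleton for such an MWE might look roughly like the sketch below. It is only illustrative: a toy log-density stands in for the actual model from the notebook, and it assumes Mooncake's build_rrule / value_and_gradient!! interface, whose details may differ between Mooncake versions.

```julia
# Hypothetical Mooncake-only MWE skeleton; the toy log-density is a placeholder
# for the real model and is NOT the benchmark from the notebook.
using Mooncake

logdensity(x) = -0.5 * sum(abs2, x)  # stand-in target function
x = randn(100)

# Build a reverse-mode rule for this call signature, then evaluate value and gradient.
rule = Mooncake.build_rrule(logdensity, x)
val, (dlogdensity, dx) = Mooncake.value_and_gradient!!(rule, logdensity, x)

# Crude timing of repeated gradient evaluations (rule construction excluded).
@time for _ in 1:1_000
    Mooncake.value_and_gradient!!(rule, logdensity, x)
end
```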
@yebai, of course, I'll try to do it next week.
Won't manage it this week, but hopefully the next one :)
This is a low priority issue.
I've come across some simple Bayesian models for which Mooncake is significantly (~4 times) slower than Enzyme or an alternative, very limited, proof-of-concept Julia AD package (StanBlocksAD.jl). AFAICT, Mooncake should be able to match the performance of Enzyme/StanBlocksAD.jl. It's a bit unclear to me what exactly is "dragging Mooncake down".
Furthermore, for a batched version of that model, neither Enzyme nor Mooncake achieves the same scaling as StanBlocksAD.jl. To clarify/summarize, the rough timings relative to the scalar StanBlocksAD.jl/Enzyme.jl timing are given in the notebook linked below.
Notebook with (slightly different) timings and potentially reproducible code: https://nsiccha.github.io/StanBlocksAD.jl/#why
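For reference, a rough sketch of how such a scalar comparison could be reproduced is given below. It is only illustrative and rests on assumptions: a toy log-density stands in for the actual model, and the backends are driven through DifferentiationInterface.jl with the AutoMooncake/AutoEnzyme types from ADTypes, which is not necessarily how the notebook is set up; exact API details may differ between package versions.

```julia
# Illustrative only: a toy log-density standing in for the real Bayesian model,
# with gradients taken through DifferentiationInterface using Mooncake and Enzyme.
using DifferentiationInterface
using ADTypes: AutoMooncake, AutoEnzyme
using BenchmarkTools
import Mooncake, Enzyme

logdensity(x) = -0.5 * sum(abs2, x)  # placeholder model
x = randn(1_000)

for backend in (AutoMooncake(; config=nothing), AutoEnzyme())
    # Preparation amortises backend-specific setup (e.g. rule/tape construction).
    prep = prepare_gradient(logdensity, backend, zero(x))
    t = @belapsed gradient($logdensity, $prep, $backend, $x)
    println(nameof(typeof(backend)), ": ", t, " s per gradient")
end
```

The preparation step is kept outside the timed region so that one-off setup costs do not distort the per-gradient comparison.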
I don't intend to continue developing StanBlocksAD.jl, but I find it interesting that there are apparently still performance gains to be had for something purely Julian. We can discuss what StanBlocksAD.jl does differently from Mooncake and what, if anything, could be ported to Mooncake. But this issue is mainly meant to record this link, and to be revisited at some later point.