Description
I am using AD for gradient-based optimization and need better performance than I am currently getting. I noticed that some work has gone into improving the .Double specializations recently, so I did some experiments with the latest master (85aee3c). My setup is as follows:
{-# LANGUAGE BangPatterns #-}
import Criterion.Main
import Criterion.Types
import Numeric.AD.Mode.Forward
--import Numeric.AD.Mode.Forward.Double
--import Numeric.AD.Mode.Reverse
--import Numeric.AD.Mode.Reverse.Double

-- poly x = x^1 + x^2 + ... + x^100000, computed with a strict accumulator
{-# INLINE poly #-}
poly :: Num a => a -> a
poly x = go (100000 :: Int) 0 where
  go 0 !a = a
  go n !a = go (n - 1) (a + x ^ n)

main :: IO ()
main = defaultMainWith config [beval, bdiff] where
  config = defaultConfig { regressions = [(["iters"], "allocated")] }
  p = 1.2 :: Double
  beval = bench "eval" $ whnf poly p
  bdiff = bench "diff" $ whnf (diff poly) p
I am using GHC 8.10.5 and LLVM 12.0.1 and compiled with -O2 -fllvm. I also enabled the +ffi flag for the ad package.
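For reference, my build setup looks roughly like the cabal.project sketch below (written from memory; bench-poly is just a placeholder name for my local benchmark package):

-- cabal.project (sketch)
packages: .

package ad
  flags: +ffi

-- placeholder name for the local benchmark package
package bench-poly
  ghc-options: -O2 -fllvm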
I get the following results (full details):
Mode                              Evaluate: time (alloc)    Differentiate: time (alloc)
---------------------------------------------------------------------------------------
Numeric.AD.Mode.Forward           4.107 ms (16 B)           521.4 ms (766.1 MB)
Numeric.AD.Mode.Forward.Double    4.147 ms (16 B)           5.119 ms (88 B)
Numeric.AD.Mode.Reverse           4.638 ms (16 B)           1.168 s (1.480 GB)
Numeric.AD.Mode.Reverse.Double    4.658 ms (16 B)           220.2 ms (770.2 MB)
Using NCG instead of LLVM, the results are similar, with slightly longer execution times. I am not sure why regular evaluation times also change with different modes.
I am very happy with Numeric.AD.Mode.Forward.Double, as it adds barely any overhead over regular evaluation.
While Numeric.AD.Mode.Reverse.Double is significantly faster than its generic counterpart, its roughly 50x slowdown is still a far cry from the promise that "automatic differentiation typically only decreases performance by a small multiplier". In particular, it allocates a lot of intermediate memory. Since the reverse-mode tape is implemented in C via the FFI (whose memory I presume is not counted by Haskell's GC), I suspect that the 770 MB of reported allocation indicates that some boxing is still going on.
Since I am doing gradient-based optimization, I would like to use reverse mode. Am I doing something wrong here? Is there anything that can be done to bring its performance more in line with how Numeric.AD.Mode.Forward.Double behaves? Or is this simply a consequence of the additional complexity and bookkeeping of reverse-mode AD that cannot be avoided and is only justified by its better performance on high-dimensional gradients?
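For concreteness, the kind of benchmark I ultimately care about looks roughly like the sketch below, where sumSq is just a toy stand-in for my real objective and grad comes from Numeric.AD.Mode.Reverse.Double:

import Data.List (foldl')
import Numeric.AD.Mode.Reverse.Double (grad)

-- toy high-dimensional objective: sum of squares over the input vector
sumSq :: Num a => [a] -> a
sumSq = foldl' (\acc v -> acc + v * v) 0

main :: IO ()
main = print (sum (grad sumSq (replicate 1000 1.2 :: [Double])))

Forward mode would need one pass per input component to compute this gradient, which is why reverse mode is the natural choice for me despite the overhead above.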