Description
I am using AD for gradient-based optimization and need better performance than I am currently getting. I noticed that some work has gone into improving the .Double specializations recently, so I did some experiments with the latest master (85aee3c). My setup is as follows:
{-# LANGUAGE BangPatterns #-}
import Criterion.Main
import Criterion.Types
import Numeric.AD.Mode.Forward
--import Numeric.AD.Mode.Forward.Double
--import Numeric.AD.Mode.Reverse
--import Numeric.AD.Mode.Reverse.Double

-- poly x = x^1 + x^2 + ... + x^100000, computed with a strict accumulator
{-# INLINE poly #-}
poly :: Num a => a -> a
poly x = go (100000 :: Int) 0 where
  go 0 !a = a
  go n !a = go (n - 1) (a + x ^ n)

main :: IO ()
main = defaultMainWith config [beval, bdiff] where
  config = defaultConfig { regressions = [(["iters"], "allocated")] }
  p = 1.2 :: Double
  beval = bench "eval" $ whnf poly p
  bdiff = bench "diff" $ whnf (diff poly) p
I am using GHC 8.10.5 and LLVM 12.0.1 and compiled with -O2 -fllvm. I also enabled the +ffi flag for the ad package.
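For reference, my build setup looks roughly like the cabal.project sketch below (written from memory; bench-poly is just a placeholder name for my local benchmark package):

-- cabal.project (sketch)
packages: .

package ad
  flags: +ffi

-- placeholder name for the local benchmark package
package bench-poly
  ghc-options: -O2 -fllvm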
I get the following results (full details):
Mode                              Evaluate: time (alloc)    Differentiate: time (alloc)
---------------------------------------------------------------------------------------
Numeric.AD.Mode.Forward           4.107 ms (16 B)           521.4 ms (766.1 MB)
Numeric.AD.Mode.Forward.Double    4.147 ms (16 B)           5.119 ms (88 B)
Numeric.AD.Mode.Reverse           4.638 ms (16 B)           1.168 s (1.480 GB)
Numeric.AD.Mode.Reverse.Double    4.658 ms (16 B)           220.2 ms (770.2 MB)
Using NCG instead of LLVM, the results are similar, with slightly longer execution times. I am not sure why regular evaluation times also change with different modes.
I am very happy with Numeric.AD.Mode.Forward.Double, as it adds barely any overhead over regular evaluation.
While Numeric.AD.Mode.Reverse.Double is significantly faster than its generic counterpart, its roughly 50x slowdown is still a far cry from the promise that "automatic differentiation typically only decreases performance by a small multiplier". In particular, it allocates a lot of intermediate memory. Since the reverse-mode tape is implemented in C via the FFI (whose memory I presume is not counted by Haskell's GC), I suspect that the 770 MB of reported allocation indicates that some boxing is still going on.
Since I am doing gradient-based optimization, I would like to use reverse mode. Am I doing something wrong here? Is there anything that can be done to bring its performance more in line with how Numeric.AD.Mode.Forward.Double behaves? Or is this simply a consequence of the additional complexity and bookkeeping of reverse-mode AD that cannot be avoided and is only justified by its better performance on high-dimensional gradients?
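For concreteness, the kind of benchmark I ultimately care about looks roughly like the sketch below, where sumSq is just a toy stand-in for my real objective and grad comes from Numeric.AD.Mode.Reverse.Double:

import Data.List (foldl')
import Numeric.AD.Mode.Reverse.Double (grad)

-- toy high-dimensional objective: sum of squares over the input vector
sumSq :: Num a => [a] -> a
sumSq = foldl' (\acc v -> acc + v * v) 0

main :: IO ()
main = print (sum (grad sumSq (replicate 1000 1.2 :: [Double])))

Forward mode would need one pass per input component to compute this gradient, which is why reverse mode is the natural choice for me despite the overhead above.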