Document Loop Optimisation Opportunities #156
I still need to add more concrete info about what loop optimisations are possible, but here's a summary of the state of affairs currently:
A great advantage of this approach is that we rely on everything boiling down to the same kind of looping structure in the CFG -- essentially every performant CPU-based computation reduces to a loop in the CFG (specifically, what compiler-optimisation terminology calls a "natural loop"). There are well-established optimisation strategies for loops, so we don't need to implement separate rules for all the different higher-order functions to get good performance, nor do we need to tell people to steer clear of writing for or while loops. (The situation in which this strategy breaks down is if people write code that does not lower to a loop in the CFG at all.)
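As a rough illustration of this point (a minimal sketch, not code from this issue): an explicit while-loop and a for-loop over `eachindex` compute the same thing, and both reduce to a single natural loop (one back edge) in the optimised IR, which is why one set of loop optimisations covers both styles.

```julia
# Minimal sketch: both functions lower to a single natural loop in the CFG.
function sum_while(x::Vector{Float64})
    s, n = 0.0, 0
    while n < length(x)
        n += 1
        s += x[n]
    end
    return s
end

function sum_for(x::Vector{Float64})
    s = 0.0
    for n in eachindex(x)
        s += x[n]
    end
    return s
end

# Inspect the optimised IR of each -- both contain one loop header and one back edge.
Base.code_ircode_by_type(Tuple{typeof(sum_while), Vector{Float64}})
Base.code_ircode_by_type(Tuple{typeof(sum_for), Vector{Float64}})
```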
Tapir.jl does not perform as well as it could on functions like the following:

```julia
function foo!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd for n in eachindex(x)
        y[n] = y[n] + x[n]
    end
    return y
end
```

For example, on my computer:

```
y = randn(4096)
x = randn(4096)

julia> @benchmark foo!($y, $x)
BenchmarkTools.Trial: 10000 samples with 173 evaluations.
Range (min … max): 547.150 ns … 3.138 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 646.633 ns ┊ GC (median): 0.00%
Time (mean ± σ): 682.488 ns ± 116.548 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄██▂
▁▁▂▄▇████▇▇▇▆▆▅▅▅▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
547 ns Histogram: frequency by time 1.18 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

```julia
rule = Tapir.build_rrule(foo!, y, x);
foo!_d = zero_fcodual(foo!)
y_d = zero_fcodual(y)
x_d = zero_fcodual(x)
out, pb!! = rule(foo!_d, y_d, x_d);
```

```
julia> @benchmark ($rule)($foo!_d, $y_d, $x_d)[2](NoRData())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 64.042 μs … 202.237 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 78.675 μs ┊ GC (median): 0.00%
Time (mean ± σ): 75.763 μs ± 10.175 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇ ▇ ▇ ▄▂ ▅ ▃ ▆ ▅▆ █▄ ▄ ▁▂ ▂ ▁ ▂ ▃
█▃█▃█▄██▁▄▁▆▄▁▁█▄█▆██████████▇▄██▃▅█▃▃█▆▆▇█▆▆▆▇▆▄▁█▃▃▃▅█▅▃▁█ █
64 μs Histogram: log(frequency) by time 108 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
```

So the performance ratio is roughly 120x (the AD'd call takes ~79 μs vs ~650 ns for the primal). Note that this is not due to type-instabilities. One way to convince yourself of this is that no allocations are required to run AD, which would most certainly not be the case were there type instabilities. To see this, take a look at the optimised IR for `foo!`:

```
2 1 ── %1 = Base.arraysize(_3, 1)::Int64 │╻╷╷╷╷ macro expansion
│ %2 = Base.slt_int(%1, 0)::Bool ││╻╷╷╷╷ eachindex
│ %3 = Core.ifelse(%2, 0, %1)::Int64 │││╻ axes1
│ %4 = %new(Base.OneTo{Int64}, %3)::Base.OneTo{Int64} ││││┃││││ axes
└─── goto #14 if not true │╻ macro expansion
2 ── %6 = Base.slt_int(0, %3)::Bool ││╻ <
└─── goto #12 if not %6 ││
3 ── nothing::Nothing │
4 ┄─ %9 = φ (#3 => 0, #11 => %27)::Int64 ││
│ %10 = Base.slt_int(%9, %3)::Bool ││╻ <
└─── goto #12 if not %10 ││
5 ── %12 = Base.add_int(%9, 1)::Int64 ││╻╷ simd_index
└─── goto #9 if not false │││╻ getindex
6 ── %14 = Base.slt_int(0, %12)::Bool ││││╻ >
│ %15 = Base.sle_int(%12, %3)::Bool ││││╻ <=
│ %16 = Base.and_int(%14, %15)::Bool ││││╻ &
└─── goto #8 if not %16 ││││
7 ── goto #9 │
8 ── invoke Base.throw_boundserror(%4::Base.OneTo{Int64}, %12::Int64)::Union{}
└─── unreachable ││││
9 ┄─ goto #10 │
10 ─ goto #11 │
11 ─ %23 = Base.arrayref(false, _2, %12)::Float64 ││╻╷ macro expansion
│ %24 = Base.arrayref(false, _3, %12)::Float64 │││┃ getindex
│ %25 = Base.add_float(%23, %24)::Float64 │││╻ +
│ Base.arrayset(false, _2, %25, %12)::Vector{Float64} │││╻ setindex!
│ %27 = Base.add_int(%9, 1)::Int64 ││╻ +
│ $(Expr(:loopinfo, Symbol("julia.simdloop"), nothing))::Nothing │╻ macro expansion
└─── goto #4 ││
12 ┄ goto #14 if not false ││
13 ─ nothing::Nothing │
5 14 ┄ return _2 │
```

The performance-critical chunk is the loop body, roughly `%9` through `%27` above. Under AD, each call in this region is replaced by a call to its rule. For example, the `arrayref` at `%23` becomes something like:

```
%23_ = rrule!!(zero_fcodual(Base.arrayref), zero_fcodual(false), _2, %12)
%23 = %23_[1]
push!(%23_pb_stack, %23_[2])
```

In short, we run the rule, pull out the first element of the result, and push the pullback onto a stack for use on the reverse-pass. So there is at least one really large and obvious source of overhead here: pushing to / popping from these stacks. If you take a look at the pullback for `arrayref`, you'll see that it carries around the tangent of the array and the index that was read. This information is necessary for AD, but it does not all need to be stored afresh at every iteration of the loop.
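To get a rough feel for the cost of this per-iteration stack traffic, compare the primal loop with a toy version that does one `push!` per iteration. This is illustrative only -- a stand-in for recording a pullback, not what Tapir.jl actually generates.

```julia
using BenchmarkTools

# Toy stand-in for per-iteration pullback recording: same arithmetic as foo!, plus one
# push! per iteration.
function foo_with_push!(y::Vector{Float64}, x::Vector{Float64}, stack::Vector{Int})
    empty!(stack)
    @inbounds for n in eachindex(x)
        y[n] = y[n] + x[n]
        push!(stack, n)
    end
    return y
end

y, x, stack = randn(4096), randn(4096), sizehint!(Int[], 4096)
@benchmark foo_with_push!($y, $x, $stack)
```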
What's not obvious here, but is also important, is that the call to `push!` must check whether the stack has room for the next pullback, and allocate more memory if it does not. Now, Tapir.jl is implemented in such a way that, if the pullback for a particular function is a singleton / doesn't carry around any information, the associated pullback stack is eliminated entirely. Moreover, just reducing the amount of memory stored at each iteration should reduce memory pressure. Consequently, a good strategy for making progress is to figure out how to reduce the amount of stuff which gets stored in the pullback stacks. The two points noted above provide obvious starting points.

Making use of loop invariants

In short: amend the rule interface such that the arguments to the forwards-pass are also made available on the reverse-pass. For example, the rule for `arrayref` currently looks something like:

```julia
function rrule!!(::CoDual{typeof(arrayref)}, inbounds::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
_ind = primal(ind)
dx = tangent(x)
function arrayref_pullback(dy)
dx[_ind] += dy
return NoRData(), NoRData(), NoRData(), NoRData()
end
return CoDual(primal(x)[_ind], tangent(x)[_ind]), arrayref_pullback
end
```

This skips some details, but the important point is that `arrayref_pullback` closes over `dx` and `_ind`, both of which are derived from the arguments to the forwards-pass, so they must be stored in the pullback stack at every iteration. Under the new interface, this would look something like:

```julia
function rrule!!(::CoDual{typeof(arrayref)}, inbounds::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
function arrayref_pullback(dy, ::CoDual{typeof(arrayref)}, ::CoDual{Bool}, x::CoDual{Vector{Float64}}, ind::CoDual{Int})
_ind = primal(ind)
dx = tangent(x)
dx[_ind] += dy
return NoRData(), NoRData(), NoRData(), NoRData()
end
return CoDual(primal(x)[_ind], tangent(x)[_ind]), arrayref_pullback
end
```

In this version of the rule, `arrayref_pullback` captures nothing at all: everything it needs is handed to it on the reverse-pass. So this interface change frees up Tapir.jl to provide the arguments on the reverse-pass in whichever way it pleases. In this particular example, both the array being read from and the `inbounds` flag are loop invariants, so they would not need to be stored at every iteration. It's impossible to know for sure how much of an effect this would have, but doing this alone would more than halve the memory requirement for this example.

Induction Variable Analysis

I won't address how we could make use of induction variable analysis here, because I'm still trying to get my head around the easiest way to go about it.
Another obvious optimisation is to analyse the trip count, and pre-allocate the (necessary) pullback stacks in order to avoid branching during execution (i.e. checking that they're long enough to store the next pullback, and allocating more memory if not). This is related to induction variable analysis, so we'd probably want to do that first. Doing this kind of optimisation would enable vectorisation to happen more effectively in AD, as we could completely eliminate branching from a number of tight loops.
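As a rough sketch of the kind of win being described (illustrative only, not Mooncake's actual stack type): if the trip count is known up front, per-iteration storage can be written into a pre-sized buffer with a plain `setindex!`, avoiding the capacity check-and-grow branch inside `push!`.

```julia
# Illustrative only: compare growing a stack with push! against writing into a buffer
# sized using a known trip count. The second loop body contains no branches, so it is
# much friendlier to vectorisation.
function record_with_push!(stack::Vector{Int}, n::Int)
    empty!(stack)
    for i in 1:n
        push!(stack, i)        # checks capacity and may reallocate on each iteration
    end
    return stack
end

function record_preallocated!(buf::Vector{Int}, n::Int)
    resize!(buf, n)            # allocate once, using the known trip count
    @inbounds for i in 1:n
        buf[i] = i             # straight-line store, no branching
    end
    return buf
end
```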
Good investigations; it's probably okay to keep this issue open instead of transferring the discussions here into docs.
A Compiler-Friendly Implementation of
Some Profiling Results

@yebai @sunxd3: as promised the other week, here are some benchmarking numbers to give you a rough sense of how much time is spent doing "book-keeping" vs "interesting work". I'll leave you to extrapolate from my comments on the example below.

Setup

To replicate, first activate the relevant environment and define:

```julia
function prep_test_case(fargs...)
rule = Mooncake.build_rrule(fargs...)
coduals = map(zero_codual, fargs)
return rule, coduals
end
function run_many_times(N, f, args::Vararg{Any, P}; kwargs...) where {P}
@inbounds for _ in 1:N
f(args...; kwargs...)
end
return nothing
end
```

You may need to run the
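For concreteness, here is a hedged sketch of how these helpers might be used with the `foo!` example from earlier in the thread; the actual invocations used to produce the numbers discussed here may differ.

```julia
# Sketch only: assumes Mooncake, BenchmarkTools, and `foo!` (defined earlier in this
# issue) are in scope. Mirrors the benchmarking pattern used near the top of the thread.
using BenchmarkTools

y, x = randn(4096), randn(4096)
rule, coduals = prep_test_case(foo!, y, x)
foo!_d, y_d, x_d = coduals

# Primal timing, amortised over many calls via run_many_times.
@benchmark run_many_times(100, foo!, $y, $x)

# Forwards- plus reverse-pass timing. foo! returns a Vector{Float64}, whose reverse-data
# is NoRData, so the pullback is seeded with NoRData().
@benchmark (($rule)($foo!_d, $y_d, $x_d)[2])(NoRData())
```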
Per our discussions, here's a simple hand-derived example in which handling loop invariants + induction variables correctly should be enough to get us good performance. Consider the following simple implementation of sum:

```julia
function _sum(f::F, x::AbstractArray{<:Real}) where {F}
y = 0.0
n = 0
while n < length(x)
n += 1
y += f(x[n])
end
return y
end
```

On Julia LTS, it has the following IRCode:

```
julia> Base.code_ircode_by_type(Tuple{typeof(_sum), typeof(identity), Vector{Float64}})
1-element Vector{Any}:
1 ─ nothing::Nothing │
7 2 ┄ %2 = φ (#1 => 0, #3 => %7)::Int64 │
│ %3 = φ (#1 => 0.0, #3 => %9)::Float64 │
│ %4 = Base.arraylen(_3)::Int64 │╻ length
│ %5 = Base.slt_int(%2, %4)::Bool │╻ <
└── goto #4 if not %5 │
8 3 ─ %7 = Base.add_int(%2, 1)::Int64 │╻ +
9 │ %8 = Base.arrayref(true, _3, %7)::Float64 │╻ getindex
│ %9 = Base.add_float(%3, %8)::Float64 │╻ +
10 └── goto #2 │
11 4 ─ return %3 │
 => Float64
```

It cycles between blocks 2 and 3 to perform the loop, exiting via block 4. In block 2, the SSA values `%2` and `%3` are φ-nodes holding the loop variables `n` and `y` respectively. To understand what book-keeping is happening, it is enough to understand the forwards-pass IR generated by Mooncake. Our goal will be to eliminate the need for various book-keeping data structures in the forwards-pass -- once they're gone from here, they'll disappear from the reverse-pass also. The forwards-pass IR generated by Mooncake is (before inlining):

```
5 1 ─ %1 = (Mooncake.get_shared_data_field)(_1, 1)::Mooncake.Stack{Int32} │
│ %2 = (Mooncake.get_shared_data_field)(_1, 2)::Base.RefValue{Tuple{Mooncake.LazyZeroRData{typeof(_sum), Nothing}, Mooncake.LazyZeroRData{typeof(identity), Nothing}, Mooncake.LazyZeroRData{Vector{Float64}, Nothing}}}
│ %3 = (Mooncake.get_shared_data_field)(_1, 3)::Mooncake.Stack{Tuple{Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}}}
└── (Mooncake.__assemble_lazy_zero_rdata)(%2, _2, _3, _4)::Core.Const((Mooncake.LazyZeroRData{typeof(_sum), Nothing}(nothing), Mooncake.LazyZeroRData{typeof(identity), Nothing}(nothing), Mooncake.LazyZeroRData{Vector{Float64}, Nothing}(nothing)))
2 ─ (Mooncake.__push_blk_stack!)(%1, 11)::Core.Const(nothing) │
3 ┄ %6 = φ (#2 => Mooncake.CoDual{Int64, NoFData}(0, NoFData()), #4 => %19)::Mooncake.CoDual{Int64, NoFData} │
│ %7 = φ (#2 => Mooncake.CoDual{Float64, NoFData}(0.0, NoFData()), #4 => %28)::Mooncake.CoDual{Float64, NoFData} │
│ %8 = (identity)(_4)::Mooncake.CoDual{Vector{Float64}, Vector{Float64}} │
│ %9 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.arraylen), NoFData}(Mooncake.IntrinsicsWrappers.arraylen, NoFData()))), %8)::Tuple{Mooncake.CoDual{Int64, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.arraylen), Nothing}, Mooncake.LazyZeroRData{Vector{Float64}, Nothing}}}}
│ %10 = (getfield)(%9, 1)::Mooncake.CoDual{Int64, NoFData} │
│ %11 = (identity)(%6)::Mooncake.CoDual{Int64, NoFData} │
│ %12 = (identity)(%10)::Mooncake.CoDual{Int64, NoFData} │
│ %13 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.slt_int), NoFData}(Mooncake.IntrinsicsWrappers.slt_int, NoFData()))), %11, %12)::Tuple{Mooncake.CoDual{Bool, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.slt_int), Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}}}}
│ %14 = (getfield)(%13, 1)::Mooncake.CoDual{Bool, NoFData} │
│ %15 = (Mooncake.primal)(%14)::Bool │
└── goto #5 if not %15 │
4 ─ %17 = (identity)(%6)::Mooncake.CoDual{Int64, NoFData} │
│ %18 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.add_int), NoFData}(Mooncake.IntrinsicsWrappers.add_int, NoFData()))), %17, $(QuoteNode(Mooncake.CoDual{Int64, NoFData}(1, NoFData()))))::Tuple{Mooncake.CoDual{Int64, NoFData}, Mooncake.NoPullback{Tuple{Mooncake.LazyZeroRData{typeof(Mooncake.IntrinsicsWrappers.add_int), Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}, Mooncake.LazyZeroRData{Int64, Nothing}}}}
│ %19 = (getfield)(%18, 1)::Mooncake.CoDual{Int64, NoFData} │
│ %20 = (identity)(_4)::Mooncake.CoDual{Vector{Float64}, Vector{Float64}} │
│ %21 = (identity)(%19)::Mooncake.CoDual{Int64, NoFData} │
│ %22 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Core.arrayref), NoFData}(Core.arrayref, NoFData()))), $(QuoteNode(Mooncake.CoDual{Bool, NoFData}(true, NoFData()))), %20, %21)::Tuple{Mooncake.CoDual{Float64, NoFData}, Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}}
│ %23 = (getfield)(%22, 1)::Mooncake.CoDual{Float64, NoFData} │
│ %24 = (getfield)(%22, 2)::Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64} │
│ %25 = (identity)(%7)::Mooncake.CoDual{Float64, NoFData} │
│ %26 = (identity)(%23)::Mooncake.CoDual{Float64, NoFData} │
│ %27 = (Mooncake.rrule!!)($(QuoteNode(Mooncake.CoDual{typeof(Mooncake.IntrinsicsWrappers.add_float), NoFData}(Mooncake.IntrinsicsWrappers.add_float, NoFData()))), %25, %26)::Tuple{Mooncake.CoDual{Float64, NoFData}, Mooncake.IntrinsicsWrappers.var"#add_float_pb!!#2"}
│ %28 = (getfield)(%27, 1)::Mooncake.CoDual{Float64, NoFData} │
│ %29 = (tuple)(%24)::Tuple{Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}} │
│ (push!)(%3, %29)::Core.Const(nothing) │
│ (Mooncake.__push_blk_stack!)(%1, 13)::Core.Const(nothing) │
└── goto #3 │
5 ─ return %7
```

To orient yourself, observe that each call in the primal IR has a corresponding `rrule!!` call in the forwards-pass IR (`arraylen` at `%9`, `slt_int` at `%13`, `add_int` at `%18`, `arrayref` at `%22`, and `add_float` at `%27`), each of which returns a tuple containing the output CoDual and a pullback.
The thing that we're interested in optimising away is the call to `push!` on the pullback stack `%3` (and, with it, the stack itself). To understand what needs to happen here, observe that the type of the pullback for `arrayref` is `Mooncake.var"#arrayref_pullback!!#635"{1, Vector{Float64}, Int64}` -- it carries around a `Vector{Float64}` and an `Int64`, which is why something has to be pushed to a stack at every iteration. Happily, this is quite straightforward in principle. We can see in the primal code that the array being passed in to `arrayref` is the argument `_3`, which is loop-invariant, and that the index is just the induction variable. Once we've made it so that the pullback no longer needs to carry this information around -- i.e. it becomes a singleton -- the pullback stack `%3` and the `push!` to it can be eliminated entirely. The remainder of the overhead is a discussion for later, but I believe that the majority of it just requires induction variable analysis.
Excellent benchmarking, @willtebbutt! Given that the runtime is dominated by book-keeping overheads here, I wonder whether we can increase the actual compute workload of a single loop iteration, for example by partially unrolling the loops:

```julia
function _sum(f::F, x::AbstractArray{<:Real}) where {F}
y = 0.0
n = 0
while n < length(x)
n += 1
y += f(x[n])
n < length(x) || break
n += 1
y += f(x[n])  # each loop iteration now carries double (or n-times) the compute burden
end
return y
end
```

This could serve as a simpler solution than the other optimisation approaches based on loop invariants or induction variable analysis. It would also avoid the need for a significant refactoring of the rule interface.

EDIT: I googled a bit more after posting my comment: loop unrolling is a mature idea in optimising compilers.
Thanks for this suggestion @yebai. Let's divide the book-keeping overhead into two categories:

1. the fixed per-block-iteration overhead (e.g. recording which basic block was visited on the block stack), and
2. the overhead associated with pushing pullbacks to / popping them from the pullback stacks, and the memory this stores.
I agree that your proposal to do some loop unrolling ought to reduce the first source of book-keeping overhead -- roughly by half if you have two statements per block, by a third if you have three, etc. I also agree that it is likely to have some effect on the overhead associated with pushing / popping data to / from the pullback stacks (you'll halve the pushes / pops if you double the number of statements per block), but it won't change the amount of data actually being stored in these stacks. This makes it hard to reason about its impact in practice without benchmarking -- I'll do some benchmarking with your manually-unrolled example and see what happens.

Implementation: I'm fairly certain that in order to unroll a loop we'd have to

1. identify the natural loops in the primal IR,
2. perform induction variable analysis to find the variable we're iterating over, and
3. modify the primal IR to duplicate the loop body.
(The first two steps are required in order to identify the loop and the variable we're iterating over.) So it seems to me that all of the ideas discussed here require that we identify the natural loops, and that both loop unrolling and exploiting structure in induction variables require that we identify induction variables. This makes me think that loop unrolling is unlikely to be any easier than any of the other things we have discussed. I actually think it might be a bit more awkward, because we would have to modify the primal IR. Do you think I'm missing something? In terms of how well understood the techniques are, my impression is that natural loop identification, loop unrolling, induction variable analysis, and the identification of loop invariants (and the associated code motion) are all very well understood.
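For reference, the natural-loop identification step really is small. Here is a minimal sketch of the textbook algorithm (dominators, then back edges) over a toy CFG representation -- this is not Mooncake code, nor the gist linked in the next comment, just an illustration of why the analysis itself is not the hard part.

```julia
# Minimal sketch of textbook natural-loop identification. The CFG is a vector of
# successor lists; block 1 is the entry.
function dominators(succs::Vector{Vector{Int}})
    n = length(succs)
    preds = [Int[] for _ in 1:n]
    for b in 1:n, s in succs[b]
        push!(preds[s], b)
    end
    doms = [Set(1:n) for _ in 1:n]
    doms[1] = Set([1])
    changed = true
    while changed
        changed = false
        for b in 2:n
            new_dom = isempty(preds[b]) ? Set([b]) :
                union(Set([b]), intersect([doms[p] for p in preds[b]]...))
            if new_dom != doms[b]
                doms[b] = new_dom
                changed = true
            end
        end
    end
    return doms
end

# A back edge is an edge u -> v whose target v dominates its source u. Each back edge
# identifies a natural loop with header v.
function back_edges(succs::Vector{Vector{Int}})
    doms = dominators(succs)
    return [(u, v) for u in eachindex(succs) for v in succs[u] if v in doms[u]]
end

# The CFG of `_sum` above: block 1 -> 2, block 2 -> {3, 4}, block 3 -> 2, block 4 exits.
succs = [[2], [3, 4], [2], Int[]]
back_edges(succs)  # [(3, 2)]: a single natural loop, with header block 2
```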
On a separate note, I just dug up some old code (that I thought I had lost!) which identifies natural loops, and does some loop-invariant analysis. As you'll see, it's really surprisingly simple: https://gist.github.com/willtebbutt/16bce3962aa3e9afc11465d0073ccdda
Given the benchmarking results below (copied and pasted from here), I suspect that book-keeping is one of the significant slowdown factors. The other one, likely more significant, is that our autograd transform gets in the way of the compiler's own loop optimisations (e.g. SIMD vectorisation).
I agree with this analysis. A quick calculation suggests that the total time to run AD for this example is consistent with it being dominated by book-keeping and the loss of vectorisation.
A quick thought: can we have a macro that users apply to their loops to mark them for this kind of treatment? This would allow us to perform optimisations at the AST level instead of the IR level, which could be easier to make compatible with tools like LoopVectorization.jl.

Related: https://juliasimd.github.io/LoopVectorization.jl/stable/
I do not know how this would be made to work, but I would be happy to discuss proposals.
It is inspired by the observation that Mooncake's transform gets in the way of the compiler's loop optimisations (e.g. SIMD vectorisation). The limitation is that this won't help already-defined functions, e.g. Julia's standard library, since one cannot automatically insert such macros into their definitions at the AST level.
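To make the shape of the proposal concrete, here is a purely hypothetical sketch: `@ad_loop_hint` below is an invented, no-op placeholder standing in for whatever annotation macro might be designed; a real implementation would rewrite the loop (or record information for Mooncake) rather than just passing it through.

```julia
# Hypothetical sketch only: the macro name and behaviour are invented for illustration.
# A real version would transform the loop at the AST level; this placeholder is a no-op.
macro ad_loop_hint(loop_expr)
    return esc(loop_expr)
end

function _sum_marked(f::F, x::AbstractArray{<:Real}) where {F}
    y = 0.0
    @ad_loop_hint for n in eachindex(x)
        y += f(x[n])
    end
    return y
end
```

As noted above, the catch is that an annotation like this only helps code that users write themselves.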