8000 store_in(MemoryType::Stack) should use alloca if the size is small by abadams · Pull Request #6289 · halide/Halide · GitHub

store_in(MemoryType::Stack) should use alloca if the size is small #6289


Merged (12 commits, Oct 11, 2021)

Conversation

abadams (Member) commented Oct 5, 2021:

If an allocation for a Func marked as store_in(MemoryType::Stack) is only made once (e.g. because it's compute_at just inside a parallel loop), then we currently just call malloc, and there's no win from placing it on the stack.

This PR changes the behavior of MemoryType::Stack to call alloca instead when the size is small. This helps a lot in the single-use case.

It's the same pseudostack logic as in master. It allocates on first use (even if inside a loop), and reallocates only if the required size grows. Stack is never freed. If the cumulative stack usage for all reallocations for one Func exceeds our can_fit_on_stack threshold, we switch to heap for all future reallocations of this Func.
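The policy described above can be sketched as follows. This is an illustrative C++ sketch, not Halide's actual runtime code: the struct, function, and threshold names are hypothetical, and malloc stands in for the real alloca-based stack allocation so the snippet is portable.

```cpp
#include <cstdlib>
#include <cstddef>

// Hypothetical sketch of the pseudostack policy described above.
struct PseudostackSlot {
    void *ptr = nullptr;    // current allocation for this Func
    size_t size = 0;        // size of the current allocation
    size_t cumulative = 0;  // total bytes of (re)allocations so far
    bool on_heap = false;   // once set, all future reallocations use the heap
};

constexpr size_t kCanFitOnStack = 16 * 1024;  // assumed threshold, for illustration

// Allocate on first use; reallocate only if the required size grows.
void *pseudostack_get(PseudostackSlot &slot, size_t needed) {
    if (slot.ptr != nullptr && slot.size >= needed) {
        return slot.ptr;  // existing block is big enough: no reallocation
    }
    slot.cumulative += needed;
    if (slot.cumulative > kCanFitOnStack) {
        // Cumulative stack usage for this Func exceeded the threshold, so
        // switch to heap for this and all future reallocations.
        slot.on_heap = true;
    }
    // Stack memory is never freed; freeing of previous heap blocks is
    // elided in this sketch.
    slot.ptr = malloc(needed);  // stands in for alloca when !slot.on_heap
    slot.size = needed;
    return slot.ptr;
}
```

Note the key property: a request that fits in the current block returns immediately with no allocation at all, which is the common case inside a loop.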

The following three runtimes are from a test of the single-use case. In the first case we statically know the allocation size, so we can just size the original stack frame appropriately. In the second case we use the new behavior in this PR and subtract a computed amount from the stack pointer. In the third case we call malloc, as we would on master.

Constant-sized stack allocation: 0.165252
Use alloca: 0.184218
Use malloc: 0.215683

The runtimes are close together because allocating isn't the only thing this pipeline does. Treat the first of the three numbers as the baseline cost of everything other than allocation; after subtracting it from the other two, the new behavior is about 2.5x faster than the old behavior at the cost of allocating.
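The claimed speedup can be recomputed from the three timings above by treating the constant-size stack case as the cost of everything except allocation (names below are illustrative):

```cpp
// Timings from the PR description, in seconds.
constexpr double kConstantStack = 0.165252;  // baseline: no allocation cost
constexpr double kUseAlloca = 0.184218;
constexpr double kUseMalloc = 0.215683;

// Ratio of malloc's allocation cost to alloca's allocation cost, after
// subtracting the shared baseline from both measurements.
inline double allocation_speedup() {
    return (kUseMalloc - kConstantStack) / (kUseAlloca - kConstantStack);
}
```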

Those numbers are from Linux. On macOS, malloc is slower and the win is 7x.

dsharletg (Contributor) left a comment:

Thanks for doing this!

Value *size_zero = ConstantInt::get(size_type, 0);
Value *alloca_size = builder->CreateSelect(returned_null, llvm_size, size_zero);
// Allocate it. It's zero most of the time.
Value *stack_ptr = builder->CreateAlloca(i8_t->getPointerTo(), alloca_size);
Contributor:

Is alloca(0) guaranteed to be a no-op? The docs say: https://llvm.org/docs/LangRef.html#id203

Allocating zero bytes is legal, but the returned pointer may not be unique.

So it's allowed to be a no-op, but not sure if that will always be the case...?

Member:

From that explanation, I would expect it to return the current stack pointer without doing anything else. Whether that is a no-op or not is debatable, but I think that serves the purpose here. I gather the worry is that it may instead allocate some smallest size or something?

Contributor:

I think maybe we should just go ahead with this and assume that it is a no-op. I can't imagine any target doing something else.

abadams (Member Author):

Putting the alloca behind a branch instead was cheaper anyway in the common case of not reallocating.
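The "alloca behind a branch" shape can be sketched in C++, with malloc standing in for alloca so the snippet is portable (the function and parameter names are hypothetical, not the actual generated code):

```cpp
#include <cstdlib>
#include <cstddef>

// Sketch of the branch described above: rather than always executing an
// alloca whose size is select'd down to 0 on the common path, only execute
// the allocation on the rare path where the pseudostack had no usable block.
void *get_buffer(bool returned_null, void *existing_ptr, size_t size) {
    if (returned_null) {
        // Rare path: (re)allocate. In the actual generated code this is an
        // alloca; malloc stands in for it here.
        return malloc(size);
    }
    // Common path: reuse the existing block; no allocation executes at all,
    // which sidesteps the alloca(0) question entirely.
    return existing_ptr;
}
```

This also avoids relying on any particular target's treatment of a zero-sized alloca, which the thread above notes is only loosely specified.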

@@ -262,6 +262,7 @@ bool function_takes_user_context(const std::string &name) {

bool can_allocation_fit_on_stack(int64_t size) {
user_assert(size > 0) << "Allocation size should be a positive number\n";
// Should match the threshold defined in runtime/pseudostack.cpp
alexreinking (Member) commented Oct 6, 2021:

I think it would be worthwhile to create a runtime/constants.h that's just a bunch of #defines or constexprs or something that records constants that need to be kept consistent between the compiler and the runtime.
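A minimal version of the suggested header might look like the following. The file name comes from the comment above; the namespace, constant name, and value are illustrative assumptions, not Halide's actual definitions.

```cpp
// runtime/constants.h -- sketch of the proposed shared-constants header.
// The constant below would replace the "should match the threshold" comment
// by being included from both the compiler and runtime/pseudostack.cpp.
#ifndef HALIDE_RUNTIME_CONSTANTS_H
#define HALIDE_RUNTIME_CONSTANTS_H

#include <cstdint>

namespace Halide {
namespace Runtime {
namespace Constants {

// Allocations at or below this size may be placed on the (pseudo)stack.
// (16 KB is an assumed value, for illustration.)
static constexpr int64_t maximum_stack_allocation_bytes = 16 * 1024;

}  // namespace Constants
}  // namespace Runtime
}  // namespace Halide

#endif  // HALIDE_RUNTIME_CONSTANTS_H
```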

abadams (Member Author):

Agreed, but can you think of any others to put in there? I've been hunting and coming up blank.

alexreinking (Member) commented Oct 7, 2021:

Maybe the target features enum belongs there? There are also a few places where we include all of HalideRuntime.h for a single macro (e.g. HALIDE_ALWAYS_INLINE). I still think it's worthwhile even for one constant, though, since it's a path towards enforcing consistency of this and future constants in code, rather than via a reminder in a comment.

steven-johnson (Contributor):
Lots of failures, apparently

abadams (Member Author) commented Oct 8, 2021:
Builds are clean. PTAL.

// of total stack used per Func, we bail and start making heap
// allocations instead.

// The following would use 200 mb of stack if we just kept
Contributor:

200mb of stack ought to be enough for anybody

@abadams abadams merged commit e058532 into master Oct 11, 2021