WIP. Proof of concept for GPU accelerated genArea by hukumka · Pull Request #44 · Cubitect/cubiomes · GitHub

WIP. Proof of concept for GPU accelerated genArea #44


Draft
hukumka wants to merge 2 commits into master

Conversation

@hukumka commented Sep 30, 2020

Hello, and thanks for this awesome library.

This PR is a step toward #18 and implements generation of areas using OpenCL.

Lacking features

  • Layers past L_SHORE_16
  • Version support

Performance

When generating 64 seeds per routine, I observed a ~30x speedup.
When generating 1 seed per routine, the speedup is only ~5x.

Terribly sorry for dumping such a large chunk of code in a single PR, but I needed to see whether
my approach for avoiding recomputing the same layer multiple times works before submitting this.

@Cubitect (Owner) commented Oct 2, 2020

Thanks for the interest. I was always a little sceptical about performance with a GPU. Generating giant areas in one go might work reasonably well on a GPU, but the code is highly reliant on branching, which is like poison to a GPU and to SSE instructions. Also, I find myself needing small areas much more often than large ones, which makes this problem much worse. So I always leaned towards distributing the workload over CPU cores instead. That said, I'm quite interested to see what the performance would actually be using a GPU in different scenarios.

While checking out your branch I found a bug in the cubiomes library that caused allocCache to allocate too little memory when the entry point was one of the first few layers. That should be fixed now.

I found a couple of issues with the draft. I think at ocl_test.c:47 it should be bufferA[i + j*W] without the + s*W*H, and it does not seem to work for area sizes below 32x32.

out[xx + 1 + zz * w] = (cs >> 24) & 1 ? v10 : v00;
}
int v;
if (v10 == v01 && v01 == v11) v = v10;
A Contributor commented:
I did a few experiments one day trying to remove these branches, which come from select_mode_or_random. This is the alternative that worked best for me; it is about 25% faster than the "if cascade" on my CPU. I hope the difference is bigger on a GPU, but I can't try that myself:

https://github.com/Badel2/slime_seed_finder/blob/9334b161bd4b7b7b8d7251e48623d5803707921c/benches/select_mode_or_random.rs#L118

Suggested change
if (v10 == v01 && v01 == v11) v = v10;
int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
int cv10 = (v10 == v01) + (v10 == v11);
int cv01 = v01 == v11;
if (cv00 > cv10 && cv00 > cv01) {
    v = v00;
} else if (cv10 > cv00) {
    v = v10;
} else if (cv01 > cv00) {
    v = v01;
} else {
    // v = random
}

@Cubitect (Owner) commented Oct 5, 2020

This looks great! I see you did a lot of testing and the assembly does look significantly better, if only for the CPU. I did some rudimentary testing with CUDA C, and I was surprised that the improvement was only minor for a GPU. After some digging I found that the nvcc compiler manages to reduce the branching for this part of the device code quite well on its own (at least better than gcc).

3 participants