WIP. Proof of concept for GPU accelerated genArea #44
This implementation is a proof of concept and is missing:
+ Layers past L_SHORE_16
+ Support for different Minecraft versions
Thanks for the interest, I was always a little sceptical about performance with a GPU. Generating giant areas in one go might work reasonably well on a GPU, but the code is highly reliant on branching, which is like poison to a GPU and to SSE instructions. Also, I find myself needing small areas much more often than large ones, which makes this problem much worse. So I always leaned towards distributing the workload across CPU cores instead. That said, I'm quite interested to see what the performance would actually be using a GPU in different scenarios. While checking out your branch I found a bug in the cubiomes library that caused [...] I found a couple of issues with the draft. I think at
out[xx + 1 + zz * w] = (cs >> 24) & 1 ? v10 : v00;
}
int v;
if (v10 == v01 && v01 == v11) v = v10;
I did a few experiments one day trying to remove these branches, which come from select_mode_or_random. This is the alternative that worked best for me; it is about 25% faster than the "if cascade" on my CPU. I hope the difference is bigger on a GPU, but I can't try that myself:
if (v10 == v01 && v01 == v11) v = v10;
int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
int cv10 = (v10 == v01) + (v10 == v11);
int cv01 = v01 == v11;
if (cv00 > cv10 && cv00 > cv01) {
    v = v00;
} else if (cv10 > cv00) {
    v = v10;
} else if (cv01 > cv00) {
    v = v01;
} else {
    // v = random
}
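A minimal, self-contained sketch of how the counting variant could be checked against an explicit comparison cascade. The cascade below is a reconstruction written for illustration, not necessarily the library's exact code, and `RANDOM_PICK`, `mode_cascade`, `mode_counting` and `count_mismatches` are hypothetical names; the sentinel only stands in for the random fallback so that both versions can be compared deterministically.

```c
/* Sentinel standing in for the random pick (illustrative assumption). */
enum { RANDOM_PICK = -1 };

/* Reference: an explicit "if cascade" that returns the most common of the
 * four values, or the sentinel when there is no unique mode. */
static int mode_cascade(int v00, int v10, int v01, int v11)
{
    if (v10 == v01 && v01 == v11) return v10;
    if (v00 == v10 && v00 == v01) return v00;
    if (v00 == v10 && v00 == v11) return v00;
    if (v00 == v01 && v00 == v11) return v00;
    if (v00 == v10 && v01 != v11) return v00;
    if (v00 == v01 && v10 != v11) return v00;
    if (v00 == v11 && v10 != v01) return v00;
    if (v10 == v01 && v00 != v11) return v10;
    if (v10 == v11 && v00 != v01) return v10;
    if (v01 == v11 && v00 != v10) return v01;
    return RANDOM_PICK;
}

/* Counting variant from the comment above: count how often each value
 * matches the values that come after it, then compare the counts. */
static int mode_counting(int v00, int v10, int v01, int v11)
{
    int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
    int cv10 = (v10 == v01) + (v10 == v11);
    int cv01 = (v01 == v11);
    if (cv00 > cv10 && cv00 > cv01) return v00;
    if (cv10 > cv00) return v10;
    if (cv01 > cv00) return v01;
    return RANDOM_PICK;
}

/* Compare both versions on all n^4 inputs drawn from 0..n-1 and return the
 * number of disagreements. n = 4 already covers every equality pattern
 * that four values can form. */
static int count_mismatches(int n)
{
    int bad = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            for (int c = 0; c < n; c++)
                for (int d = 0; d < n; d++)
                    bad += mode_cascade(a, b, c, d) != mode_counting(a, b, c, d);
    return bad;
}
```

Calling `count_mismatches(4)` is enough to exercise every case, including the two-pairs case (for example 1, 1, 2, 2) where neither version finds a unique mode and both fall through to the random pick.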
This looks great! I see you did a lot of testing and the assembly does look significantly better, if only for the CPU. I did some rudimentary testing with CUDA C, and I was surprised that the improvement was only minor for a GPU. After some digging I found that the nvcc compiler manages to reduce the branching for this part of the device code quite well on its own (at least better than gcc).
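For comparison, the selection can also be written without any `if` at all, which is roughly the kind of form a compiler may arrive at when it predicates the code itself. This is only a sketch under assumptions: `mode_branchless` is a hypothetical helper, `rnd` is assumed to be a random candidate that has already been drawn, and whether the generated machine code is actually branch-free still depends on the compiler and target.

```c
/* Hypothetical branch-free selection of the most common of four values.
 * rnd is the pre-drawn fallback used when there is no unique mode. */
static int mode_branchless(int v00, int v10, int v01, int v11, int rnd)
{
    int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
    int cv10 = (v10 == v01) + (v10 == v11);
    int cv01 = (v01 == v11);

    /* Each flag is 0 or 1, and exactly one of the four ends up set. */
    int take00 = (cv00 > cv10) & (cv00 > cv01);
    int take10 = (cv10 > cv00);
    int take01 = (cv01 > cv00) & !take10;
    int take_rnd = !(take00 | take10 | take01);

    /* Multiply-and-add acts as a select without control flow. */
    return v00 * take00 + v10 * take10 + v01 * take01 + rnd * take_rnd;
}
```

The trade-off is that the random candidate has to be computed up front even when it is not needed, so this form only pays off where divergent branches cost more than the extra arithmetic.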
Hello, and thanks for this awesome library.
This PR is a step toward #18 and implements generation of areas using OpenCL.
Lacking features
Performance
When generating 64 seeds per routine, I observed a 30x speedup.
When generating 1 seed per routine, the speedup is only 5x.
Terribly sorry for dumping such a large chunk of code in a single PR, but I needed to see whether
my approach for avoiding recomputing the same layer multiple times works before I submitted this.