WIP. Proof of concept for GPU accelerated genArea by hukumka · Pull Request #44 · Cubitect/cubiomes · GitHub

WIP. Proof of concept for GPU accelerated genArea #44


Draft
hukumka wants to merge 2 commits into master

Conversation

@hukumka commented Sep 30, 2020

Hello, and thanks for this awesome library.

This PR is a step toward #18 and implements generation of areas using OpenCL.

Lacking features

  • Layers past L_SHORE_16
  • Version support

Performance

When generating 64 seeds per routine, I observed a ~30x speedup.
When generating 1 seed per routine, the speedup is only ~5x.

Terribly sorry for dumping such a large chunk of code in a single PR, but I needed to see whether
my approach for avoiding recomputing the same layer multiple times works before submitting this.

@Cubitect (Owner) commented Oct 2, 2020

Thanks for the interest. I was always a little sceptical about performance with a GPU. Generating giant areas in one go might work reasonably well on a GPU, but the code is highly reliant on branching, which is like poison to a GPU and to SSE instructions. Also, I find myself needing small areas much more often than large ones, which makes this problem much worse. So I always leaned towards distributing the workload over CPU cores instead. That said, I'm quite interested to see what the performance would actually be using a GPU in different scenarios.

While checking out your branch I found a bug in the cubiomes library that caused allocCache to allocate too little memory when the entry point was one of the first few layers. That should be fixed now.

I found a couple of issues with the draft. I think at ocl_test.c:47 it should be bufferA[i + j*W] without the + s*W*H, and it does not seem to work for area sizes below 32x32.

out[xx + 1 + zz * w] = (cs >> 24) & 1 ? v10 : v00;
}
int v;
if (v10 == v01 && v01 == v11) v = v10;
A Contributor commented:
I did a few experiments one day trying to remove these branches, which come from select_mode_or_random. This is the alternative that worked best for me; it is about 25% faster than the "if cascade" on my CPU. I hope the difference is bigger on a GPU, but I can't try that myself:

https://github.com/Badel2/slime_seed_finder/blob/9334b161bd4b7b7b8d7251e48623d5803707921c/benches/select_mode_or_random.rs#L118

Suggested change
if (v10 == v01 && v01 == v11) v = v10;
int cv00 = (v00 == v10) + (v00 == v01) + (v00 == v11);
int cv10 = (v10 == v01) + (v10 == v11);
int cv01 = v01 == v11;
if (cv00 > cv10 && cv00 > cv01) {
    v = v00;
} else if (cv10 > cv00) {
    v = v10;
} else if (cv01 > cv00) {
    v = v01;
} else {
    // v = random
}

@Cubitect (Owner) commented Oct 5, 2020

This looks great! I see you did a lot of testing and the assembly does look significantly better, if only for the CPU. I did some rudimentary testing with CUDA C, and I was surprised that the improvement was only minor for a GPU. After some digging I found that the nvcc compiler manages to reduce the branching for this part of the device code quite well on its own (at least better than gcc).

3 participants