## Tuesday, 28 April 2009

### 3D Gaussian - in sections

After installing VS2008 it took me a little while to get all my projects to compile nicely again. Quite a big jump from 2003->2008 :) and a lot of my paths were a bit out. I'll post later about getting CUDA 2.2 to work in VS2008 and the problems I had making it cross compile.

Just time this evening for a quick update: I finished my 3D Gaussian convolution on my generated data set. The attached image has clearly visible joins in it; these are there to make sure my segmented approach to processing big data sets works correctly. They are easily removed by overlapping the non-boundary segments. The colours are based on my transfer function and don't reflect the underlying data accurately yet. It renders the volume at around 40fps.

 3D Gaussian convolution (segmented) output
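
The overlap idea can be sketched on the CPU in 1D. This is only an illustration, not the code that produced the image: the function names, the `halo` parameter and the clamp-to-edge boundary handling are all assumptions made for the example. Each segment is read with `radius` extra samples on either side and only its interior is kept, so the result should match processing the whole array in one go; passing `halo=0` reproduces the kind of joins visible in the image.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalised 1D Gaussian weights with 2 * radius + 1 taps."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]

def convolve_clamped(data, kernel):
    """Direct 1D convolution; reads outside the array are clamped to the edge."""
    radius = len(kernel) // 2
    n = len(data)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * data[j]
        out.append(acc)
    return out

def convolve_segmented(data, kernel, segment_len, halo=None):
    """Convolve segment by segment, reading `halo` extra samples per side."""
    radius = len(kernel) // 2
    if halo is None:
        halo = radius  # enough overlap to hide the joins
    n = len(data)
    out = []
    for start in range(0, n, segment_len):
        stop = min(start + segment_len, n)
        lo = max(start - halo, 0)   # extend the read window, not the output
        hi = min(stop + halo, n)
        chunk = convolve_clamped(data[lo:hi], kernel)
        out.extend(chunk[start - lo:stop - lo])  # keep only the interior
    return out
```

The same reasoning would apply per axis in 3D: a segment needs `radius` extra voxels on every face that touches a neighbouring segment.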

1. Question: what speeds did you reach for small and big datasets?
I am programming Gaussian smoothing right now as well and want to know how fast it can be made to run.
My separable convolution filtering shows the following results for 256x256x128 float values with a radius 8 float convolution filter mask:
X direction - 6.9 ms (my data aligned in x direction)
Y direction - 12.5 ms
Z direction - 12.5 ms
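
For readers new to the separable approach mentioned in this comment, here is a minimal CPU sketch. It assumes a flat volume stored x-fastest and clamp-to-edge boundaries; the function names and the layout are illustrative choices, not Sergey's implementation. A 3D Gaussian is applied as three 1D passes, one per axis.

```python
def convolve_axis(vol, dims, kernel, axis):
    """One 1D pass along `axis` (0 = x, 1 = y, 2 = z) of a flat, x-fastest volume."""
    nx, ny, nz = dims
    strides = (1, nx, nx * ny)      # distance between neighbours along each axis
    radius = len(kernel) // 2
    n, stride = dims[axis], strides[axis]
    out = [0.0] * len(vol)
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                idx = x + nx * (y + ny * z)
                pos = (x, y, z)[axis]
                acc = 0.0
                for k, w in enumerate(kernel):
                    p = min(max(pos + k - radius, 0), n - 1)  # clamp to edge
                    acc += w * vol[idx + (p - pos) * stride]
                out[idx] = acc
    return out

def gaussian_3d_separable(vol, dims, kernel):
    """Apply the same 1D kernel along x, then y, then z."""
    for axis in (0, 1, 2):
        vol = convolve_axis(vol, dims, kernel, axis)
    return vol
```

With this layout the x pass walks contiguous memory while the y and z passes stride through it, which is consistent with the x pass being the fastest of the three timings above.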

2. Hi Sergey,

Those timings look good. My current algorithm is designed for a 2 radius convolution filter (5 wide) and uses uchars as I'm primarily working with image data.
I'll modify it to use an 8 radius filter and floats and see what timings I get.

/Barrett

3. Hi, Barrett.

> I’ll modify it to use an 8 radius filter and floats and see what timings I get.
That would be very nice. Thanks.

Sergey

4. Hi Sergey,

I was a bit short of time last night but did manage to convert the code to use floats.

All timings are for 256x256x256 floats but do not include time to transfer to and from the device.
a = radius 2 Gaussian (5 wide)
b = radius 8 Gaussian (17 wide)

On the 8800GT
a: 7ms
b: 13ms

On the C1060
a: 5ms
b: 8ms

I didn't time the individual kernels but rather the total time for them all.
The b timing seems too low, as I expected it to be about 3x longer than the a timing.

The C1060 timings will probably need to be run again as my cube was busy running other things on the cpu at the same time.

Please note that these timings are provisional as I am yet to write a gold kernel to check the outputs are correct.
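
The "about 3x" expectation follows from the tap counts, and the arithmetic is easy to check. The sketch below uses the timings quoted in this comment. The bandwidth estimate additionally assumes three separable passes that each read and write the volume exactly once, which is a guess about the kernels and not something stated here.

```python
def taps(radius):
    """Filter width in samples for a given radius."""
    return 2 * radius + 1

def effective_bandwidth_gb_s(n_voxels, bytes_per_voxel, passes, seconds):
    """GB/s moved if every pass reads and writes the whole volume once (assumption)."""
    bytes_moved = n_voxels * bytes_per_voxel * 2 * passes
    return bytes_moved / seconds / 1e9

tap_ratio = taps(8) / taps(2)   # 17 taps versus 5 taps
ratio_8800gt = 13 / 7           # measured b / a on the 8800GT
ratio_c1060 = 8 / 5             # measured b / a on the C1060

# Radius 2 run on the 8800GT: 256^3 floats, 3 assumed passes, 7 ms in total.
bw_8800gt = effective_bandwidth_gb_s(256 ** 3, 4, 3, 7e-3)

print(tap_ratio, ratio_8800gt, ratio_c1060, bw_8800gt)
```

If the implied bandwidth is close to what the card can deliver, the kernels may be limited by memory traffic and not by arithmetic, in which case the tap count would matter less than expected. That is one possible reading, not something verified here.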

5. Hello, Barrett.
This is quite exciting. Too fast for an 8800GT. My GPU is an 8800GTX and my implementation is still far behind yours. Could you maybe share your experience in making it SO fast? I don't mean sharing your code :)
It would be very interesting and instructive to see how you manage your data, organize it into blocks, perform the convolution and so on.

6. Hi Sergey,

I've been meaning to write up the technique I use and publish it and the source code on the site for ages.

But first I need to verify the output is correct :)

One tip I can give you right now is to have every thread handle multiple data elements. I also stored my convolution coefficients in constant memory (low latency access), as it seemed a waste to calculate them in every thread. This isn't always true though, as I believe the ALUs on the GPU can do a float operation in 4 clocks, which is the same as a constant memory access. But as the Gaussian coefficients need more than one floating point operation, it's faster just to pre-calculate them.

I'll find some time this weekend to verify the code and write something here.

/Barrett
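
The two tips in this comment can be illustrated with a CPU emulation. This is a sketch under stated assumptions, not the kernel being discussed: `COEFFS` stands in for a table in constant memory, and `SIGMA`, `ELEMENTS_PER_THREAD` and the function names are hypothetical values and names picked for the example.

```python
import math

RADIUS = 2
SIGMA = 1.0              # hypothetical value
ELEMENTS_PER_THREAD = 4  # hypothetical value

def make_coeffs(radius, sigma):
    """Normalised Gaussian weights, computed once and reused by every thread."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]

# Stand-in for a constant memory table: filled once before any "thread" runs.
COEFFS = make_coeffs(RADIUS, SIGMA)

def thread_body(thread_id, src, dst):
    """Work of one emulated thread: ELEMENTS_PER_THREAD consecutive outputs."""
    n = len(src)
    first = thread_id * ELEMENTS_PER_THREAD
    for i in range(first, min(first + ELEMENTS_PER_THREAD, n)):
        acc = 0.0
        for k in range(-RADIUS, RADIUS + 1):
            j = min(max(i + k, 0), n - 1)   # clamp to edge
            acc += COEFFS[k + RADIUS] * src[j]
        dst[i] = acc

def launch(src):
    """Run every emulated thread in turn; on a GPU they would run in parallel."""
    n = len(src)
    n_threads = (n + ELEMENTS_PER_THREAD - 1) // ELEMENTS_PER_THREAD
    dst = [0.0] * n
    for t in range(n_threads):
        thread_body(t, src, dst)
    return dst
```

The sketch only shows the index mapping and the shared coefficient table. A real kernel would likely choose which elements each thread handles with memory coalescing in mind.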

7. Hi Barrett,

I also use pre-calculated masks for the Gaussian in each direction and store them in constant memory as well. I thought about storing them in shared memory, but decided not to waste it.

The idea of convolving several elements in one thread is worth trying. I hope it can save several calculations for each thread. I will try it as soon as I finish my parallel tridiagonal system solver for my master's thesis project (maybe you have done some research in this direction? :) ).

Thanks for the tips. Looking forward to your posts.

Regards,
Sergey

8. Hi Sergey,

Posts coming soon :) I've just been absolutely swamped with work from my job so will be posting in installments.

Yes I have done a bit of work on a tridiagonal banded solver. It's a good thing to work on as they are used in many different fields.

/Barrett