Thursday, 30 July 2009

3D Gaussian Convolution

There hasn't been much in the way of posts here lately as I've been really busy at work getting some new components built into the systems I work on. Not really hard but it's frustrating things like trying to get various components and libraries written in different languages to work together. So lately I've not had the energy to do much work on the computer once I get home...

There has been a bit of interest in my 3D Gaussian convolution kernels. Although I explained the technique mathematically in an earlier post I never actually posted the code. As it is rather quick and quite a novel way of calculating the convolution for a xy plane I decided to post it so everyone can benefit from / improve upon the technique.  As always comments / bug reports etc are always welcome :)

The code is a series of kernels that will do a Gaussian Convolution (smoothing) on a 3D data set of floats.   The Gaussian kernel that I used is a 5x5x5 one but it is relatively easy to extend this.  The data set is a volume 256x256x256 of floats. Again with some modifications the kernels will handle a greater or lesser volume - up to the max memory of your card.

In a trivial implementation you will read 5x5x5 data elements centred around your x,y,z point and then multiple them with their corresponding co-efficient. Finally summing them and dividing before saving into a seperate memory area. Of course this means 125 reads and associated calculations for every single data element in our 256x256x256 block resulting in over 2 billion reads - not very efficient!

Using the fact that the Gaussian filter we are using is actually a convolution we can do each direction in turn before summing the result and dividing. This gives us 256x256x256 x 5 x 3 = 251658240 reads - which is over 8 times lower than the trivial implementation!

As we are using the co-efficients many times it is beneficial to pre-calculate these and store them in constant memory. Constant memory on the device is cached and is really fast. Please note that this may not always be the case but for the Gaussian co-efficients that require more than one floating point operation to calculate this results in a large time saving.  This is implemented in "calcGaussianCoefficients" in the source code here:  [download id="5"] .

Now the general method would be to process the x,y and z directions in turn, however it is possible to calculate a plane at a time while only having to store one column at a time in shared memory.  If all threads in a block are calculating a row each they will finish an element of the row at the same time (make sure to syncthreads to ensure they are all done) as we then have a completed calculation for the same element in each row it is then possible to convolve in the y direction without having to read any additional data from global memory.   Once the y convolution is completed you save the result back to a tempory area in global memory.   This technique ensures you get your memory reads and writes fully coalesced while simulateously minimizing the number of memory reads required.  The limited shared memory available can mean there may be bank conflicts between the threads but a bank conflict is still faster than a global memory read.  This is implemented in "calcGaussianXYPlaneConvolution256" in the source code:  [download id="5"] .

The code for the xy convolution is very quick and uses a high percentage of the available bandwidth. As of writing it is the fastest Gaussian convolution I have seen for a plane. Certain assumptions have been made about the boundary conditions which are perfectly acceptable for my application but may not be ideal for your own. It is a good idea to check the output against what you expect using a Gold or Silver kernel.

Please note the code supplied is not generic and is specific to a 5x5x5 Gaussian on a set of  256x256 planes although it is very easy to modify to other sizes. The z convolution is not included in the source. In the z convolution you would need to divide to get the correct value.

As usual the code is licensed under the:  Creative Commons - Attribution-Share Alike 2.0 UK: England & Wales  License - ie feel free to use it in anything but some acknowledgement would be nice :)

1 comment:

  1. Quite a nice article however the downloads are missing.