As those who follow this blog will know, I have been playing with large generated volumetric data sets (around 16GB). The first operation I performed on them was a 3D Gaussian convolution in order to smooth a lot of the randomness out. My first implementation of the kernel was just the naive one - take a point and look at all the surrounding points. As I'm using a 5x5x5 kernel this means 125 memory reads per point - very poor!
My usual steps to optimize a kernel consist of working out better memory access schemes, maximizing shared memory use and minimizing global/local memory accesses. However, when dealing with huge data sets, you need to take into account the host->device->host transfer transfer speeds especially when you can only place about 400MB onto a card at a time. Typically the PCI express data rate is MUCH slower than anything else in the system. Using a Tesla card with 4GB of memory does help but you soon get data sets that exceed this amount.
With these transfer limitations in mind it becomes apparent that you need to completely calculate all the required values for the chunk of data you have on the device even if this means some of the calculations on the boundaries are incomplete/invalid. For example if you normally calculate all the 1D gaussians in each direction individually then put them together (usually a good optimization) this will be very slow if you are having to chunk the bits of memory needed in and out. Then do it again for the next direction by pulling back to the device the partial calculation.
The best method I have found so far is to load in the largest chunk of memory the device can handle (bind it to a texture to take advantage of locality caching) then do the x,y and z directions on it before transferring it back to the host. Two problems arise immediately:
Data is incompletely processed on the boundaries of the block - this can be resolved by overlapping the blocks slightly
The indexing of the 3d data on the device is not the same as the indexing in the main (host) memory. This implies that copying data to the device becomes more tricky as portions of data have to be layered together and likewise copying back from the device means the data has to be "delayered" back into its correct structure on the host.
I found a good way to handle the latter case is to use SIMD instructions to copy row at a time from the main chunk of memory to a holding area (pinned memory if you want maximum performance) remembering to keep track of the calculated offsets to save time when the data is transferred back from the device. Use these offsets to copy it back to where it came from or in the case of the Gaussian Kernel to a temp data area. The temp data area has the same relative offsets for the rows and planes so storing them is not wasted.
In short big data sets add a significant extra layer of complexity to a problem and can completely change your standard kernel optimization strategy. Hope this post helps a bit :)