Kernels of course! :)
Most readers of this blog will be familiar with a "Gold" kernel, in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent GPU implementations, including algorithmic optimizations.
Personally, I like most of my gold kernels to be naive implementations of an algorithm. This makes them easy to verify and usually easy to debug if there is a problem.
If you currently don't implement a Gold kernel before writing your CUDA implementations and/or adapting your algorithm, I strongly suggest you do.
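As a concrete illustration, here is what a Gold kernel might look like for a simple moving-average filter (a hypothetical example; the function name and window-handling are my own choices, not from this post). It is deliberately the textbook definition - no tiling, no tricks - so the output is easy to verify by hand and easy to step through in a debugger:

```cpp
#include <cstddef>
#include <vector>

// Naive "Gold" reference: moving average over a trailing window.
// Written for clarity and debuggability, not speed.
std::vector<float> goldMovingAverage(const std::vector<float>& in,
                                     std::size_t window)
{
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t i = 0; i < in.size(); ++i) {
        float sum = 0.0f;
        std::size_t count = 0;
        // Average over the last `window` samples (fewer at the start).
        std::size_t start = (i >= window - 1) ? i - (window - 1) : 0;
        for (std::size_t j = start; j <= i; ++j) {
            sum += in[j];
            ++count;
        }
        out[i] = sum / static_cast<float>(count);
    }
    return out;
}
```

The double loop is O(n * window) where a running sum would be O(n), but for a Gold kernel that inefficiency is a feature: every output element is computed independently from first principles.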
The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.
A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible, making sure we don't vary from the Gold kernel's implementation too much - only unwinding the loops into grid/blocks is allowed where possible. I use this type of kernel when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel, it becomes the new baseline kernel for later optimizations. This allows exact matching of later kernel outputs rather than an "acceptable deviation" approach.
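The two verification modes can be sketched as a pair of comparison helpers (hypothetical names, not from this post): a tolerance check for comparing against the Gold kernel, where CPU and GPU floating-point results may legitimately differ, and a bit-exact check for comparing optimized kernels against the Silver baseline:

```cpp
#include <cmath>
#include <cstring>
#include <vector>

// Gold vs. GPU: allow an acceptable numerical deviation.
bool withinTolerance(const std::vector<float>& a,
                     const std::vector<float>& b, float eps)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > eps) return false;
    return true;
}

// Silver baseline vs. optimized kernel: demand identical bits.
bool bitExact(const std::vector<float>& a, const std::vector<float>& b)
{
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

The memcmp comparison is the whole point of the Silver baseline: once the reference lives on the GPU, "correct" can mean "identical", with no epsilon to argue about.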
My Bronze kernel is extremely useful for detecting errors that occur in long chains of different kernel invocations, usually involving large datasets. The CUDA emulator can be used for this, but the performance hit often makes it take an unfeasibly long time to reach the area where your bug occurs. I usually use my Bronze kernel for diagnosing and fixing the "unspecified launch failure" message.
To implement a Bronze kernel:
Allocate host memory for all the data structures the original kernel depends on.
Make sure any constants used on the device are also available on the host.
Copy the data needed from the device into the host memory we allocated.
Re-wind the loops and re-code the device kernel back into a "loopy" form. Keep in mind that any __device__ calls would actually be inlined on the device, so try to do the same on the host; not implementing them inline can hide bugs due to stack preservation on the host side. Unthreading a kernel is sometimes rather tricky, but as far as possible get it back to a single-threaded approach - unless, of course, you are trying to debug a thread overlap / sync issue.
Execute the kernel - set any breakpoints / watches you may need.
Copy the data back to the correct structures on the GPU.
Deallocate the memory we allocated in the first step.
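The steps above can be sketched for a trivial scaling kernel (a hypothetical example). In real code d_in/d_out would be device pointers and the copies would be cudaMemcpy calls; here plain host vectors stand in for device buffers so the sketch stays self-contained:

```cpp
#include <cstddef>
#include <vector>

// The original device kernel, for reference (CUDA):
//   __global__ void scaleKernel(const float* in, float* out, float s, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) out[i] = in[i] * s;
//   }

void bronzeScale(const std::vector<float>& d_in, std::vector<float>& d_out,
                 float scale, int n, int blockDim, int gridDim)
{
    // 1. Allocate host memory for everything the kernel touches.
    std::vector<float> h_in(static_cast<std::size_t>(n));
    std::vector<float> h_out(static_cast<std::size_t>(n));

    // 2./3. Copy the data the kernel needs off the "device".
    //    Real code: cudaMemcpy(h_in.data(), d_in, ..., cudaMemcpyDeviceToHost)
    h_in = d_in;

    // 4. Re-wind the grid/block structure into explicit single-threaded
    //    loops, preserving the kernel's exact index arithmetic.
    for (int b = 0; b < gridDim; ++b) {
        for (int t = 0; t < blockDim; ++t) {
            int i = b * blockDim + t;  // == blockIdx.x * blockDim.x + threadIdx.x
            if (i < n)                 // breakpoints and asserts are cheap here
                h_out[static_cast<std::size_t>(i)] =
                    h_in[static_cast<std::size_t>(i)] * scale;
        }
    }

    // 5./6. Copy the results back to the "device" buffers.
    //    Real code: cudaMemcpy(d_out, h_out.data(), ..., cudaMemcpyHostToDevice)
    d_out = h_out;
    // 7. Host vectors free themselves on scope exit.
}
```

Because the loops run on the host, any stray index immediately trips the debugger, an assert, or a memory-checking tool, instead of surfacing later as an opaque launch failure.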
The benefit of this approach is that the host OS / CPU generates exceptions on memory accesses that have gone awry, loops that have exceeded their bounds, and so on. It also allows easy examination of variables and control flow without having to run the entire program under the emulator.
Recently, using the Bronze kernel technique, I detected an unsigned int overflow on an index that was all but impossible to find otherwise and did not occur on smaller test sets.
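To illustrate that class of bug (a constructed example, not the actual code from the incident): a flat index into a large 2D dataset computed in 32-bit unsigned arithmetic agrees with the 64-bit computation on small test sets, so the bug never shows up there; past 2^32 elements the product silently wraps and the index points somewhere else entirely:

```cpp
#include <cstddef>

// 32-bit multiply: wraps past 2^32 on large datasets.
std::size_t badIndex(unsigned int row, unsigned int cols, unsigned int col)
{
    return row * cols + col;
}

// Same arithmetic in size_t (64-bit on a typical host): no wrap.
std::size_t goodIndex(std::size_t row, std::size_t cols, std::size_t col)
{
    return row * cols + col;
}
```

On a small test set (say a 10 x 100 grid) the two agree, which is exactly why the overflow only appeared on the large production data.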