Lid driven cavity with obstacles
Once again the importance of making a gold kernel cannot be understated as I had quite a number of bugs in my initial CUDA implementation. I use a "multi-tap" type approach when debugging where the kernels write intermediate results to device memory as they go along. This can be easily compared to data coming from the various stages of the gold kernel and makes it much easier to identify the source of the error. Keep in mind the CUDA floats will never be the same as the CPU's floats (CPU uses 80 bits to compute intermediate results).