I've been playing with GPU coding, and especially CUDA, for quite a while now and thought I'd start posting my findings here. NVIDIA have done a great job with CUDA and their GPU families and I thoroughly recommend their products. I'm currently running an 8800GT as my coding board and a 7900GTX as my display card.
I was optimizing my grayscale kernel last night. It's quite an interesting one to optimize as the calculation it performs is trivial, and there is no warp divergence at all since the kernel contains no conditionals. But it depends heavily on global reads and writes.
Now in NVIDIA GPU land a global memory read costs around 600 cycles of latency - ouch! Shared memory, by contrast, is accessed in about 4 cycles (equivalent to a register). The way the thread scheduler copes with memory latency is to swap the stalled warps out and run others that are ready (from the same block or the next one). This is important as it directly impacts the grayscale optimization.
At first glance at the kernel, each block is processing x threads with each thread doing one pixel of the image. It's worth mentioning here that I use RGBA bitmap images to help with memory coalescing. To optimize this you could read a block of pixels from global memory into shared memory, sync your threads, perform the calculation on the shared memory, sync the threads again, then copy the shared memory back to global memory.
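That staging pattern might look something like the following minimal sketch. This is my reconstruction, not the exact kernel from this post: the kernel name, the packed uchar4 RGBA layout, the fixed 256-thread tile and the integer luminance weights are all my own choices.

```cuda
// Sketch of the shared-memory version: stage pixels in, sync, compute,
// sync, stage back out. Assumes blockDim.x == 256 to match the tile.
__global__ void grayscaleShared(uchar4 *img)
{
    __shared__ uchar4 tile[256];            // one entry per thread in the block

    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = img[idx];           // global -> shared
    __syncthreads();

    uchar4 p = tile[threadIdx.x];
    // integer approximation of luminance 0.299R + 0.587G + 0.114B
    unsigned char y = (unsigned char)((77 * p.x + 150 * p.y + 29 * p.z) >> 8);
    tile[threadIdx.x] = make_uchar4(y, y, y, p.w);
    __syncthreads();

    img[idx] = tile[threadIdx.x];           // shared -> global
}
```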
This actually works, but it was slower than my initial primitive implementation, in which each thread in a block would simply read from global memory, do its calculation and write back to global memory. Immediately you can spot the lack of thread syncs there, which seem to have an impact (even though the warps won't diverge).
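For comparison, the primitive version needs no shared memory and no syncs at all. Again a hedged sketch under the same assumptions (uchar4 RGBA pixels, my own kernel name and weights):

```cuda
// Sketch of the "primitive" version: each thread reads its pixel straight
// from global memory, converts it, and writes it back. No shared memory,
// no __syncthreads().
__global__ void grayscaleDirect(uchar4 *img)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    uchar4 p = img[idx];                    // one coalesced global read
    unsigned char y = (unsigned char)((77 * p.x + 150 * p.y + 29 * p.z) >> 8);
    img[idx] = make_uchar4(y, y, y, p.w);   // one coalesced global write
}
```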
Knowing the thread scheduler takes care of swapping blocks out on memory latency, the primitive version is no worse than the shared memory implementation IF your grid size is large enough. With a large grid the latency of the initial read is still 600 cycles, but blocks 1, 2, 3, 4, 5, etc. are not waiting for their reads sequentially - they are all waiting in parallel (+ a delta of block swap time).
With a large enough grid size (I'm using the image size / thread count = a big number) this latency issue is minimized without relying on any shared mem.
Next I investigated the thread count in each block, making sure it was always a multiple of the warp size, going up to 512 threads - my device's maximum thread count per block - which is 2^9 (warp factor 9 :p ).
Two images were tried:
A has a resolution of 2272 by 1704 and B has a resolution of 2275 by 3300
Each image was processed 100 times by the kernel, making sure each kernel invocation had finished before the next one was launched (i.e. no kernel queuing), and the times in ms were noted.
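A timing harness for that might look something like this - my own reconstruction, not the original test code. It assumes a kernel named grayscaleDirect and uses CUDA events, synchronizing after each launch so kernels never queue up:

```cuda
// Sketch of the timing loop (assumed names: grayscaleDirect, d_img).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float totalMs = 0.0f;
for (int run = 0; run < 100; ++run) {
    cudaEventRecord(start, 0);
    grayscaleDirect<<<gridSize, threadsPerBlock>>>(d_img);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait for this launch before the next

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
}

cudaEventDestroy(start);
cudaEventDestroy(stop);
```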
The results are as follows:
Threads/Block:   32       64       128      256      512
A (ms):          777      794      831      883      1005
B (ms):          4186     2994     2463     2739     4465
Grid size A:     120984   59640    28968    13632    6816
Grid size B:     234300   115500   56100    26400    13200
Now before jumping to conclusions based on the thread count alone, keep in mind that an increase in threads per block means a decrease in grid size for this kernel.

On the smaller image (A) we can clearly see that as the grid size decreases (thread count per block increases) the time increases - with fewer blocks the scheduler cannot hide the memory latencies as effectively.

On the larger image (B), which involves many more memory accesses, there is a clear sweet spot in grid / thread size; in fact the timings fit a parabola. Based on what we observed on image A we would expect the time simply to increase as the grid size decreases, so the initial drop in B's timings needs explaining. It may be that the scheduler struggles with very high block counts (an observation not yet verified by any of my other tests). Looking at the grid sizes, B at 64 threads (approx 115500 blocks) is roughly the same size as A at 32 threads (approx 120000 blocks), yet over the same change in grid size image A's timings increased while B's decreased. The total thread counts (grid size * threads per block) are roughly constant too, which leads me to believe the thread count within a block does impact performance in its own right.
Note that neither image's pixel count is a perfect multiple of the thread count, and there are a couple of ways I deal with this:
- (the one used in this test) allocate more memory than needed on the device so that the full warp can read / write to something.
- use the CPU to process the bit of the image left over (kernel invocations are async, so this can overlap with the GPU work).
- check the size inside the kernel and branch if needed - really try to avoid this one, as warp divergence seems to result in a substantial performance hit.
At the moment I don't have a clear-cut formula to work out the grid / block size, but given the marked change in performance it is well worth playing around a bit to find the sweet spot. Be sure your block size is a multiple of the warp size, and wherever possible avoid warp divergence.