Now that I have a C1060, which has compute capability 1.3, I thought I'd rerun my old coalescing tests on it to see how things have improved. I modified the launch configuration to 32760 blocks in order to maximize utilization of all the multiprocessors and increased the thread count to 256; with these changes the kernels report 100% utilization in the profiler. I expected the uncoalesced part of the old tests to be quite kind to the Tesla, since those access patterns fall within the parameters given in the programming guide, and indeed their memory transfer rates were similar to a pure device-to-device copy. I then modified the uncoalesced kernels to defeat the automatic coalescing of memory accesses within a half-warp.
The profiler results are as follows:
Device-to-device bandwidth: 71.95 GB/s (as reported by the bandwidth sample application; the MB/s figure divided by 1024)
Original Test kernels (as per earlier post):
|Kernel|Read (GB/s)|Write (GB/s)|Total (GB/s)|
|---|---|---|---|
|U read U write|38.4642|38.4642|76.9284|
|C read U write|24.9427|49.8854|74.8281|
|U read C write|50.3883|25.1941|75.5824|
|C read C write|35.6843|35.6843|71.3685|
Note that these kernels are not uncoalesced on a compute 1.3 device according to the programming guide. It's very interesting that all of them are at or above the device-to-device transfer rate. This could be due to inaccuracies in the profiler or the bandwidth tester and is well worth further investigation, as it might be beneficial to write a kernel to do device memory copies. Of particular note is the 7% throughput increase of the uncoalesced read and write kernel over the purely coalesced one. I still need to verify whether this is due to the way the profiler performs its calculations or whether it is in fact a real performance increase.
Modified test kernels to avoid coalescing by device:
|Kernel|Read (GB/s)|Write (GB/s)|Total (GB/s)|
|---|---|---|---|
|U read U write|26.2565|26.2565|52.5131|
|C read U write|5.99417|47.9533|53.9475|
|U read C write|53.4788|6.68485|60.1637|
|C read C write|36.2168|36.2168|72.4337|
Only the coalesced read and coalesced write kernel reached the same bandwidth as in the previous test, and it is in turn close to the device memory transfer rate.
The extremely low transfer rates of the middle two kernels could be an artifact of how the profiler calculates its values, so for now I am going to disregard them until I know for certain how they are calculated, or until I construct a better test. The headline result is that the uncoalesced read and write kernel runs about 38% slower than the coalesced read and write kernel (52.5 GB/s versus 72.4 GB/s, a ratio of roughly 1.38).
In conclusion: it is still beneficial on a compute 1.3 device to ensure your memory access patterns are coalesced, or will be coalesced by the device within its limits (as described in the programming guide), in order to reach maximum data throughput.