Tuesday 12 May 2009

Tesla C1060 memory performance

According to the CUDA programming guide, the memory coalescing rules have been relaxed on devices with compute capability 1.2 or greater.  Chapter 5 has a subsection entitled "Coalescing on Devices with Compute Capability 1.2 and Higher" which gives more information.

As I now have a C1060, which has compute capability 1.3, I thought I'd run my old coalescing tests on it to see how things have improved.  I modified the launch configuration to 32760 blocks in order to maximize utilization of all the multiprocessors and increased the thread count to 256. With these changes the kernels report 100% utilization in the profiler.  I expected the uncoalesced part of the old tests to be quite kind to the Tesla, as their access patterns fall within the parameters given in the programming guide, and indeed the memory transfer rates were similar to a pure device to device copy. I then modified the uncoalesced kernels so that the device can no longer automatically coalesce the memory accesses of a half-warp.
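To make the "auto coalescing" idea concrete, here is a small host-side model of how a compute 1.2+ device groups the accesses of one half-warp into memory transactions. It is only a sketch of the protocol as I understand it from the programming guide, not NVIDIA's implementation. The function name `half_warp_transactions` and the three access patterns are hypothetical illustrations; they are not the kernels used in these tests.

```python
def half_warp_transactions(addresses, word_size=4):
    """Model the memory transactions issued for one half-warp (16 threads).

    addresses: byte address requested by each thread, in thread order,
               assumed to be aligned to word_size.
    Returns a list of (start_address, size_in_bytes) transactions.
    """
    # Segment size depends on the word size: 32 B for 1-byte words,
    # 64 B for 2-byte words, 128 B for 4-, 8- and 16-byte words.
    segment = {1: 32, 2: 64}.get(word_size, 128)
    pending = list(addresses)
    transactions = []
    while pending:
        # Segment containing the address of the lowest-numbered active thread.
        seg_start = (pending[0] // segment) * segment
        served = [a for a in pending if seg_start <= a < seg_start + segment]
        # Shrink the transaction while only its lower or upper half is used.
        start, size = seg_start, segment
        while size > 32:
            half = size // 2
            if all(a < start + half for a in served):
                size = half
            elif all(a >= start + half for a in served):
                start, size = start + half, half
            else:
                break
        transactions.append((start, size))
        # Threads served by this transaction become inactive.
        pending = [a for a in pending
                   if not (seg_start <= a < seg_start + segment)]
    return transactions


WORD = 4  # bytes per 32-bit element

# Hypothetical access patterns for the 16 threads of a half-warp.
in_order = [WORD * t for t in range(16)]               # thread t reads word t
reversed_order = [WORD * (15 - t) for t in range(16)]  # same words, permuted
strided = [WORD * 32 * t for t in range(16)]           # one word per 128 B segment

for name, pattern in [("in order", in_order),
                      ("reversed", reversed_order),
                      ("strided", strided)]:
    tx = half_warp_transactions(pattern)
    moved = sum(size for _, size in tx)
    print(f"{name}: {len(tx)} transaction(s), {moved} bytes moved")
```

In this model a permutation within one segment still costs a single 64-byte transaction, which is why a pattern that counted as uncoalesced on compute 1.0/1.1 hardware can run at full speed here. The strided pattern needs 16 separate 32-byte transactions to deliver 64 useful bytes, so a kernel has to spread a half-warp across segments like this to defeat the hardware.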

The Profiler results are as follows:

Device to Device bandwidth:  71.95 GB/s  (as reported by the bandwidth sample application, MB/s divided by 1024)

Original Test kernels (as per earlier post):

Kernel            Read      Write     Total    (GB/s)
U read U write    38.4642   38.4642   76.9284
C read U write    24.9427   49.8854   74.8281
U read C write    50.3883   25.1941   75.5824
C read C write    35.6843   35.6843   71.3685

Note that, according to the programming guide, these kernels are not uncoalesced on a compute 1.3 device. It's very interesting that three of them are above the device to device transfer rate and the fourth is within 1% of it. This could be due to inaccuracies in the profiler or the bandwidth tester and is well worth further investigation, as it might be beneficial to write a kernel to do device memory copies. Of particular note is the 7 to 8% higher throughput of the uncoalesced read and write kernel compared with the purely coalesced one (76.93 vs 71.37 GB/s). I still need to verify whether this is due to the way the profiler performs its calculations or whether it is in fact a real performance increase.
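The comparison against the device to device figure is simple arithmetic, sketched here for anyone who wants to repeat it with their own numbers. The values are the combined read plus write throughput reported for each of the original kernels; the helper name `percent_vs` is my own.

```python
D2D_GBPS = 71.95  # device to device bandwidth from the bandwidth sample

# Combined read + write throughput (GB/s) of the original test kernels.
original = {
    "U read U write": 76.9284,
    "C read U write": 74.8281,
    "U read C write": 75.5824,
    "C read C write": 71.3685,
}


def percent_vs(value, baseline):
    """Signed percentage difference of value relative to baseline."""
    return (value / baseline - 1.0) * 100.0


for name, total in original.items():
    print(f"{name}: {percent_vs(total, D2D_GBPS):+.1f}% vs device to device")

uu_vs_cc = percent_vs(original["U read U write"], original["C read C write"])
print(f"U/U vs C/C: {uu_vs_cc:+.1f}%")
```

The three kernels with an "uncoalesced" side come out roughly 4 to 7% above the device to device rate, while the purely coalesced kernel lands just under it.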

Test kernels modified to avoid coalescing by the device:

Kernel            Read      Write     Total    (GB/s)
U read U write    26.2565   26.2565   52.5131
C read U write     5.99417  47.9533   53.9475
U read C write    53.4788    6.68485  60.1637
C read C write    36.2168   36.2168   72.4337

Only the coalesced read and coalesced write kernel reached the same bandwidth as in the previous test, and both results are close to the device to device transfer rate.

The extremely low transfer rates of the middle two kernels could be due to the way the profiler calculates its values, so for now I am going to disregard them until I know for certain how they are calculated or can construct a better test. The result we were looking for is that the uncoalesced read and write kernel achieves about 27.5% less throughput than the coalesced read and write kernel (52.51 vs 72.43 GB/s). Put the other way round, the coalesced kernel is about 38% faster.
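The size of the gap depends on which kernel is taken as the baseline, so here is the calculation both ways. This is a minimal sketch using the two combined throughput figures from the modified test; the variable names are my own.

```python
# Combined read + write throughput (GB/s) from the modified test kernels.
uncoalesced = 52.5131  # U read U write
coalesced = 72.4337    # C read C write

# How much throughput the uncoalesced kernel loses, relative to coalesced.
slowdown_pct = (1.0 - uncoalesced / coalesced) * 100.0

# How much faster the coalesced kernel is, relative to uncoalesced.
speedup_pct = (coalesced / uncoalesced - 1.0) * 100.0

print(f"uncoalesced is {slowdown_pct:.1f}% slower than coalesced")
print(f"coalesced is {speedup_pct:.1f}% faster than uncoalesced")
```

The 38% figure corresponds to the second calculation, with the uncoalesced kernel as the baseline.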

In conclusion: It is still beneficial on a compute 1.3 device to ensure your memory access patterns are coalesced or will be coalesced by the device within its limits (as mentioned in the guide) in order to reach maximum data throughput.

