Saturday 16 May 2009

Tesla C1060 memory performance #2

In my post a few days ago I mentioned that some of the numbers reported by the Visual Profiler needed further investigation.

In particular, the device-to-device memory bandwidth reported by the profiler differed from the value reported by the bandwidth test sample. This was easily tracked down: according to the documentation, the profiler divides the bytes read/written by 10^9, whereas the bandwidth test sample divides by (time in milliseconds * 2^20). So 1000 * 2^20 / 10^9 = 1.048576, i.e. the bandwidth sample reports a lower number.



On my machine I get 73687 MB/s device-to-device bandwidth reported by the bandwidth sample. Multiply that by the factor above and you get 77266 MB/s, which falls in line with what I see from my Uncoalesced Read / Uncoalesced Write sample. So it doesn't appear to be faster to write your own memory-copy kernel than to use the built-in memory-copy function, despite what we noticed in the previous post.
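The conversion is easy to check; a quick sketch in Python (the 73687 figure is just the measurement from my machine above):

```python
# The profiler divides bytes by 10^9 (decimal gigabytes); the bandwidth
# test sample divides by (time in ms * 2^20), i.e. it uses binary megabytes.
factor = 1000 * 2**20 / 10**9
print(factor)  # 1.048576

# Rescaling the bandwidth sample's figure into the profiler's units:
print(round(73687 * factor))  # 77266
```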

For the sake of simplicity I will continue to refer to Uncoalesced as U and Coalesced as C.

When further investigating the numbers from the second test set I noted various differences: the kernels were compiled with different register counts. In the first test set each kernel used 3 registers per thread. In the second test set the C read C write kernel still used 3 registers, the U read U write kernel used 4 registers, but the U read C write and C read U write kernels each used 2 registers per thread.

The profiler also reported that the U reads/writes were issued as 32-bit accesses, whereas the C reads/writes were 64-bit accesses to global memory. What makes this very interesting is that in the test set where the compiler/device did manage to coalesce the Uncoalesced reads/writes, it performed them as 128-bit accesses, whereas the C accesses were still 64-bit! This could explain why the U read/write kernel from the first set achieved the highest bandwidth (or so it would seem with the numbers at the moment; we shall soon see this is not the case).

So in the second test the C reads/writes stay at 64-bit but the U reads/writes change from 128-bit to 32-bit! Ouch!

Now to the memory throughput numbers reported by the profiler:

Each kernel is called 960 times with the following total time (usec) reported:

read U write U:     9.13834e+06
read C write U:     5.00365e+06
read U write C:     4.48666e+06
read C write C:     828142

Now we know, by looking at the kernels, that each one should read 4 bytes (1 float) and write 4 bytes per thread, irrespective of how the device/compiler decides to perform these accesses.

In the second test we have 32760 blocks and 256 threads per block so we should have: 960 * 32760 * 256 * (4+4) bytes read/written from global memory.  This gives us 64408780800 bytes for all 960 launches.
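As a sanity check, the total-byte arithmetic works out as follows:

```python
# Each thread reads one float (4 bytes) and writes one float (4 bytes).
launches = 960
blocks = 32760
threads_per_block = 256
bytes_per_thread = 4 + 4

total_bytes = launches * blocks * threads_per_block * bytes_per_thread
print(total_bytes)  # 64408780800
```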

This gives us the following (dividing bytes by microseconds gives MB/s if a megabyte is taken as 10^6 bytes; using 2^20 bytes per MB instead would give figures about 4.8576% lower):

read U write U:     7048.1926
read C write U:     12872.3593
read U write C:     14355.6188
read C write C:     77775.044
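These figures follow directly from dividing the total bytes by the reported times; a quick recomputation:

```python
# Total bytes across all 960 launches, as derived above.
total_bytes = 960 * 32760 * 256 * (4 + 4)  # 64408780800

times_usec = {
    "read U write U": 9.13834e6,
    "read C write U": 5.00365e6,
    "read U write C": 4.48666e6,
    "read C write C": 828142,
}

for name, usec in times_usec.items():
    # bytes / microseconds = 10^6 bytes per second, i.e. (decimal) MB/s
    print(f"{name}: {total_bytes / usec:.4f}")
```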

The only one that still seems similar is the C read/C write. Why the difference? 

The visual profiler documentation says the following about overall memory throughput:

"mem overall throughput (GB/s): Overall global memory access throughput in gigabytes per second. This is computed as (total bytes read + total bytes written)/(gpu time). Total bytes read is calculated using the profiler counters gld_32b, gld_64b and gld_128b. Total bytes written is calculated using the profiler counters gst_32b, gst_64b and gst_128b. This is supported only for GPUs with compute capability 1.2 or higher"

The clue here is that it's using all the counters, so I think the profiler is calculating throughput from the total memory transactions issued rather than the bytes our kernels actually need. This implies that our test results are even worse than the profiler indicates! The profiler isn't exactly incorrect: it's reporting what it sees. But in this case we know exactly what memory we are reading and writing, so we can improve on its calculation.

Using our new-found knowledge, let's go back and look at the Auto Coalesced test results from the first set:

There were 1984 calls resulting in 1984 * 32760 * 256 * (4+4) bytes read/written from global memory = 133111480320 bytes.

read U write U:    3.23675e+06 = 41125.0422   (+-MB/s)
read C write U:    2.47407e+06 = 53802.6330   (+-MB/s)
read U write C:    2.45264e+06 = 54272.7348   (+-MB/s)
read C write C:    1.71156e+06 = 77772.0210   (+-MB/s)
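The same recomputation for the first test set, using the 1984 launches above:

```python
# Total bytes across all 1984 launches.
total_bytes = 1984 * 32760 * 256 * (4 + 4)
print(total_bytes)  # 133111480320

times_usec = {
    "read U write U": 3.23675e6,
    "read C write U": 2.47407e6,
    "read U write C": 2.45264e6,
    "read C write C": 1.71156e6,
}

for name, usec in times_usec.items():
    print(f"{name}: {total_bytes / usec:.4f} MB/s")
```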

So our read C write C runs from both sets of tests are now almost exactly the same, as they should be!

In summary: don't trust the profiler completely. Although compute capability 1.2 and higher devices do a very good job of coalescing reads and writes, it is still not as efficient as hand-optimizing your accesses. Furthermore, if you do exceed the auto-coalescing guidelines you will pay a very heavy price in performance.
