In my last post I mentioned that I was still investigating why the CUDA profiler was counting 4x as many coalesced writes as reads. The profiler's manual does mention that the counters are only a relative measure of performance and not absolute values, but the discrepancy still did not make sense.
As always, the helpful people at NVIDIA responded to my request almost immediately.
Apparently the GLD counter is incremented by 1 whenever a load request is sent, regardless of whether it is a 32B, 64B or 128B request, but the GST counter is incremented by 2 for a 32B store request, 4 for a 64B request and 8 for a 128B request.
This explains perfectly why my GST counters are 4x higher than the GLD counters: a 4x ratio is exactly what the increments above give for 64B requests. The time differences between uncoalesced reads and writes seem to remain, which is worth remembering for the reasons mentioned in the last article.
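
To make the arithmetic concrete, here is a minimal Python sketch of how the raw counters could be compared. It only encodes the increments quoted above; the function names are my own and not part of the profiler, and it assumes every request in a run has the same size, which a real kernel may not satisfy.

```python
# Counter increment per request, keyed by request size in bytes,
# using the values NVIDIA described above.
GLD_INCREMENT = {32: 1, 64: 1, 128: 1}
GST_INCREMENT = {32: 2, 64: 4, 128: 8}


def expected_counters(num_requests, request_bytes):
    """Return the (GLD, GST) counter values for an equal number of
    load and store requests of the given size (hypothetical helper)."""
    gld = num_requests * GLD_INCREMENT[request_bytes]
    gst = num_requests * GST_INCREMENT[request_bytes]
    return gld, gst


def normalized_store_requests(gst_count, request_bytes):
    """Scale a raw GST counter back to a request count that can be
    compared directly with GLD (assumes a single request size)."""
    return gst_count / GST_INCREMENT[request_bytes]


if __name__ == "__main__":
    for size in (32, 64, 128):
        gld, gst = expected_counters(1000, size)
        print(f"{size}B requests: GLD={gld} GST={gst} ratio={gst // gld}x")
```

For the same number of load and store requests, the ratio comes out as 2x, 4x and 8x for 32B, 64B and 128B requests, so dividing the GST counter by the matching increment puts reads and writes back on the same scale.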