Archive for the ‘Sorting’ Category

GPU Radix Sort

Thursday, March 5th, 2009

In the ongoing search of finding efficient GPU sorting algorithms I came across the following paper: http://mgarland.org/files/papers/gpusort-ipdps09.pdf   by Nadathur Satish, Mark Harris, and Michael Garland.  It is a preprint for the 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009 (http://www.ipdps.org/).

I have not tested if it is faster than the GPU quicksort (http://www.cs.chalmers.se/~dcs/gpuqsortdcs.html). Normal quicksort does slow down on nearly sorted or already sorted data so a radix sort might exhibit more stable behaviour.

Investigation of Uncoalesced vs Coalesced Read and Write Speeds in CUDA

Tuesday, January 27th, 2009

While working on my sorting algorithms I came across an interesting scenario. At the moment my still incomplete sort uses two kernels. At the end of kernel A I have the choice of writing out to global memory in an uncoalesced or coalesced manner.  If kernel A writes out in a coalesced way then kernel B’s initial read of global memory will have to be uncoalesced. Likewise if kernel A’s write is uncoalesced then kernel B’s read will be coalesced.

The CUDA documentation does not mention if uncoalesced writes are the same speed wise as uncoalesced reads so in order to get the peak performance of my sorting I needed to run a few tests with different kernels in the CUDA profiler.

I made four kernels, each doing a very simple task:  read in from global memory and store in shared mem then write from shared mem into a different area of global memory. Each kernel reads or writes to or from global memory in a different fashion to ensure coalesced / uncoalesced access for my 8800GT.  Note that on newer cards the coalescing rules are a bit different (see the documentation for details)

(more…)