In most programs there comes a time when you need to sort your data into some sort of order. Most recently I have needed to do this in both the "collapsing universe" simulation and my real time ray tracing acceleration. Ideally if you can get away without sorting data you should, but often you need to in order to accelerate a later stage of the algorithm thereby making a nett saving in time.
The popular sorting algorithms such as bubble sort, insertion sort, quicksort, radix sort, heapsort etc range in worst case order times from O(n log n) to O(n^2) . The order here is based on the number of element comparisons. And for memory requirements they range from O(1) to O(log n) (quicksort) although some not quite so popular algorithms use higher memory requirements.
The data I want to sort is already on the GPU I need a very efficient parallel sort as transferring the data back to the host for sorting would take too long. Although certain algorithms such as bubble sort do translate well to a parallel implementation they are far from ideal in terms of time. Recently a paper on GPU-quicksort was released (http://www.cs.chalmers.se/~dcs/gpuqsortdcs.html) which describes an implentation of the quicksort in CUDA. The sorting times from their implementation are very impressive.
Of course as you might have guessed I want to try and make my own parallel sort and have come up with two algorithms so far. While analysing their #comparisons order I realized that your traditional serial order calculations don't work for a massively parallel architecture. In a serial architecture each comparison happens in turn so the comparisons order roughly equals the time order. In a parallel architecture a large number of comparisons happen at exactly the same time so here the comparisons order is much greater than the time order. We really want to minimize the time order so that needs to become the focus of our calculations and not the number of comparisons.
Further constraints on my two candidate algorithms are more specific to CUDA. As thread communication / atomics / syncronization are expensive we need to minimize them. Even more important is to make sure we read and write from global memory in the most efficient way possible to maximize throughput due to the huge latency on global memory access.
After estimating orders on my first algorithm I get: (a) O(log(n^2)+3) (worst case) While this is not a very efficient sort it does allow for a rather efficient memory access pattern. So far I have a fully functioning gold kernel for it and a large chunk of the gpu kernel written. The best way of testing this is to benchmark it against other algorithms as I may be able to get a high occupancy rate for the algorithm that would give faster sorts than its order suggests.
So far I have not analysed my second algorithm but it does look promising. Its memory access pattern is poorer but it may have a better worst case time.
Once they are both completed I will bench mark them against the methods in the paper mentioned above and describe the algorithm in more detail. Maybe even putting them in a proper paper.