As those of you who follow the blog will know, I have been working on an acceleration mechanism for my raytracer over the past week.
As it turns out, converting my large stack of A4 sheets of diagrams and equations into proper code has not been as easy as I had hoped. With about a third of it implemented, I am getting no speed-up whatsoever in cudart. In fact, I have lost about 2 fps. This is largely to be expected, as I am pretty much hitting the limits of what my card can do.
All is not lost, though. Although the scene I currently render in cudart doesn't go any faster, more complex scenes no longer suffer as rapid a drop-off in fps as before. By "more complex" I simply mean scenes with more objects placed in them. The real test will come when the objects are placed in such a way as to minimize the effect of the acceleration structure.
The challenge with CUDA 1.1 is that you take a really big performance hit if more than one thread in a warp accesses the same memory location (something which I believe has been improved in 2.0). The way cudart currently works is to batch the various objects into shared memory (which has a 16 KB limit) and then operate on the shared copy. This technique has been working really well, so I have been trying to design an algorithm that will only batch in the closest objects likely to be intersected by the thread block.
While this isn't too hard for the initial rays, as they all travel in roughly the same direction and are largely coherent within a block, it gets a bit more tricky for reflected and shadow rays. I'm sure the problem is rather evident now: 192 threads in a block, each with a chance of intersecting x objects, rapidly adds up to more candidates than will fit in the available shared memory. Batching in more candidates over several passes works, but it does slow the process somewhat.
The challenge is to get all possible intersection candidates for the block into shared memory. Since the memory size is fixed, the only way is to improve the acceleration algorithm to the point where it produces only objects that are highly likely to be intersected. Of course, the more finely grained this becomes, the longer it takes to calculate, and any advantage is lost. I am currently trying to find the most efficient granularity at which my algorithm can accelerate the scene.
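
To make the trade-off a bit more concrete, here is a rough host-side C++ sketch of the bookkeeping involved. It is not the code in cudart. The `Sphere` layout, the `Cone` that bounds a block's rays (with its apex assumed to sit at the origin, as it would for primary rays from a camera placed there), and the function names `batchCapacity`, `batchCount` and `cullForBlock` are all made up for illustration. The only number taken from above is the 16 KB of shared memory.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical object layout: centre plus radius, 4 floats = 16 bytes.
struct Sphere { float x, y, z, r; };

// Hypothetical bound on all rays of one thread block: a cone with its
// apex at the origin, a unit-length axis and a half-angle in radians.
struct Cone { float ax, ay, az, halfAngle; };

// Shared memory available to a block, as quoted in the post (16 KB).
const std::size_t kSharedBytes = 16 * 1024;

// Objects that fit in one batch once `reservedBytes` of shared memory
// have been set aside for other per-block data.
inline std::size_t batchCapacity(std::size_t objectBytes, std::size_t reservedBytes) {
    assert(objectBytes > 0);
    if (reservedBytes >= kSharedBytes) return 0;
    return (kSharedBytes - reservedBytes) / objectBytes;
}

// Number of load-then-test passes needed for `candidates` objects.
inline std::size_t batchCount(std::size_t candidates, std::size_t capacity) {
    if (capacity == 0) return 0;
    return (candidates + capacity - 1) / capacity;  // round up
}

// Keep only the spheres that some ray inside the cone could hit.
// Seen from the apex, a sphere covers a cap of angular radius asin(r/d)
// around the direction to its centre, so it overlaps the cone exactly
// when the angle to the axis is at most halfAngle + asin(r/d).
inline std::vector<std::size_t> cullForBlock(const std::vector<Sphere>& spheres,
                                             const Cone& cone) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < spheres.size(); ++i) {
        const Sphere& s = spheres[i];
        float d = std::sqrt(s.x * s.x + s.y * s.y + s.z * s.z);
        if (d <= s.r) {              // apex is inside the sphere: always a candidate
            kept.push_back(i);
            continue;
        }
        float c = (s.x * cone.ax + s.y * cone.ay + s.z * cone.az) / d;
        if (c > 1.0f) c = 1.0f;      // guard acos against rounding error
        if (c < -1.0f) c = -1.0f;
        float toCentre = std::acos(c);
        float angularRadius = std::asin(s.r / d);
        if (toCentre <= cone.halfAngle + angularRadius) kept.push_back(i);
    }
    return kept;
}
```

With 16-byte spheres and nothing else reserved, `batchCapacity(sizeof(Sphere), 0)` gives 1024 objects per batch. Reserving, say, 8 bytes for each of the 192 threads brings that down to 928. The cone test works nicely for coherent primary rays, but for reflected and shadow rays the bounding cone of a block gets much wider (and no longer starts at the camera), so far fewer objects are culled and `batchCount` goes up, which is exactly the problem described above.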
As always, keep watching this space :)