Posts Tagged ‘CUDA’
Tuesday, June 22nd, 2010
No apologies for the long delays between posts, or even checking the blog. It just has to fit in with life at the moment and there is so much going on!
Those of you who have left comments on the blog or sent me emails should have got an answer last night or this morning, after a bit of a delay for some of you. I do apologise to those of you who sent an email and haven’t got a reply. The computer that used to handle all my email died and I haven’t had a chance to fix it yet. I think it’s just the power supply…. hopefully! So I will get those emails back in a few weeks, when I eventually get around to fixing it.
I’ve subsequently upgraded my email server - so any forthcoming mails will get to me.
In development work I’ve been working on my SPH simulations, and some GP stuff whenever I get a chance. GP is traditionally recursive (well, the equation trees are anyway) and has needed a substantial amount of reworking to run efficiently on the GPU.
Speaking of recursive…. in order to be Turing complete (assuming infinite memory for now), do you need to support / include recursion? Some posters on certain forums seem to think it is needed, but personally I can’t see why. Most recursion can, with a bit of effort, be made iterative - although possibly not in a very pretty or efficient way.
For example, a GPU doesn’t really support recursion*, but I would consider the CUDA / GPU combination to be Turing complete. Admittedly it is not very efficient in certain cases - a single thread, for example - and this again ignores the infinite memory issue. *You can if you implement your own stack type system in global memory…
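To make the "own stack" idea concrete, here is a minimal host-side C++ sketch that evaluates a small equation tree without any recursive calls. The `Node` layout, the `Op` set and the function name are all hypothetical, chosen only for illustration; a device version would presumably swap the `std::vector` stacks for fixed-size arrays in global or local memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node layout for an equation tree stored in a flat array.
enum class Op { Const, Add, Mul };

struct Node {
    Op op;
    double value;  // used only when op == Op::Const
    int left;      // index of the left child, -1 for leaves
    int right;     // index of the right child, -1 for leaves
};

// Evaluate the tree rooted at `root` without recursion.
// `pending` replaces the call stack; `values` holds finished sub-results.
double evaluate_iterative(const std::vector<Node>& nodes, int root) {
    struct Frame { int node; bool children_done; };
    std::vector<Frame> pending{{root, false}};
    std::vector<double> values;

    while (!pending.empty()) {
        Frame frame = pending.back();
        pending.pop_back();
        const Node& n = nodes[static_cast<std::size_t>(frame.node)];

        if (n.op == Op::Const) {
            values.push_back(n.value);
        } else if (!frame.children_done) {
            // Revisit this node once both children have been evaluated.
            pending.push_back({frame.node, true});
            pending.push_back({n.right, false});
            pending.push_back({n.left, false});
        } else {
            assert(values.size() >= 2);
            double rhs = values.back(); values.pop_back();
            double lhs = values.back(); values.pop_back();
            values.push_back(n.op == Op::Add ? lhs + rhs : lhs * rhs);
        }
    }
    assert(values.size() == 1);
    return values.back();
}
```

The two vectors do the job the call stack would do in a recursive version, which is why the result is correct but, as noted above, not especially pretty.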
I’d be interested in knowing others’ views on this - email the normal place or comment here.
To all the regular readers of the blog - is anyone else amazed by the absolute explosion of GPU / CUDA related code / products / hardware? Very exciting indeed!
Tags: CUDA, GPU, Turing, Turing Complete
Posted in CUDA | No Comments »
Friday, March 19th, 2010
Finally the promised screenshot

SPH with symmetry
It’s not all that impressive to look at as I’ve restricted all the particles to 2d although it does use 3d calculations. I do this to help look for any issues in the code as I find it hard to spot errors in a 3d particle rendering.
This particular screenshot has 64000 particles that have been dropped into the box in a column formation and are now starting to slosh around at the bottom.
The unusual thing with regards to a CUDA implementation is that it uses symmetry in the interactions, thereby decreasing the memory / processing load. I’ve still got more work to do but it’s showing a lot of promise for running superfast particle interaction simulations.
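One common reading of "symmetry in the interactions" is that each particle pair is evaluated once and the result is applied with opposite signs to both particles. The host-side C++ sketch below shows only that idea; the `pair_force` function is a made-up stand-in rather than a real SPH kernel, and nothing here is taken from the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Hypothetical pair force on particle i due to particle j: a repulsion that
// fades to zero at radius h. Any force with f(i, j) == -f(j, i) would do.
Vec3 pair_force(const Vec3& pi, const Vec3& pj, double h) {
    Vec3 d{pi.x - pj.x, pi.y - pj.y, pi.z - pj.z};
    double r2 = d.x * d.x + d.y * d.y + d.z * d.z;
    if (r2 >= h * h) return Vec3{0.0, 0.0, 0.0};
    double s = h * h - r2;
    return Vec3{d.x * s, d.y * s, d.z * s};
}

// Visit each unordered pair (i, j) once and apply equal and opposite forces,
// so the pair force is evaluated n(n-1)/2 times instead of n(n-1) times.
std::vector<Vec3> accumulate_symmetric(const std::vector<Vec3>& pos, double h) {
    assert(h > 0.0);
    std::vector<Vec3> force(pos.size(), Vec3{0.0, 0.0, 0.0});
    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = i + 1; j < pos.size(); ++j) {
            Vec3 f = pair_force(pos[i], pos[j], h);
            force[i].x += f.x; force[i].y += f.y; force[i].z += f.z;
            force[j].x -= f.x; force[j].y -= f.y; force[j].z -= f.z;
        }
    }
    return force;
}
```

On a GPU the write to `force[j]` is the awkward part, since several threads may want to update the same particle at once; how that is resolved is not covered by this sketch.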
I’ve also been doing a bit of work on the second version of my raytracer. I’ve once again stepped away from KD-trees and octrees and am using a type of BVH, ray marching system. Screenshots once I have a decent scene rendered.
In other news I’m now compiling all my new C++/CUDA code in 64-bit with the CUDA 3.0 beta. Although I think putting C++ object support into CUDA was a mistake, the new version does produce decent code.
Tags: CUDA, ray tracing, SPH
Posted in CFD, CUDA, Development | No Comments »
Friday, February 26th, 2010
Nearly 3 months since my last post
Work has been exceptionally busy: In the last two months on top of my normal product maintenance and improvement duties I have prepared and filed a patent application, architected and largely completed a distributed, resilient document processing framework and found a bit of time to eat and sleep!
I’ve noticed other blogs in the raytracing / graphics / visualization space have been very quiet lately - maybe everyone else is also working like crazy?
Not a huge amount has happened in my raytracer and SPH projects, although I got some interesting effects running with a non-uniform mass particle system when I had time over Christmas. Screenshots soon.
I do have the beta release of Nexus (the NVidia Visual Studio plugin) but sadly it only runs on Windows Vista or Windows 7 which leads nicely on to my next point:
I am a bit irritated with Microsoft for two reasons. Firstly, even though I purchased 64-bit Windows XP Professional about 6 or 8 months ago, there is no upgrade path to Windows 7… Secondly, even though Visual Studio 2008 Standard has a switch for OpenMP, it doesn’t contain the OpenMP headers. Only the more expensive Professional version does. Not something that was immediately obvious from the documentation before I purchased…
Although I also run Linux (CentOS) I prefer to develop in a Windows GUI - less buggy and more responsive than GNOME / KDE in my opinion. For running code the Linux OS does usually win though! I would really like to run Nexus so am a bit stuck about what to do…. Succumb, buy Windows 7 and get Nexus on Visual Studio? Or just forget entirely about the Windows development environment and use Linux / gcc / Intel compilers instead? While the Intel compilers are great (if a bit expensive), for an IDE I really do like Visual Studio. Most of my code is cross platform and for graphics I mostly use OpenGL, so I could switch without too much trouble… But DirectCompute is so tempting…..
Arrrrgh what to do!
Tags: Centos, CUDA, Linux, Nexus, openMP, Visual Studio, Windows 7
Posted in Development, Uncategorized | 2 Comments »
Thursday, October 1st, 2009
I’ve just completed reading the white paper released by NVidia, which you can find here.
Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.
Concurrent kernel execution and faster context switching will allow, with the use of DAGs, execution to be optimized on the device itself, rather than just working out the most efficient order in which to execute kernels sequentially.
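As a rough illustration of what a DAG buys you here, the host-side C++ sketch below groups kernels into "waves": every kernel in a wave depends only on kernels from earlier waves, so the members of one wave are candidates for concurrent execution. The function name and the dependency-list representation are assumptions made for this example; it is not taken from the white paper or from any NVidia API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// deps[k] lists the kernels that must finish before kernel k may start.
// Returns, for each kernel, the earliest wave it can run in. Kernels that
// share a wave do not depend on each other. Assumes deps describes a DAG.
std::vector<int> schedule_waves(const std::vector<std::vector<int>>& deps) {
    const std::size_t n = deps.size();
    std::vector<int> wave(n, -1);  // -1 means "not scheduled yet"
    std::size_t assigned = 0;

    for (int current = 0; assigned < n; ++current) {
        // Collect every unscheduled kernel whose dependencies are all done.
        std::vector<std::size_t> ready;
        for (std::size_t k = 0; k < n; ++k) {
            if (wave[k] != -1) continue;
            bool ok = true;
            for (int d : deps[k]) {
                if (wave[static_cast<std::size_t>(d)] == -1) { ok = false; break; }
            }
            if (ok) ready.push_back(k);
        }
        assert(!ready.empty());  // an empty wave would mean a cycle
        for (std::size_t k : ready) wave[k] = current;
        assigned += ready.size();
    }
    return wave;
}
```

A purely sequential schedule would simply run the kernels in any topological order; the wave numbers make the available overlap explicit.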
Also tucked away in the white paper is a mention of predication at the instruction level, which should give greater control of divergent paths in your kernels.
The inclusion of C++ support will appeal to a lot of people, but I am rather unconvinced this is the correct way to go for throughput computing, as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.
There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.
All in all it’s a wonderful development and I can’t help feeling that computing took a substantial leap forward today.
Tags: CUDA, Fermi, NVidia, NVidia Fermi
Posted in CUDA, Uncategorized | No Comments »
Monday, June 22nd, 2009
Kernels of course!
Most of the readers of this blog should be familiar with a “Gold” kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.
Personally I like most of my gold kernels to be naive implementations of an algorithm. This makes them easy to verify and usually easy to debug if there is a problem.
If you don’t currently implement a Gold kernel before writing your CUDA implementations and/or adapting your algorithm, I strongly suggest you do.
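For anyone who hasn’t built one, a Gold kernel plus its check can be as small as the C++ sketch below. SAXPY is used purely as a stand-in algorithm, and the function names and the choice of a maximum absolute error test are assumptions for this example rather than a prescribed method.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive reference ("gold") implementation: out[i] = a * x[i] + y[i].
std::vector<float> saxpy_gold(float a, const std::vector<float>& x,
                              const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
    return out;
}

// Largest absolute difference between a candidate output and the gold output.
float max_abs_error(const std::vector<float>& gold,
                    const std::vector<float>& candidate) {
    assert(gold.size() == candidate.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < gold.size(); ++i)
        worst = std::max(worst, std::fabs(gold[i] - candidate[i]));
    return worst;
}

// Regression check: the candidate passes if it stays within the tolerance.
bool matches_gold(const std::vector<float>& gold,
                  const std::vector<float>& candidate, float tolerance) {
    return max_abs_error(gold, candidate) <= tolerance;
}
```

In practice `candidate` would be the buffer copied back from the GPU, and the tolerance would be picked per algorithm.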
The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.
A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible, making sure we don’t vary from the Gold kernel’s implementation too much - only unwinding the loops into grid / blocks is allowed where possible. I use this type of kernel when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel, it becomes the new baseline kernel before later optimizations. This allows exact matching of later kernel outputs rather than using an “acceptable deviation” approach.
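The "unwinding the loops into grid / blocks" step can be pictured with the host-side C++ sketch below, where a plain Gold loop is rewritten so that the loop index is rebuilt from a block index and a thread index. The names are hypothetical and the "launch" is just two nested loops standing in for a real kernel launch, so treat it as an emulation of the indexing only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gold: a plain loop over every element.
void scale_gold(std::vector<float>& data, float factor) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] *= factor;
}

// Silver-style body: the loop index is rebuilt from block and thread
// indices, as it would be inside a CUDA kernel.
void scale_silver_body(std::vector<float>& data, float factor,
                       std::size_t block_idx, std::size_t block_dim,
                       std::size_t thread_idx) {
    std::size_t i = block_idx * block_dim + thread_idx;
    if (i < data.size()) data[i] *= factor;  // guard the partial last block
}

// Host-side stand-in for a kernel launch: visits every (block, thread) pair.
void scale_silver(std::vector<float>& data, float factor, std::size_t block_dim) {
    assert(block_dim > 0);
    std::size_t grid_dim = (data.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < grid_dim; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            scale_silver_body(data, factor, b, block_dim, t);
}
```

Because the body is otherwise untouched, any difference between the two outputs on real hardware comes from the device arithmetic rather than from a changed algorithm.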
(more…)
Tags: CUDA, CUDA debugging, CUDA large data sets, gold kernel, Gold Silver Bronze kernels
Posted in CUDA | No Comments »
Wednesday, June 10th, 2009
As many people have noticed the same code executed in Emulator mode gives different floating point results from the kernels run in Debug or Release mode.
Although I know what causes this I have never bothered to investigate the actual differences as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results on the CPU<->GPU and wrote some code to change the FPU settings. Firstly a quick explanation:
By default the CPU (FPU) is set to use 80-bit floating point internally. This means that when you load an integer (fild) or a single / double float (fld) it gets converted to an 80-bit number on the FPU stack. All operations are performed internally at 80 bits, and when the result is stored (fst / fstp) it is converted back to the correct floating point width (single / double).
This method of operation is desirable as it reduces the effect of rounding / truncating on the intermediate results. Of course while very useful for computing on the CPU this is not how the CUDA devices operate.
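The effect of wider intermediates is easy to reproduce in plain C++, without touching the FPU control word. The sketch below is only an analogy: it sums the same floats once with a `float` accumulator and once with a `long double` accumulator (which is an 80-bit type with some compilers and a plain 64-bit double with others), rounding back to `float` only at the end. The function names are made up for the example.

```cpp
#include <vector>

// Accumulate in single precision: every partial sum is rounded to float.
float sum_narrow(const std::vector<float>& xs) {
    float acc = 0.0f;
    for (float x : xs) acc += x;
    return acc;
}

// Accumulate in a wider type and round once at the end, loosely mimicking
// an FPU that keeps its intermediate results at higher precision.
float sum_wide(const std::vector<float>& xs) {
    long double acc = 0.0L;
    for (float x : xs) acc += x;
    return static_cast<float>(acc);
}
```

Feeding both functions the values `1e8f, 1.0f, -1e8f` shows the problem: with true single precision intermediates the `1.0f` is lost when it is added to `1e8f`, whereas the wide accumulator keeps it.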
(more…)
Tags: Comparing CUDA results, CUDA, CUDA emulator, CUDA Numerical Stability, doubles, floats, FPU, FPU precision, FPU rounding, Numerical Stability
Posted in CUDA, Development, Maths, Uncategorized | 11 Comments »
Saturday, May 16th, 2009
In my post a few days ago I mentioned that some of the numbers reported by the Visual Profiler needed further investigation.
In particular the device to device memory bandwidth reported by the profiler differed from the value reported by the bandwidth test sample. This was easily tracked down: according to the documentation, the profiler divides the bytes read/written by 10^9, whereas the bandwidth test sample divides by (time in milliseconds * 2^20). So: 1000*2^20/10^9 = 1.048576, i.e. the bandwidth sample reports a lower number.
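The arithmetic can be checked with a couple of lines of C++. The two functions below are a sketch of the formulas as described above, with made-up names; they are not copied from the profiler or from the sample’s source.

```cpp
#include <cassert>

// Figure as the profiler documentation describes it: bytes / 10^9, per second.
double profiler_gb_per_s(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / 1e9 / seconds;
}

// Figure in the style of the bandwidth test sample: bytes divided by
// (milliseconds * 2^20), scaled by 1000 to give a per-second value.
double sample_mb_per_s(double bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return 1000.0 * bytes / (milliseconds * 1048576.0);
}
```

For 10^9 bytes moved in one second, `profiler_gb_per_s(1e9, 1.0)` gives 1.0, while `sample_mb_per_s(1e9, 1000.0) / 1000.0` gives roughly 0.954, and the ratio between the two is the 1.048576 factor.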
(more…)
Tags: Coalescing, Compute Capability, CUDA, Memory Access Patterns, memory performance, overall memory throughput, Tesla C1060, Uncoalesced Reads, Uncoalesced Writes, Visual Profiler
Posted in CUDA | No Comments »
Tuesday, May 12th, 2009
I managed to get some time last night to convert my LBM implementation to CUDA. It’s far from optimal at the moment. Here is a screenshot showing the lid-driven cavity with two obstacles, one horizontal and one vertical, in the box. The visualization is from the middle plane. The obstacles are not shaded but are easily seen in the picture as areas of black. This image was taken after 16000 timeslices, and some nice stable vortices have developed.

D3Q19 - lid driven cavity with obstacles
Once again the importance of making a gold kernel cannot be overstated, as I had quite a number of bugs in my initial CUDA implementation. I use a “multi-tap” type approach when debugging, where the kernels write intermediate results to device memory as they go along. These can easily be compared to data coming from the various stages of the gold kernel, which makes it much easier to identify the source of an error. Keep in mind the CUDA floats will generally not match the CPU’s floats exactly (the CPU uses 80 bits to compute intermediate results).
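A minimal sketch of the comparison side of the “multi-tap” idea, in host-side C++, might look like the following. It assumes the intermediate buffers from the gold kernel and from the device have already been collected stage by stage; the `Stage` alias, the function name and the tolerance test are all choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Stage = std::vector<float>;  // one intermediate buffer ("tap")

// Returns the index of the first stage whose values differ from the gold
// stage by more than `tolerance`, or -1 if every stage matches.
int first_divergent_stage(const std::vector<Stage>& gold,
                          const std::vector<Stage>& device,
                          float tolerance) {
    assert(gold.size() == device.size());
    for (std::size_t s = 0; s < gold.size(); ++s) {
        assert(gold[s].size() == device[s].size());
        for (std::size_t i = 0; i < gold[s].size(); ++i) {
            if (std::fabs(gold[s][i] - device[s][i]) > tolerance)
                return static_cast<int>(s);
        }
    }
    return -1;
}
```

Knowing the first stage that goes wrong usually narrows the hunt to one kernel or one section of a kernel, since everything after it is likely to be a knock-on effect.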
Tags: CUDA, D3Q19, gold kernel, Lattice Boltzmann Method, LBM, Lid Driven Cavity
Posted in CFD, CUDA | No Comments »
Thursday, May 7th, 2009
After playing with the D2Q9 lattice mentioned in a previous post I felt I’d learned enough to progress to the wonderful 3D world.
The arrival of my Tesla has also given me more processing power to move into three dimensions.
So far, in order to get data I can compare the CUDA kernels against, I have come up with a CPU gold kernel. For the first test I have constructed a box with 5 sides and a constantly moving flow in the top layer of the box. I am only simulating incompressible flow at the moment.
I have made a very simple OpenGL viewer that can show me either the entire box or a plane through it. The direction of the velocity vectors is indicated both by their colour and by the direction of the line, while the magnitude of the velocity is indicated by the length of the line (that’s why some spill over the edges of the box). The image below shows a section through the middle of the box.

Visualization from the gold kernel
The next step is to construct the CUDA kernels and compare their outputs. I’m hoping for a massive increase in lattice update speed whilst maintaining numerical stability. The gold kernel uses doubles for reference purposes.
Tags: CUDA, D2Q9, D3Q19, Lattice Boltzmann Method
Posted in CFD, CUDA, Development | No Comments »
Wednesday, April 29th, 2009
As promised I received my Visual Studio 2008 on Friday from polyhedron and set about installing it on Saturday. This post will describe how to get VS 2008 to work with CUDA. I use CUDA 2.2, which is still under NDA, so I won’t be making any comments about its performance improvements today. Rather I will describe how to set up syntax highlighting, building and Intellisense.
(more…)
Tags: CUDA, CUDA Intellisense, Intellisense, Syntax Highlighting, Visual Studio 2008
Posted in CUDA, Development, Uncategorized | 17 Comments »