Archive for the ‘CUDA’ Category
Tuesday, June 22nd, 2010
No apologies for the long delays between posts, or even checking the blog. It just has to fit in with life at the moment and there is so much going on!
Those of you who left comments on the blog or emailed me should have had an answer last night or this morning, after a bit of a delay for some of you. I do apologise to those of you who sent an email and haven’t had a reply. The computer that used to handle all my email died and I haven’t had a chance to fix it yet. I think it’s just the power supply… hopefully! So I will get those emails back in a few weeks when I eventually get around to fixing it.
I’ve subsequently upgraded my email server - so any forthcoming mails will get to me.
In development work I’ve been working on my SPH simulations, and some GP stuff whenever I get a chance. GP is traditionally recursive (well, the equation trees are anyway) and it has needed a substantial amount of reworking to run efficiently on the GPU.
Speaking of recursive… in order to be Turing complete (assuming infinite memory for now), do you need to support recursion? Some posters on certain forums seem to think it is needed, but personally I can’t see why. Most recursion can, with a bit of effort, be made iterative, although possibly not in a very pretty or efficient way.
For example, a GPU doesn’t really support recursion*, but I would consider the CUDA / GPU combination Turing complete. Admittedly it is not very efficient in certain cases (a single thread, for example), and again this ignores the infinite memory issue. *You can if you implement your own stack-type system in global memory…
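To make the recursion-versus-iteration point concrete, here is a minimal sketch in Python (not CUDA, and not my actual GP code) that evaluates a small equation tree twice: once recursively and once with an explicit stack, which is the same trick you would use with a hand-rolled stack in global memory. The tuple-based tree layout and the function names are assumptions made up for this illustration.

```python
import operator

# Hypothetical tree layout: a leaf is a number, an inner node is (op, left, right).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_recursive(node):
    """Straightforward recursive evaluation, for reference."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    return OPS[op](eval_recursive(left), eval_recursive(right))

def eval_iterative(root):
    """Same result with no recursion: a post-order walk driven by an explicit stack."""
    work = [(root, False)]   # (node, have its children already been scheduled?)
    values = []              # results of sub-trees evaluated so far
    while work:
        node, expanded = work.pop()
        if not isinstance(node, tuple):
            values.append(node)                    # leaf: its value is itself
        elif expanded:
            right = values.pop()                   # children are done, so combine them
            left = values.pop()
            values.append(OPS[node[0]](left, right))
        else:
            work.append((node, True))              # come back once the children are done
            work.append((node[2], False))          # right child
            work.append((node[1], False))          # left child (popped first)
    return values[0]

tree = ("+", ("*", 2, 3), ("-", 10, 4))
```

The iterative version is longer and less pretty, which is rather the point: nothing is lost in expressive power, only in convenience.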
I’d be interested in knowing others’ views on this - email the normal place or comment here.
To all the regular readers of the blog - is anyone else amazed by the absolute explosion of GPU / CUDA related code / products / hardware? Very exciting indeed!
Tags: CUDA, GPU, Turing, Turing Complete
Posted in CUDA | No Comments »
Friday, March 19th, 2010
Finally the promised screenshot

SPH with symmetry
It’s not all that impressive to look at as I’ve restricted all the particles to 2d although it does use 3d calculations. I do this to help look for any issues in the code as I find it hard to spot errors in a 3d particle rendering.
This particular screenshot has 64000 particles that have been dropped into the box in a column formation and are now starting to slosh around at the bottom.
The unusual thing with regard to a CUDA implementation is that it uses symmetry in the interactions, thereby decreasing the memory / processing load. I’ve still got more work to do but it’s showing a lot of promise for running superfast particle interaction simulations.
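For anyone unsure what symmetry means here, a toy sketch (plain Python, 1D, and not the actual SPH code): the force of particle j on particle i is equal and opposite to the force of i on j, so each pair only needs evaluating once. The `pair_force` function is an invented stand-in for a real SPH kernel, and the cutoff of 1.0 is an arbitrary assumption.

```python
def pair_force(xi, xj):
    """Hypothetical toy interaction: linear repulsion inside a cutoff of 1.0.
    It is antisymmetric: pair_force(a, b) == -pair_force(b, a)."""
    d = xi - xj
    if d == 0.0 or abs(d) >= 1.0:
        return 0.0
    return (1.0 - abs(d)) * (1.0 if d > 0 else -1.0)

def forces_naive(x):
    """Every particle visits every other particle: n * (n - 1) evaluations."""
    forces, evaluations = [0.0] * len(x), 0
    for i in range(len(x)):
        for j in range(len(x)):
            if i != j:
                forces[i] += pair_force(x[i], x[j])
                evaluations += 1
    return forces, evaluations

def forces_symmetric(x):
    """Each pair is evaluated once and applied to both particles: n * (n - 1) / 2 evaluations."""
    forces, evaluations = [0.0] * len(x), 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            f = pair_force(x[i], x[j])
            forces[i] += f      # force on i due to j
            forces[j] -= f      # equal and opposite force on j due to i
            evaluations += 1
    return forces, evaluations

positions = [0.0, 0.4, 0.9, 2.5]
```

On a GPU the awkward part is the second write (`forces[j] -= f`), since many threads may want to update the same particle at once. That needs atomics or a conflict-free ordering of the pairs, and this sketch deliberately sidesteps it.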
I’ve also been doing a bit of work on the second version of my raytracer. I’ve once again stepped away from KD-trees and Octrees and am using a type of BVH / ray marching system. Screenshots once I have a decent scene rendered.
In other news I’m now compiling all my new C++/CUDA code in 64bit with the CUDA 3.0 beta. Although I think putting C++ object support into CUDA was a mistake, the new version does produce decent code.
Tags: CUDA, ray tracing, SPH
Posted in CFD, CUDA, Development | No Comments »
Thursday, October 1st, 2009
I’ve just finished reading the white paper released by NVIDIA, which you can find here.
Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.
Concurrent kernel execution and faster context switching will allow, with the use of DAGs, the optimization of execution on the devices rather than just working out the most efficient order of kernels to execute sequentially.
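As a rough illustration of the DAG idea (host-side bookkeeping only, in Python, and not anything taken from the white paper): if you know which kernels depend on which, you can group them into levels where everything in a level has its inputs ready and could, in principle, run concurrently. The kernel names and the `concurrency_levels` helper are hypothetical.

```python
def concurrency_levels(deps):
    """deps maps each kernel name to the set of kernels it depends on.
    Returns a list of levels; kernels in the same level do not depend on each other."""
    remaining = {kernel: set(needs) for kernel, needs in deps.items()}
    levels = []
    while remaining:
        # Kernels with no unfinished dependencies are ready to launch.
        ready = sorted(k for k, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("dependency cycle: this is not a DAG")
        levels.append(ready)
        for k in ready:
            del remaining[k]
        for needs in remaining.values():
            needs.difference_update(ready)   # mark this level as finished
    return levels

# Hypothetical kernel graph for one simulation step.
kernel_deps = {
    "build_neighbours": set(),
    "clear_forces": set(),
    "density": {"build_neighbours"},
    "forces": {"density", "clear_forces"},
    "integrate": {"forces"},
}
levels = concurrency_levels(kernel_deps)
```

With purely sequential execution all you can do is pick one valid order. With concurrent kernels, each level becomes a batch of candidates to overlap on the device.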
Also tucked away in the white paper is a mention of predication at the instruction level, which should give greater control of divergent paths in your kernels.
The inclusion of C++ support will appeal to a lot of people but I am rather unconvinced this is the correct way to go for throughput computing, as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.
There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.
All in all it’s a wonderful development and I can’t help feeling that computing took a substantial leap forward today.
Tags: CUDA, Fermi, NVidia, NVidia Fermi
Posted in CUDA, Uncategorized | No Comments »
Thursday, July 30th, 2009
There hasn’t been much in the way of posts here lately as I’ve been really busy at work getting some new components built into the systems I work on. It’s not really hard, but it involves frustrating things like trying to get various components and libraries written in different languages to work together. So lately I’ve not had the energy to do much work on the computer once I get home…
There has been a bit of interest in my 3D Gaussian convolution kernels. Although I explained the technique mathematically in an earlier post I never actually posted the code. As it is rather quick and quite a novel way of calculating the convolution for an xy plane I decided to post it so everyone can benefit from / improve upon the technique. As always, comments / bug reports etc are welcome.
(more…)
Tags: Gaussian Convolution, xy plane
Posted in CUDA, Maths | No Comments »
Thursday, June 25th, 2009
Or as this is an HPC / CUDA / parallel processing site:
Gustafson’s Law or Amdahl’s law?
Personally I prefer Gustafson’s Law… it seems more logical to me. Or is this just because I’m inherently an optimist?
I would be quite interested in hearing your views on this - so comments / forum posts are most welcome.
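For reference, here are the two laws side by side as a small Python sketch. The function names are mine, and the 95% parallel fraction and 240 processors are arbitrary numbers picked for illustration. Note that the fraction means slightly different things in each: for Amdahl it is the parallelisable share of a fixed-size job, while for Gustafson it is the parallel share of the time on the parallel machine, with the problem size allowed to grow.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Fixed problem size: the serial part limits the speedup to 1 / (1 - p)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

def gustafson_speedup(parallel_fraction, processors):
    """Scaled problem size: the parallel part grows with the number of processors."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors

p, n = 0.95, 240   # arbitrary illustrative values
print(f"Amdahl:    {amdahl_speedup(p, n):.2f}x")
print(f"Gustafson: {gustafson_speedup(p, n):.2f}x")
```

The pessimist fixes the problem and watches the serial 5% cap the speedup below 20x however many processors are added. The optimist assumes that with more processors you will simply tackle a bigger problem.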
In other news: The thermal monitor downloads have gone over 80! :) The updated version (v0.2) is ready after mucking around with subclassing a control…. I will release it soon…
Otherwise I have been extremely busy debugging a sparse matrix solver - bugs in huge datasets can be hard to find! Even with the aid of the Gold / Silver / Bronze kernels mentioned in the last post they have been proving remarkably tricky to isolate. Rather surprising to me is the fact that the long long data type doesn’t consume a lot more processing time than a normal unsigned int - so I have been using it wherever there is a risk of exceeding 2^32.
CFD code coming soon too - although I have unwound a lot of the optimizations in order to make it easier to understand and possibly be a good foundation for your own optimizations.
Right …. back to the grindstone!
Tags: Amdahl's law, Bronze Kernel, CFD, gold kernel, GPU Thermal Monitor, Gustafson's Law, Silver Kernel, Sparse Matrix
Posted in BV2 Thermal Monitor, CFD, CUDA, Development | No Comments »
Monday, June 22nd, 2009
Kernels of course!
Most of the readers of this blog should be familiar with a “Gold” kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.
Personally I like most of my gold kernels to be naive implementations of an algorithm. This causes them to be easily verifiable and usually easy to debug if there is a problem.
If you currently don’t implement a Gold kernel before writing your CUDA implementations and/or adapting your algorithm I strongly suggest you do.
The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.
A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible, making sure we don’t vary from the Gold kernel’s implementation too much - only unwinding the loops into grid / blocks is allowed where possible. I use this type of kernel when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel, it becomes the new baseline kernel for later optimizations. This allows exact matching of later kernel outputs rather than using an “acceptable deviation” approach.
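The two kinds of comparison look roughly like this. It is a host-side sketch in Python with invented helper names and made-up sample outputs, not my actual test harness: Gold versus Silver is checked within a tolerance, and Silver versus any later optimized kernel is checked bit for bit.

```python
import math
import struct

def matches_within(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """'Acceptable deviation' check, e.g. CPU Gold kernel vs GPU Silver kernel.
    The tolerances are arbitrary placeholders; pick ones that suit your problem."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )

def matches_exactly(reference, candidate):
    """Bitwise check, e.g. Silver kernel vs a later optimized kernel.
    Comparing the packed bytes also distinguishes 0.0 from -0.0."""
    return len(reference) == len(candidate) and all(
        struct.pack("<d", r) == struct.pack("<d", c)
        for r, c in zip(reference, candidate)
    )

# Made-up outputs standing in for real kernel results.
gold_output = [1.0, 2.0, 3.0]
silver_output = [1.0, 2.000001, 3.0]     # close to Gold, but not identical
optimized_output = list(silver_output)   # should reproduce Silver exactly

print(matches_within(gold_output, silver_output))        # tolerance check
print(matches_exactly(silver_output, optimized_output))  # exact check
```

The exact check only makes sense when the optimization does not reorder the floating point operations; a change that does reorder them falls back to the tolerance check against the baseline.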
(more…)
Tags: CUDA, CUDA debugging, CUDA large data sets, gold kernel, Gold Silver Bronze kernels
Posted in CUDA | No Comments »
Wednesday, June 10th, 2009
As many people have noticed, the same code executed in Emulator mode gives different floating point results from the kernels run in Debug or Release mode.
Although I know what causes this I have never bothered to investigate the actual differences, as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results on the CPU<->GPU and wrote some code to change the FPU settings. Firstly a quick explanation:
By default the CPU (FPU) is set to use 80-bit floating point internally. This means that when you load in an integer (fild) or a single / double float (fld) it gets converted to an 80-bit number on the FPU stack. All operations are performed internally at 80 bits and, when the result is stored (fst / fstp), it is converted back to the correct floating point width (single / double).
This method of operation is desirable as it reduces the effect of rounding / truncating on the intermediate results. Of course while very useful for computing on the CPU this is not how the CUDA devices operate.
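The effect is easy to reproduce without touching the FPU control word. The Python sketch below is an analogy rather than the x87 itself: Python’s 64-bit float stands in for the wide internal format, and a `struct` round trip stands in for storing to single precision. The helper names are invented for this example.

```python
import struct

def to_f32(x):
    """Round a Python float (64-bit) to the nearest single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_rounding_each_step(values):
    """Every intermediate result is rounded to single precision,
    roughly how a single precision accumulator on the GPU behaves."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

def sum_wide_then_round(values):
    """Intermediates are kept in the wider format and rounded once at the end,
    roughly what the FPU does when it works internally at extended precision."""
    total = 0.0
    for v in values:
        total += to_f32(v)
    return to_f32(total)

values = [0.1] * 100_000
print(sum_rounding_each_step(values))
print(sum_wide_then_round(values))
```

Both functions see exactly the same single precision inputs, yet the two sums differ noticeably. That is the kind of CPU versus GPU discrepancy to expect from a long accumulation.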
(more…)
Tags: Comparing CUDA results, CUDA, CUDA emulator, CUDA Numerical Stability, doubles, floats, FPU, FPU precision, FPU rounding, Numerical Stability
Posted in CUDA, Development, Maths, Uncategorized | 11 Comments »
Tuesday, May 26th, 2009
I was hoping to release the GPU temperature monitor to the downloads section sometime during this last bank holiday weekend. I had also planned to sort out my somewhat ailing / overheating computer and perform some upgrades. Unfortunately the repair / upgrades took almost 2 full days. My PC is now running a lot cooler and a bit quieter and I am now something of an expert in heat spreader / heat sink cleaning and thermal grease application. By the way, Arctic Silver is really good! Even before the 200 hour break-in period mentioned on their site I’m already seeing more than 10 degrees lower temperature on the CPU.
As to the upgrades: more posts on this later
But it did include me purchasing Windows XP Professional 64bit.
The longer than expected PC maintenance time has impacted the GPU Thermal Monitor application and it won’t be ready for a few more days. A bit of good news about it though: it works perfectly on Windows XP 64bit without any modifications. Don’t you just love asm :) Some of my other C/C++ applications were not quite so happy on the new OS.
Keep watching this space for the release date.
Tags: Artic Silver, GPU Temperature, GPU Temperature Monitor, GPU Thermal Monitor, Windows XP 64bit
Posted in BV2 Thermal Monitor, CUDA, Development, masm | No Comments »
Friday, May 22nd, 2009
(update 27/5/09 - I’ve now released v0.1 - see this post )
It’s been a while since my last post as I’ve been rather busy with work and some other projects - so time for a quick update.
While working on some long running kernels I wanted to keep track of the GPU’s temperature and ran the monitor app that came on the card’s installation disks. The problem with the monitoring apps that I have is that they are quite large and take up a lot of screen space, and for some unknown reason they crash / stop working when viewed over VNC (or a similar remote control app). As I mostly use my CUDA machines over the network this is not a good situation.
So I decided to write my own :) Here is a screenshot of the initial version. It still needs config and some cleaning up (can you spot the glitch by the minimize button?) and possibly some functionality for remote reporting over the network. Currently it can display the temperature from two GPUs and updates every 500ms. When minimized it sits in the system tray and updates its tooltip with the reported temperatures. I will be releasing it to the downloads section of the website sometime over the weekend. The current .exe is 146k (mostly the skin) and doesn’t require any installer.

GPU Thermal Monitoring Application
Nothing wrong with a bit of self-promotion sometimes.
Tags: GPU Temperature, masm, nvapi
Posted in BV2 Thermal Monitor, CUDA, Development, masm | No Comments »
Saturday, May 16th, 2009
In my post a few days ago I mentioned that some of the numbers reported by the Visual Profiler needed further investigation.
In particular the device to device memory bandwidth reported by the profiler differed from the value reported by the bandwidth test sample. This was easily tracked down: according to the documentation, the profiler divides the bytes read / written by 10^9, whereas the bandwidth test sample divides by (time in milliseconds * 2^20). So the two units differ by a factor of 1000 * 2^20 / 10^9 = 1.048576, i.e. for the same transfer the bandwidth sample reports a lower number.
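The arithmetic can be checked in a few lines of Python. The two formulas are written as described above, not copied from the profiler or the SDK sample, and the transfer size and timing are invented numbers.

```python
def profiler_bandwidth(bytes_moved, seconds):
    """Profiler-style figure: bytes per second divided by 10^9."""
    return bytes_moved / seconds / 1e9

def sample_bandwidth(bytes_moved, milliseconds):
    """Bandwidth-test-style figure: bytes divided by (time in ms * 2^20)."""
    return bytes_moved / (milliseconds * 2**20)

# Hypothetical transfer: 64 * 2^20 bytes moved in 1 ms.
bytes_moved = 64 * 2**20
elapsed_ms = 1.0

from_profiler = profiler_bandwidth(bytes_moved, elapsed_ms / 1000.0)
from_sample = sample_bandwidth(bytes_moved, elapsed_ms)
print(from_profiler, from_sample, from_profiler / from_sample)
```

The ratio is the same whatever transfer you measure, so a constant gap of about 4.9% between the two tools is a difference in units and not a difference in performance.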
(more…)
Tags: Coalescing, Compute Capability, CUDA, Memory Access Patterns, memory performance, overall memory throughput, Tesla C1060, Uncoalesced Reads, Uncoalesced Writes, Visual Profiler
Posted in CUDA | No Comments »