Archive for the ‘Development’ Category

Catch up

Tuesday, June 22nd, 2010

No apologies for the long delays between posts, or even checking the blog. It just has to fit in with life at the moment and there is so much going on!

Those of you who have left comments on the blog or sent me emails should have received an answer last night or this morning, after a bit of a delay for some of you. I do apologise to those of you who sent an email and haven’t had a reply. The computer that used to handle all my email died and I haven’t had a chance to fix it yet. I think it’s just the power supply (hopefully!), so I will get the emails back in a few weeks when I eventually get around to fixing it.

I’ve subsequently upgraded my email server - so any forthcoming mails will get to me. 

In development work I’ve been working on my SPH simulations, and some GP stuff whenever I get a chance. GP is traditionally recursive (well, the equation trees are anyway) and has needed a substantial amount of reworking to run efficiently on the GPU.

Speaking of recursive… in order to be Turing complete (assuming infinite memory for now), do you need to support recursion? Some posters on certain forums seem to think it is needed, but personally I can’t see why. With a bit of effort most recursion can be made iterative, although the result is possibly not very pretty or efficient.

For example, a GPU doesn’t really support recursion*, but I would consider the CUDA / GPU combination Turing complete. Admittedly it is not very efficient in certain cases (a single thread, for example), and again this ignores the infinite memory issue.  *You can if you implement your own stack-type system in global memory…
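As a minimal sketch of the "own stack" idea, here is a recursive tree sum next to an iterative one that keeps its pending work on an explicit stack. The Node layout and function names are hypothetical and chosen only for illustration; it is plain C++ rather than device code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree layout: nodes live in one array and refer to their
// children by index (-1 means "no child"), much as they might in GPU memory.
struct Node {
    double value;
    int left;
    int right;
};

// Recursive version: the call stack remembers which nodes are still pending.
double sumRecursive(const std::vector<Node>& tree, int index) {
    if (index < 0) return 0.0;
    const Node& n = tree[static_cast<std::size_t>(index)];
    return n.value + sumRecursive(tree, n.left) + sumRecursive(tree, n.right);
}

// Iterative version: an explicit stack of node indices replaces the call stack.
double sumIterative(const std::vector<Node>& tree, int root) {
    std::vector<int> pending;
    if (root >= 0) pending.push_back(root);
    double total = 0.0;
    while (!pending.empty()) {
        const int index = pending.back();
        pending.pop_back();
        assert(static_cast<std::size_t>(index) < tree.size());
        const Node& n = tree[static_cast<std::size_t>(index)];
        total += n.value;
        if (n.left >= 0) pending.push_back(n.left);
        if (n.right >= 0) pending.push_back(n.right);
    }
    return total;
}
```

Both functions visit every node once. The iterative one simply stores the "where do I go next" information in memory it manages itself, which is the part a GPU version would have to place in global memory.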

I’d be interested in knowing others’ views on this - email the normal place or comment here :)

To all the regular readers of the blog - is anyone else amazed by the absolute explosion of GPU / CUDA related code / products / hardware? Very exciting indeed!

SPH Screenshot

Friday, March 19th, 2010

Finally the promised screenshot :)

SPH with symmetry


It’s not all that impressive to look at, as I’ve restricted all the particles to 2D, although it does use 3D calculations. I do this to help look for issues in the code, as I find it hard to spot errors in a 3D particle rendering.

This particular screenshot has 64000 particles that have been dropped into the box in a column formation and are now starting to slosh around at the bottom.

The unusual thing with regard to a CUDA implementation is that it uses symmetry in the interactions, thereby decreasing the memory / processing load. I’ve still got more work to do, but it’s showing a lot of promise for running superfast particle interaction simulations.
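To illustrate what symmetry in the interactions buys, here is a small CPU-side sketch. It assumes a toy antisymmetric 1D pair force as a stand-in for a real SPH kernel; the names ForceResult, pairForce, forcesNaive and forcesSymmetric are hypothetical, and this is not the CUDA code behind the screenshot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ForceResult {
    std::vector<double> force;  // net force on each particle
    std::size_t evaluations;    // number of pairForce calls made
};

// Toy antisymmetric pair force in 1D: pairForce(a, b) == -pairForce(b, a).
// A stand-in only; a real SPH kernel would go here.
inline double pairForce(double xi, double xj) { return xj - xi; }

// Naive version: every particle evaluates every other particle.
ForceResult forcesNaive(const std::vector<double>& x) {
    ForceResult r{std::vector<double>(x.size(), 0.0), 0};
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < x.size(); ++j) {
            if (i == j) continue;
            r.force[i] += pairForce(x[i], x[j]);
            ++r.evaluations;
        }
    }
    return r;
}

// Symmetric version: each unordered pair is evaluated once and the result is
// applied with opposite signs to both particles.
ForceResult forcesSymmetric(const std::vector<double>& x) {
    ForceResult r{std::vector<double>(x.size(), 0.0), 0};
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = i + 1; j < x.size(); ++j) {
            const double f = pairForce(x[i], x[j]);
            r.force[i] += f;
            r.force[j] -= f;
            ++r.evaluations;
        }
    }
    // Sanity check: exactly one evaluation per unordered pair.
    assert(r.evaluations == x.size() * (x.size() - 1) / 2);
    return r;
}
```

The symmetric loop does half the pair evaluations. On a GPU the write to particle j is the awkward part, since several threads may want to update the same particle; how that is resolved is a design choice this sketch does not cover.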

I’ve also been doing a bit of work on the second version of my raytracer. I’ve once again stepped away from KD-trees and octrees and am using a type of BVH / ray marching system. Screenshots once I have a decent scene rendered :)

In other news I’m now compiling all my new C++/CUDA code in 64-bit with the CUDA 3.0 beta. Although I think putting C++ object support into CUDA was a mistake, the new version does produce decent code.

Poor neglected blog…

Friday, February 26th, 2010

Nearly 3 months since my last post :(

Work has been exceptionally busy: In the last two months on top of my normal product maintenance and improvement duties I have prepared and filed a patent application, architected and largely completed a distributed, resilient document processing framework and found a bit of time to eat and sleep!

I’ve noticed other blogs in the raytracing / graphics / visualization space have been very quiet lately - maybe everyone else is also working like crazy?

Not a huge amount has happened in my raytracer and SPH projects, although I got some interesting effects running with a non-uniform mass particle system when I had time over Christmas. Screenshots soon.

I do have the beta release of Nexus (the NVIDIA Visual Studio plugin), but sadly it only runs on Windows Vista or Windows 7, which leads nicely on to my next point:

I am a bit irritated with Microsoft for two reasons. Firstly, even though I purchased 64-bit Windows XP Professional about 6 or 8 months ago, there is no upgrade path to Windows 7… Secondly, even though Visual Studio 2008 Standard has a switch for OpenMP, it doesn’t contain the OpenMP headers. Only the more expensive Professional version does. That was not immediately obvious from the documentation before I purchased…

Although I also run Linux (CentOS), I prefer to develop on a Windows GUI - less buggy and more responsive than GNOME / KDE in my opinion. For running code Linux does usually win though! I would really like to run Nexus, so I am a bit stuck about what to do… Succumb, buy Windows 7 and get Nexus on Visual Studio? Or forget about the Windows development environment entirely and use Linux with gcc / the Intel compilers instead? The Intel compilers are great (if a bit expensive), but as an IDE I really do like Visual Studio. Most of my code is cross-platform and for graphics I mostly use OpenGL, so I could switch without too much trouble… But DirectCompute is so tempting…

Arrrrgh what to do!

C / C++ and STL

Wednesday, October 28th, 2009

Before everyone gets really upset with the rest of this post, as is the trend in the OO community… I thought I’d start, rather than end, with a disclaimer: I use C++ and the STL on a daily basis in my job. Although I don’t use all of what the STL has to offer, it does make coding in C++ much easier. C++ itself allows fairly elegant code (if constructed carefully) whilst providing a decent level of performance. So I do actually like C++ and the STL, and they make my life at work much better :)

But this blog isn’t about my day job….  It’s about my tinkering with the wonderful world of parallel algorithms and CUDA code.

What a lot of people don’t realize is that you *can* use the STL, C++ classes and templates in a .cu file. As long as it’s client-side (host) code you should be fine. I’ve had a few compiler crashes when using the STL, especially the sort. To fix this I overloaded the < operator in the class being sorted; don’t try to define a separate custom < method, as it will crash the compiler.
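A minimal sketch of the operator< approach in host-side code follows. The Particle struct, its fields and sortByCell are hypothetical names used only for illustration, and nothing here has been checked against the compiler crash mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side record that needs sorting before upload to the GPU.
struct Particle {
    int cell;    // sort key
    float mass;

    // std::sort uses this overload when no comparator is passed.
    bool operator<(const Particle& other) const { return cell < other.cell; }
};

// Sort particles by cell using the class's own operator<.
void sortByCell(std::vector<Particle>& particles) {
    std::sort(particles.begin(), particles.end());
    assert(std::is_sorted(particles.begin(), particles.end()));
}
```

Calling sortByCell on a vector of particles orders them by cell without passing a comparison function to std::sort.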

(more…)

Amdahl’s law

Monday, October 19th, 2009

A few months ago I made a post mentioning how I don’t conform to the Amdahl’s law way of thinking but never went into any details.

The law describes the speedup that can be obtained if you can parallelize a section of your problem. The speedup that can be obtained is described by the following equation:

{\frac{1}{(1-P)+\frac{P}{S}}}

Where P is the proportion of the problem that can be parallelized / sped up and S is the speedup amount.

Assuming that S -> infinity, then P/S -> 0, which leaves us with {\frac{1}{(1-P)}}

This implies that no matter how many processors we add or speed improvements we make to the P portion of the problem, we can never do better than {\frac{1}{(1-P)}}. The biggest % improvement from the baseline comes at low values of S (or relatively low numbers of parallel processors). This result is observed in the field time and again: throwing more than 4 or 8 processors at a problem very seldom adds gains as large as those from the first 2 or 4 processors.
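The diminishing returns are easy to see numerically. Below is a small sketch of the single-term equation and its limit; the function names amdahlSpeedup and amdahlLimit are made up for this example.

```cpp
#include <cassert>

// Overall speedup when a fraction p of the work is accelerated by a factor s.
double amdahlSpeedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound on the speedup as s grows without limit (requires p < 1).
double amdahlLimit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

Taking P = 0.9 as an example, S = 2 gives roughly 1.82, S = 4 roughly 3.08, S = 8 roughly 4.71 and S = 1000 only about 9.91, against a limit of 10.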

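This equation does expand with multiple P and associated S terms in order to describe a more complex / lengthy problem (P1+P2+P3 = 100%). Each section’s share of the run time is divided by its own speedup and the results are summed in the denominator:

{\frac{1}{\frac{P1}{S1}+\frac{P2}{S2}+\frac{P3}{S3}}}

With just two sections, P1 = 1-P at S1 = 1 and P2 = P at S2 = S, this reduces to the single-term equation above.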

Certain problems where P is large do respond well to an increase in processors. These are known as “embarrassingly parallel”; ray tracing is rather a good example of this.

 

So why do I not agree with this if the equation makes sense?

The assumption that only P areas can be accelerated by S and strung together in a serial fashion is rather simplistic.

Why do we have to finish P1 before beginning P2? Even if the P2 area has dependencies on P1, it’s rare for the entire section of P2 to depend on a single result (of course there are cases - reduction kernels etc.).

Maybe P3 can overlap P1 and P2. Some sections may benefit from having more processors while others may reach their optimum at two. Why not overlap the sections and supply each with its optimal processing power? This is easy to achieve with Directed Acyclic Graphs (DAGs) and can even be computed on the “fly”, although they do get rather large!
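As a rough sketch of the overlap argument, the code below compares running sections back to back with running them as a dependency graph. It assumes there are enough processors for every ready section to start immediately and that each duration already reflects that section's own best processor count; the Task struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One section of the problem. deps lists the sections that must finish first.
// Sections are assumed to be stored in topological order.
struct Task {
    double duration;
    std::vector<std::size_t> deps;
};

// Total time when the sections run strictly one after another.
double serialTime(const std::vector<Task>& tasks) {
    double total = 0.0;
    for (const Task& t : tasks) total += t.duration;
    return total;
}

// Total time when each section starts as soon as its dependencies finish:
// the length of the longest path through the DAG.
double overlappedTime(const std::vector<Task>& tasks) {
    std::vector<double> finish(tasks.size(), 0.0);
    double makespan = 0.0;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        double start = 0.0;
        for (std::size_t d : tasks[i].deps) {
            assert(d < i);  // dependencies must come earlier in the list
            start = std::max(start, finish[d]);
        }
        finish[i] = start + tasks[i].duration;
        makespan = std::max(makespan, finish[i]);
    }
    return makespan;
}
```

For three sections of 4, 3 and 5 time units where only the second depends on the first, the serial schedule takes 12 units while the overlapped one takes 7.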

Quoting Amdahl’s law as a reason why no further speed benefits are available in a system is really just showing that thinking is still stuck in serial mode with little bursts of parallelism thrown in. Let’s start thinking parallel in all areas and make the most of all available compute resources.

Fermi

Thursday, October 1st, 2009

I’ve just completed reading the white paper released by nvidia which you can find here.

Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.

Concurrent kernel execution and faster context switching will allow, with the use of DAGs, the optimization of execution on the devices rather than just working out the most efficient order of kernels to execute sequentially.

Also tucked away in the white paper is a mention of predication at the instruction level, which should give greater control of divergent paths in your kernels.

The inclusion of C++ support will appeal to a lot of people, but I am rather unconvinced this is the correct way to go for throughput computing, as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.

There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.

All in all it’s a wonderful development and I can’t help feeling that computing took a substantial leap forward today.

3D Gaussian Convolution

Thursday, July 30th, 2009

There hasn’t been much in the way of posts here lately as I’ve been really busy at work getting some new components built into the systems I work on. The work is not really hard, but it involves frustrating things like trying to get various components and libraries written in different languages to work together. So lately I’ve not had the energy to do much work on the computer once I get home…

There has been a bit of interest in my 3D Gaussian convolution kernels. Although I explained the technique mathematically in an earlier post I never actually posted the code. As it is rather quick and quite a novel way of calculating the convolution for an xy plane, I decided to post it so everyone can benefit from / improve upon the technique. As always, comments / bug reports etc. are welcome :)

(more…)

Glass half full or half empty?

Thursday, June 25th, 2009

Or as this is a hpc/cuda/parallel processing site:

Gustafson’s Law or Amdahl’s law?

Personally I prefer Gustafson’s Law …. it seems more logical to me or is this just because I’m inherently an optimist?
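For reference, the two laws can be written side by side. This is a sketch assuming P is the parallel fraction and N is the number of processors (N is notation introduced here for illustration).

Amdahl, fixed problem size:

{\frac{1}{(1-P)+\frac{P}{N}}}

Gustafson, problem size scaled with N:

{(1-P)+P \cdot N}

With P = 0.9 and N = 8 the Amdahl form gives about 4.7 while the Gustafson form gives 7.3. The difference lies in the assumption: Amdahl keeps the workload fixed, whereas Gustafson lets the parallel part of the workload grow with the number of processors and measures P on the parallel system.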

I would be quite interested in hearing your views on this - so comments / forum posts are most welcome.

In other news:  The thermal monitor downloads have gone over 80! :)  The updated version (v0.2) is ready after mucking around with subclassing a control…. I will release it soon…

Otherwise I have been extremely busy debugging a sparse matrix solver - bugs in huge datasets can be hard to find! Even with the aid of the Gold / Silver / Bronze kernels mentioned in the last post, the bugs have been proving remarkably tricky to isolate. Rather surprising to me is the fact that the long long data type doesn’t consume a lot more processing time than a normal unsigned int - so I have been using it wherever there is a risk of exceeding 2^32.
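A tiny sketch of the 2^32 risk, assuming a hypothetical rows x cols element count (the function names are made up for this example): the 32-bit product silently wraps, while widening one operand before multiplying keeps the full value.

```cpp
#include <cstdint>

// Element count computed entirely in 32 bits: wraps modulo 2^32 on overflow.
std::uint32_t elementCount32(std::uint32_t rows, std::uint32_t cols) {
    return rows * cols;
}

// Element count computed in 64 bits: widen one operand before multiplying.
std::uint64_t elementCount64(std::uint32_t rows, std::uint32_t cols) {
    return static_cast<std::uint64_t>(rows) * cols;
}
```

For a 70000 x 70000 matrix the 64-bit version returns 4900000000, while the 32-bit version wraps around to 605032704.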

CFD code coming soon too - although I have unwound a lot of the optimizations in order to make it easier to understand and possibly be a good foundation for your own optimizations.

Right …. back to the grindstone!

Gold, Silver and Bronze

Monday, June 22nd, 2009

Kernels of course! :)

Most of the readers of this blog should be familiar with a “Gold” kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.

Personally I like most of my Gold kernels to be naive implementations of an algorithm. This makes them easily verifiable and usually easy to debug if there is a problem.

If you currently don’t implement a Gold kernel before writing your CUDA implementations and/or adapting your algorithm, I strongly suggest you do.

The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.

A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible, making sure we don’t vary from the Gold kernel’s implementation too much - only unwinding the loops into grid / blocks is allowed where possible. I use this type of kernel when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel, it becomes the new baseline kernel before later optimizations. This allows exact matching of later kernel outputs rather than using an “acceptable deviation” approach.
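The two kinds of check described above could look something like the sketch below. It is host-side C++ that assumes the kernel outputs have already been copied back into std::vector<float>; the function names withinTolerance and exactMatch are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Gold vs Silver: accept the output if every element is within an
// "acceptable deviation" of the CPU reference. A NaN fails the check.
bool withinTolerance(const std::vector<float>& reference,
                     const std::vector<float>& candidate,
                     float tolerance) {
    if (reference.size() != candidate.size()) return false;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (!(std::fabs(reference[i] - candidate[i]) <= tolerance)) return false;
    }
    return true;
}

// Silver vs later GPU kernels: both ran on the same hardware, so the
// outputs are expected to match element for element.
bool exactMatch(const std::vector<float>& baseline,
                const std::vector<float>& candidate) {
    return baseline == candidate;
}
```

The tolerance itself is problem specific. exactMatch compares values with ==, so an output containing NaN will never match, which is usually what you want in a regression test.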

(more…)

CUDA Emulator Output

Wednesday, June 10th, 2009

As many people have noticed the same code executed in Emulator mode gives different floating point results from the kernels run in Debug or Release mode.

Although I know what causes this I have never bothered to investigate the actual differences as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results on the CPU<->GPU and wrote some code to change the FPU settings. Firstly a quick explanation:

By default the CPU (FPU) is set to use 80-bit floating point internally. This means that when you load in an integer (fild) or a single / double float (fld) it gets converted to an 80-bit number inside the FPU stack. All operations are performed internally at 80 bits, and when storing the result (fst / fstp) it converts back to the correct floating point width (single / double).

This method of operation is desirable as it reduces the effect of rounding / truncating on the intermediate results.  Of course while very useful for computing on the CPU this is not how the CUDA devices operate.
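The effect of intermediate width can be sketched without touching the FPU control word. In the example below a double accumulator stands in for the wider internal format (an assumption made for portability; it is not the 80-bit x87 format itself), and a volatile float forces rounding to single precision after every add. The function names are hypothetical.

```cpp
#include <cstddef>

// Sum n copies of x, rounding the running total to single precision
// after every addition.
float sumNarrowIntermediate(float x, std::size_t n) {
    volatile float total = 0.0f;  // volatile forces a store (and rounding) each step
    for (std::size_t i = 0; i < n; ++i) {
        total = total + x;
    }
    return total;
}

// Sum n copies of x with a wider intermediate, rounding to single
// precision only once at the end.
float sumWideIntermediate(float x, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += static_cast<double>(x);
    }
    return static_cast<float>(total);
}
```

Summing 0.1f ten million times, the wide version lands very close to 1000000 while the narrow version drifts away by tens of thousands. Same inputs, same operations, different intermediate rounding.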

(more…)