Wednesday 28 October 2009

C / C++ and STL

Before everyone gets really upset with the rest of this post, as is the trend in the OO community... I thought I'd start, rather than end, with a disclaimer: I use C++ and the STL on a daily basis in my job. Although I don't use everything the STL has to offer, it does make coding in C++ much easier. C++ itself allows fairly elegant code (if constructed carefully) while providing a decent level of performance. So I do actually like C++ and the STL, and they make my life at work much better :)

But this blog isn't about my day job....  It's about my tinkering with the wonderful world of parallel algorithms and CUDA code.

What a lot of people don't realize is that you *can* use the STL, C++ classes and templates in a .cu file. As long as it's client-side (host) code you should be fine. I've had a few compiler crashes when using the STL, especially with the sort. The fix was to use an overloaded < operator in the class being sorted; don't try to define a custom < method, as it will crash the compiler.
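To make that concrete, here is a minimal sketch of the pattern: a small class with operator< overloaded, sorted with std::sort in ordinary host-side code that can live in a .cu file. The Particle class and its energy field are made up purely for illustration.

// Host-side C++ in a .cu (or .cpp) file: std::sort driven by an overloaded operator<.
#include <algorithm>
#include <cstdio>
#include <vector>

class Particle {
public:
    explicit Particle(float e) : energy(e) {}

    // Overloading operator< on the class itself is the pattern described above;
    // a separate custom comparison method was what upset the compiler.
    bool operator<(const Particle& other) const { return energy < other.energy; }

    float energy;
};

int main()
{
    std::vector<Particle> particles;
    particles.push_back(Particle(3.0f));
    particles.push_back(Particle(1.0f));
    particles.push_back(Particle(2.0f));

    std::sort(particles.begin(), particles.end());   // uses Particle::operator<

    for (size_t i = 0; i < particles.size(); ++i)
        std::printf("%f\n", particles[i].energy);

    return 0;
}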

Tuesday 27 October 2009

GPU Temperature Monitor

As of writing the combined download count of the GPU Thermal Monitor has hit 520 :)

So far I'm yet to receive any major feedback on bugs etc., which leads me to believe that either: a) it works perfectly, or b) no one is bothering to report issues. As I'm an optimist, I'm going with option a :)

I've had more requests for remote monitoring of the GPU temperature via a simple HTTP request. This is something I need myself in order to keep track of temperatures on remote machines. It is now built in and currently in testing and bug fixing, hopefully to be released soon. I've not used completion ports, as they seemed like overkill for what should be a light-traffic application, but as the source is included and under a Creative Commons licence, please feel free to add them if needed. Secondly, having it open source allows for some code review, which is important for security reasons now that it accepts remote connections.
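For anyone curious what "simple HTTP request, no completion ports" means in practice, here is a rough sketch of that style of endpoint: a plain blocking Winsock loop that answers one request at a time with a text response. This is not the monitor's actual source; ReadGpuTemperature() and the port number are placeholders.

// Minimal blocking HTTP endpoint sketch for a light-traffic temperature query.
#include <winsock2.h>
#include <sstream>
#include <string>
#pragma comment(lib, "ws2_32.lib")

static int ReadGpuTemperature() { return 65; }   // placeholder for the real driver query

int main()
{
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return 1;

    SOCKET listener = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);          // placeholder port
    bind(listener, (sockaddr*)&addr, (int)sizeof(addr));
    listen(listener, SOMAXCONN);

    for (;;)
    {
        SOCKET client = accept(listener, NULL, NULL);
        if (client == INVALID_SOCKET) continue;

        char request[1024];
        recv(client, request, sizeof(request), 0);   // read and ignore the request

        std::ostringstream body;
        body << "GPU0: " << ReadGpuTemperature() << " C\n";
        std::string response =
            "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\n" + body.str();

        send(client, response.c_str(), (int)response.size(), 0);
        closesocket(client);
    }

    WSACleanup();
    return 0;
}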

If you have found a bug or would like another feature added please drop me a comment or email.

Monday 19 October 2009

Amdahl's law

A few months ago I made a post mentioning how I don't conform to the Amdahl's law way of thinking but never went into any details.

The law describes the speedup that can be obtained if you can parallelize a section of your problem. The speedup that can be obtained is described by the following equation:

$\frac{1}{(1-P)+\frac{P}{S}}$

Where P is the proportion of the problem that can be parallelized / sped up and S is the speedup amount.

Assuming that $S \to \infty$, then $P/S \to 0$, which leaves us with $\frac{1}{1-P}$.

This implies that no matter how many processors or speed improvements we throw at the P portion of the problem, we can never do better than $\frac{1}{1-P}$. And the biggest percentage improvement over the baseline comes with low values of S (or relatively low numbers of parallel processors). This result is observed in the field time and again: very seldom does throwing more than 4 or 8 processors at a problem speed it up much more than the large gains you get from the first 2 or 4 processors.
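As a concrete illustration (the numbers here are my own, chosen for the example): take P = 0.9, i.e. 90% of the work parallelizes.

$\frac{1}{(1-0.9)+\frac{0.9}{4}} \approx 3.1, \qquad \frac{1}{(1-0.9)+\frac{0.9}{8}} \approx 4.7, \qquad \frac{1}{(1-0.9)+\frac{0.9}{16}} \approx 6.4, \qquad \lim_{S\to\infty} \frac{1}{(1-0.9)+\frac{0.9}{S}} = 10$

So going from 4 to 16 processors only moves you from roughly 3.1x to 6.4x, and no amount of hardware gets you past 10x while that 10% serial portion remains.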

The equation generalizes to multiple P terms, each with an associated S, in order to describe a more complex / lengthy problem (with P1 + P2 + P3 = 100%; any purely serial portion is just a section with S = 1):

$\frac{1}{\frac{P_1}{S_1}+\frac{P_2}{S_2}+\frac{P_3}{S_3}}$

Certain problems where P is large do respond well to an increase in processors; these are known as "embarrassingly parallel", and ray tracing is rather a good example.



So why do I not agree with this if the equation makes sense?

The assumption that the P sections can only be accelerated by their own S, and that they are strung together in a strictly serial fashion, is rather simplistic.

Why do we have to finish P1 before beginning P2? Even if the P2 section has dependencies on P1, it's rare for the entirety of P2 to depend on a single result (of course there are cases - reduction kernels etc.).

Maybe P3 can overlap P1 and P2. Some sections may benefit from having more processors, while others may reach an optimum at two. Why not overlap the sections and supply each with its optimal processing power? This is easy to achieve with Directed Acyclic Graphs (DAGs) and can even be computed on the fly, although they do get rather large!
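A toy sketch of the idea, with the sections expressed as tasks in a tiny dependency graph: anything whose dependencies are satisfied runs concurrently instead of waiting for the previous section to finish. The task names and the std::async scheduling are illustrative only; a real scheduler would launch each task the moment its own dependencies complete rather than in waves.

// DAG-style overlap of sections: P1 and P3 run together, P2 follows P1.
#include <functional>
#include <future>
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Task {
    std::string name;
    std::vector<std::string> deps;    // names of tasks this one depends on
    std::function<void()> work;
};

int main()
{
    std::vector<Task> tasks = {
        {"P1", {},     []{ std::cout << "P1 running\n"; }},
        {"P2", {"P1"}, []{ std::cout << "P2 running\n"; }},
        {"P3", {},     []{ std::cout << "P3 running\n"; }},   // overlaps P1 and P2
    };

    std::set<std::string> done;
    while (done.size() < tasks.size())                 // assumes the graph is acyclic
    {
        // Launch every not-yet-run task whose dependencies are all satisfied.
        std::vector<std::pair<std::string, std::future<void>>> running;
        for (const Task& t : tasks) {
            if (done.count(t.name)) continue;
            bool ready = true;
            for (const std::string& d : t.deps)
                if (!done.count(d)) ready = false;
            if (ready)
                running.emplace_back(t.name, std::async(std::launch::async, t.work));
        }
        for (auto& r : running) { r.second.get(); done.insert(r.first); }
    }
    return 0;
}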

Quoting Amdahl's law as a reason why no further speed benefits are available in a system really just shows that thinking is still stuck in serial mode with little bursts of parallelism thrown in. Let's start thinking parallel in all areas and make the most of all available compute resources.

Thursday 1 October 2009

Fermi

I've just finished reading the white paper released by NVIDIA, which you can find here.

Rather interestingly, no mention of graphics performance has been made, which, in a way, is really exciting. This has clearly been aimed at the high-performance / throughput computing markets, with the notable inclusion of ECC memory and increased double-precision throughput, along with the updated IEEE 754-2008 floating-point support.

Concurrent kernel execution and faster context switching will allow, with the use of DAGs, the optimization of execution on the device itself, rather than just working out the most efficient order in which to execute kernels sequentially.
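In CUDA terms, concurrent kernel execution is exposed through streams: independent nodes of a DAG can be dropped into separate streams and the hardware is free to overlap them. A minimal sketch, with kernelA and kernelB made up purely to stand in for two independent sections:

// Two independent kernels in separate streams - candidates for concurrent execution.
#include <cuda_runtime.h>

__global__ void kernelA(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernelB(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *dA, *dB;
    cudaMalloc((void**)&dA, n * sizeof(float));
    cudaMalloc((void**)&dB, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent work goes into different streams; on hardware that supports
    // concurrent kernels these may overlap instead of running back to back.
    kernelA<<<(n + 255) / 256, 256, 0, s0>>>(dA, n);
    kernelB<<<(n + 255) / 256, 256, 0, s1>>>(dB, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}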

Also tucked away in the white paper is the mention of predication at the instruction level, which should give greater control over divergent paths in your kernels.

The inclusion of C++ support will appeal to a lot of people, but I am rather unconvinced this is the correct way to go for throughput computing, as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor at enabling maximum throughput.

There is a lot more in the paper, and there has already been an announcement from Oak Ridge that they will be using it in a new supercomputer.

All in all it's a wonderful development, and I can't help feeling that computing took a substantial leap forward today.