ComputeCube: June 2009

Thursday, 25 June 2009

Glass half full or half empty?

Or, as this is an HPC / CUDA / parallel processing site:

Gustafson's Law or Amdahl's Law?

Personally I prefer Gustafson's Law... it seems more logical to me. Or is this just because I'm inherently an optimist?
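For anyone who wants to play with the numbers, here is a minimal Python sketch of the two laws in their usual textbook form. The function names and the 95% parallel fraction are purely illustrative assumptions, not measurements from any real code.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl: the problem size is fixed, so the serial part caps the speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def gustafson_speedup(parallel_fraction, n_processors):
    """Gustafson: the problem grows with the machine (scaled speedup)."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * n_processors


if __name__ == "__main__":
    p = 0.95  # assumed parallel fraction, purely illustrative
    for n in (1, 16, 240, 1024):
        print(f"N={n:5d}  Amdahl={amdahl_speedup(p, n):8.2f}  "
              f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

Note that the two laws measure the parallel fraction differently: Amdahl takes it from the run time on a single processor, Gustafson from the run time on the parallel machine. So the two columns answer different questions rather than contradicting each other.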

I would be quite interested in hearing your views on this - so comments / forum posts are most welcome.

In other news: The thermal monitor downloads have gone over 80! :) The updated version (v0.2) is ready after mucking around with subclassing a control.... I will release it soon...

Otherwise I have been extremely busy debugging a sparse matrix solver - bugs in huge datasets can be hard to find! Even with the aid of the Gold / Silver / Bronze kernels mentioned in the last post they have been proving remarkably tricky to isolate. Rather surprising to me is the fact that the long long data type doesn't consume a lot more processing time than a normal unsigned int - so I have been using it wherever there is a risk of exceeding 2^32.
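To show the kind of silent wrap-around that makes a wider type worth using, here is a small Python sketch that mimics 32-bit and 64-bit unsigned arithmetic with bit masks. The row-major index calculation and the function names are hypothetical; they are not taken from the solver itself.

```python
UINT32_MASK = 0xFFFFFFFF          # keeps the low 32 bits, like a C unsigned int
UINT64_MASK = 0xFFFFFFFFFFFFFFFF  # keeps the low 64 bits, like unsigned long long


def flat_index_u32(row, n_cols, col):
    """Row-major index as 32-bit unsigned arithmetic would compute it."""
    return (row * n_cols + col) & UINT32_MASK


def flat_index_u64(row, n_cols, col):
    """The same index with 64 bits of headroom."""
    return (row * n_cols + col) & UINT64_MASK


if __name__ == "__main__":
    # 50,000 * 100,000 = 5,000,000,000, which is larger than 2^32.
    print(flat_index_u32(50_000, 100_000, 0))  # wrapped, points at the wrong element
    print(flat_index_u64(50_000, 100_000, 0))  # the intended index
```

The 32-bit version raises no error; it simply returns a valid-looking index into the wrong part of the data, which is exactly why this sort of bug is hard to spot in a huge dataset.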

CFD code coming soon too - although I have unwound a lot of the optimizations in order to make it easier to understand and possibly be a good foundation for your own optimizations.

Right .... back to the grindstone!

Monday, 22 June 2009

Gold, Silver and Bronze

Kernels of course! :)

Most of the readers of this blog should be familiar with a "Gold" kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.

Personally I like most of my gold kernels to be naive implementations of an algorithm. This makes them easy to verify and usually easy to debug if there is a problem.

If you currently don't implement a Gold kernel before writing your CUDA implementations and/or adapting your algorithm, I strongly suggest you do.

The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.

A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is kept as simple as possible so that it doesn't stray too far from the Gold kernel's implementation - the only change allowed is unwinding the loops into grids / blocks where possible. I use this type of kernel when I am writing something that depends on numerical precision. Once written and verified against the Gold kernel within acceptable numerical limits, it becomes the new baseline kernel for later optimizations. This allows later kernel outputs to be matched exactly rather than using an "acceptable deviation" approach.
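The two kinds of comparison described above can be sketched in a few lines of host-side Python. This is only an illustration under my own assumptions: the saxpy-style operation, the function names and the tolerances are hypothetical stand-ins, and the "Silver" output is hard-coded rather than read back from a GPU.

```python
import math


def gold_saxpy(a, x, y):
    """Naive reference ("Gold") implementation: one obvious loop, no tricks."""
    out = []
    for xi, yi in zip(x, y):
        out.append(a * xi + yi)
    return out


def matches_within(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """'Acceptable deviation' check: verifies a Silver result against Gold."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def matches_exactly(baseline, candidate):
    """Exact check: used once the Silver output has become the baseline."""
    return list(baseline) == list(candidate)


if __name__ == "__main__":
    gold = gold_saxpy(2.0, [1.0, 2.0], [0.5, 0.25])
    silver = [2.5000001, 4.25]       # stand-in for values copied back from the GPU
    optimized = [2.5000001, 4.25]    # stand-in for a later, optimized kernel
    print(matches_within(gold, silver))        # Silver is close enough to Gold
    print(matches_exactly(silver, optimized))  # later kernels must match Silver exactly
```

The tolerance check is only needed once, for Gold against Silver. After that, any difference at all between an optimized kernel and the Silver baseline is worth investigating.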

Compute Cube

In a previous post I mentioned fitting my Tesla C1060 onto my aging Asus motherboard. It has been working well but a combination of slow host<->device transfer speeds of less than 1GB/s, 2GB of RAM and a relatively slow processor encouraged me to upgrade.

Some of the prerequisites for my new personal super computer were:

a) Must be small - my desk area is limited and I don't like putting computers on the floor where they consume dust better than any known vacuum cleaner...

b) Must have at least 2x PCI Express 2.0 (gen 2) slots, as for decent GPU computing you need to get data in and out of the device as quickly as possible.

c) As quiet and cool as possible.

As it turns out the last one was the most tricky and needed the C1060 to do a bit of CFD for the airflow in the case.

After a lot of research, measurement and two days of building here are some pictures of the final result. The case is only 11" x 11" x 14" - ok it's not exactly a cube.... but close enough :) The tape measure in the photos is to give some sense of scale.

Many thanks to NVidia who very kindly sent me two NVidia Tesla logos for me to stick onto the case!

[gallery]

nd Visualization

Last Friday I had an opportunity to meet the guys behind Curvaceous who have a suite of Geometric Process Control tools.

Although at first glance it appears to be a multi-variable plot, it is actually a visualization of an n-dimensional structure. This visualization technique enables engineers to quickly see the optimal settings for their process. The plot is made using nd->2d transformations, and they have a form of query language to filter / rearrange the ranges and axes in order to highlight and discover the relevant information. They have multiple patents on the innovative techniques involved. This enables site engineers to rapidly control and adjust the production process in order to maintain consistent output.

Good stuff! Unfortunately their website doesn't give a full idea or demonstration of what they do, but if you are in need of Process Control software that can handle thousands of sensor inputs and masses of historical data, and present it all in an easy to use fashion, you won't go wrong with their Geometric Process Control Tool Suite.

Feeling inspired by their technique I came up with a very simplistic visualization of a bouncing ball (inelastic case). I made a slight modification by allowing an axis to be rotated into a 3rd dimension in order to aid visualization. In this case I rotated the time axis. See below for screenshots. Although this is a very simplistic case it really demonstrates the power of the technique.

CUDA Emulator Output

As many people have noticed the same code executed in Emulator mode gives different floating point results from the kernels run in Debug or Release mode.

Although I know what causes this, I have never bothered to investigate the actual differences, as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results between the CPU and the GPU and wrote some code to change the FPU settings. Firstly, a quick explanation:

By default the CPU (FPU) is set to use 80-bit floating point internally. This means that when you load an integer (fild) or a single / double float (fld) it gets converted to an 80-bit number on the FPU stack. All operations are performed internally at 80 bits, and when the result is stored (fst / fstp) it is converted back to the requested floating point width (single / double).

This method of operation is desirable as it reduces the effect of rounding / truncating on the intermediate results. Of course while very useful for computing on the CPU this is not how the CUDA devices operate.
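The effect of wide intermediates is easy to reproduce without touching the FPU control word. The Python sketch below is only an analogy under stated assumptions: Python floats are 64-bit doubles, so the double stands in for the wide internal format and a 32-bit float for the stored width; it does not model the real 80-bit x87 registers or any particular CUDA device. The function names are mine.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


def sum_narrow_intermediates(values):
    """Round the running total to 32 bits after every single addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def sum_wide_intermediates(values):
    """Keep the running total wide and round to 32 bits only when storing."""
    total = 0.0
    for v in values:
        total += to_float32(v)
    return to_float32(total)


if __name__ == "__main__":
    data = [0.1] * 100_000  # the exact answer would be 10000
    print("narrow intermediates:", sum_narrow_intermediates(data))
    print("wide intermediates:  ", sum_wide_intermediates(data))
```

Both functions start from the same 32-bit inputs and return a 32-bit result, yet the narrow version ends up off by more than 1 while the wide version lands on 10000. This is the kind of difference to expect when the same arithmetic is run with different intermediate precision.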

Forum Registrations

Regarding new forum registrations:

The forums are set to email me should a new person sign up. I usually check these emails two or three times a day but there is a large volume of what I suspect are spammers who are getting around the somewhat trivial captcha.

If you have tried to register legitimately and not received a response in 24hrs please drop me a comment here and I'll sort it out for you.

The Thermal Monitor hit 30 downloads today - ok not exactly a "killer app" - but so far not a single bug report :) The updated one will be released soon, I've just been snowed under with work.

Thursday, 4 June 2009

Thermal Monitor

At the time of writing there have been 20 downloads of the BV2 Thermal Monitor and so far not a single complaint / bug report :) I'm going to take this as a good sign and not that people are just not reporting issues. If you do have an issue please report it on the forums.

I will try to get an update with the "always on top" button and some minor bug fixes out some time this weekend.

Expect more screenshots from my Lattice Boltzmann Method - D3Q19 implementation too - although I had to roll back 3 evenings of work on it due to a myriad of introduced bugs :( Naughty me for not running it against the regression data every night.... I'm thinking about releasing the source code for this too as there is not much in the way of simple D3Q19 lattice source on the web at the moment.
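Until the source is out, here is a minimal Python sketch of the one piece every D3Q19 code shares: the 19 lattice velocities and their standard weights (1/3 for the rest particle, 1/18 for the 6 face neighbors, 1/36 for the 12 edge neighbors). The function name and the ordering of the directions are my own assumptions; real implementations, mine included, usually pick an ordering that suits their memory layout.

```python
from fractions import Fraction
from itertools import product


def d3q19_lattice():
    """Return the 19 discrete velocities and their standard weights."""
    velocities, weights = [], []
    for c in product((-1, 0, 1), repeat=3):
        speed_sq = sum(ci * ci for ci in c)
        if speed_sq == 0:        # rest particle
            w = Fraction(1, 3)
        elif speed_sq == 1:      # 6 face neighbors
            w = Fraction(1, 18)
        elif speed_sq == 2:      # 12 edge neighbors
            w = Fraction(1, 36)
        else:                    # the 8 corner directions are not part of D3Q19
            continue
        velocities.append(c)
        weights.append(w)
    return velocities, weights


if __name__ == "__main__":
    vels, wts = d3q19_lattice()
    print(len(vels), "velocities, weights sum to", sum(wts))
```

Using exact fractions makes it easy to confirm that the weights sum to 1 before converting them to floats for the actual kernel.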

SPIE Medical Imaging Conference - call for papers

Yesterday I received a call for papers brochure in the post. I am not a member of SPIE so was a bit surprised to receive this at my home address. There was no covering letter in the envelope so I am assuming they may want a little promotion here:

The conference is in San Diego, CA, USA from 13th to 18th February 2010 and the abstract submission deadline is 20th July 2009.

Some of the suggested paper topics that could be of interest to readers of this blog are:

molecular imaging

tomographic image reconstruction

image processing and analysis

visual rendering of complex datasets

There are a lot more - so have a look at their site here and the brochure here

By all accounts these conferences are well presented and attended, although as most of the visitors to this site are from Europe it could be a little far away.
