Thursday 3 December 2009


I noticed this linked from both Atom and Real-time Rendering blogs. Cyril Crassin, the guy behind the amazing gigavoxels raytracing, has got a 3d Mandelbrot fractal rendering in real time.

As we know there isn't actually a 3rd dimension to the imaginary plane so some manipulation is required. The chap who discovered a good way of transforming it to 3d ( the Mandelbulb ) has a website which you can find here. Well worth reading and some pretty amazing images!

As a side note: I've owned the book: "Real-time Rendering" for a number of years now and it is an invaluable resource. The Real-time Rendering blog mentioned above is the blog by the authors of the book.

Wednesday 28 October 2009

C / C++ and STL

Before everyone gets really upset with the rest of this post, as is the trend in the OO community...  I thought I'd start, rather than end, with a disclaimer:  I use C++ and STL on a daily basis in my job, although I don't use all of what stl has to offer it does make coding in c++ much easier. C++ in itself does allow fairly elegant code (if constructed carefully) whilst providing a decent level of code performance. So I do actually like C++ and stl and they make my life at work much better :)

But this blog isn't about my day job....  It's about my tinkering with the wonderful world of parallel algorithms and CUDA code.

What a lot of people don't realize is that you *can* use stl, c++ classes and templates in a .cu file. As long as its client side code you should be fine. I've had a few compiler crashes when using stl especially the sort. To sort this out I used the overloaded < operator in your class, don't try and define a custom < method it will crash the compiler.

Tuesday 27 October 2009

GPU Temperature Monitor

As of writing the combined download count of the GPU Thermal Monitor has hit 520 :)

So far I'm yet to receive any major feedback on bugs etc which leads me to believe it: a) works perfectly or b) no-one is bothering to report issues. As I'm an optimist I'm going with option a  :)

I've had more requests for remote monitoring of the GPU temperature via a simple http request. This is something I need myself in order to keep track of temperatures in remote machines. This is now built in and in testing and bug fixing, hopefully to be released soon. I've not used completion ports as they seemed like overkill for what should be a light traffic application but as the source is included and under creative commons license please feel free to add them if needed. Secondly having it open source allows for some code review, which is important for security reasons as it now allows remote connections.

If you have found a bug or would like another feature added please drop me a comment or email.

Monday 19 October 2009

Amdahl's law

A few months ago I made a post mentioning how I don't conform to the Amdahl's law way of thinking but never went into any details.

The law describes the speedup that can be obtained if you can parallelize a section of your problem. The speedup that can be obtained is described by the following equation:


Where P is the proportion of the problem that can be parallelized / sped up and S is the speedup amount.

Assuming that S->infinity  then P/S -> 0  this leave us with \${\sfrac{1}{(1-P)}}\$

This implies that no matter how many processors / speed improvements we make to the P portion of the problem we can never do better than  \${\sfrac{1}{(1-P)}}\$   And the biggest % improvement from the baseline comes with low values of S (or relatively low numbers of parallel processors). This result is observed in the field time and again. Very seldom does throwing more than 4 or 8 processors at a problem speed it up any more than the large gains you get from the first 2 or 4 processors.

This equation does expand with multiple P and associated S terms in order to describe a more complex / lengthly problem: (P1+P2+P3 = 100%)


Certain problems where P is large do respond well to the increase in processors these are known as "embarrassingly parallel", ray tracing is rather a good example of this.

So why do I not agree with this if the equation makes sense?

The assumption that only P areas can be accelerated by S and strung together in a serial fashion is rather simplistic.

Why do we have to finish P1 before beginning P2?  Even if the P2 area has dependancies on P1 its rare to have the entire section of P2 to depend on a single result (of course there are cases - reduction kernels etc)

Maybe P3 can overlap P1 and P2, some may benefit by having more processors while others may reach an optimal at two. Why not overlap the sections and supply them with their optimal processing power? This is easy to achieve with Directed Acyclic Graphs (DAG's) and can even be computed on the "fly" although they do get rather large!

Quoting Amdahl's law as a reason why no further speed benefits are available in a system is really just showing that thinking is still stuck in serial mode with little bursts of parallelism thrown in.  Lets starting thinking parallel in all areas and make the most of all available compute resources.

Thursday 1 October 2009


I've just completed reading the white paper released by nvidia which you can find here.

Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.

Concurrent kernel execution and faster context switching will allow, with the use of DAG's, the optimization of execution on the devices rather than just working out the most efficient order of kernels to execute sequentially.

Also tucked away in the white paper is the mention of have predication at the instruction level which should give greater control of divergent paths in your kernels.

The inclusion of C++ support will appeal to a lot of people but am I rather unconvinced this is the correct way to go for throughput computing as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.

There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.

All in all its a wonderful development and I can't help feeling that computing took a substantial leap forward today.

Friday 25 September 2009

Work and Spam...

Yes I am still alive... :)    I've been exceptionally busy at work lately, not complaining though as work pays the bills and gpu / compute things are just a hobby.

What I do find rather frustrating is in the limited time I have to maintain the blog, it all gets eaten up by wading through mountains of spam. Akismet does a fantastic job, but on the forums side the bot detection captcha is pathetic and I have around 100 signups / 100's of messages to check every day. So until a later date when I can fix / upgrade it I will be suspending the forums from today.

On a rather more positive note:  Nvidia has announced OptiX ray tracing engine. Some examples are out already so worth having a look at.  Nvidia has also announced Nexus which looks incredible. I have heard that they may charge for it but with the features it apparently has I would be willing to pay.

For some of the latest conferences papers and pre-prints have a look at Ke-Sen Huang's page which he keeps updated with some really nice stuff.

As for me - no new cuda / compute development for a while. When I get a few mins free I'm tending to work on the maths for the stuff I want to do but never get enough time to implement anymore. Hopefully this will change in 2 or 3 months once work settles down a bit. On the subject of work there have been one or two questions over email and on the blog of the steps taken after an edge detector has been applied - unfortunately I cannot answer this as it gets a bit close to what I do for work (OCR and document processing). Rather interesting one of my ray tracing acceleration algorithms which turned out pathetically for its purpose has been rather good at identifying features in 2d/3d images but more on this at a later date once its a bit more complete.

Thursday 30 July 2009

3D Gaussian Convolution

There hasn't been much in the way of posts here lately as I've been really busy at work getting some new components built into the systems I work on. Not really hard but it's frustrating things like trying to get various components and libraries written in different languages to work together. So lately I've not had the energy to do much work on the computer once I get home...

There has been a bit of interest in my 3D Gaussian convolution kernels. Although I explained the technique mathematically in an earlier post I never actually posted the code. As it is rather quick and quite a novel way of calculating the convolution for a xy plane I decided to post it so everyone can benefit from / improve upon the technique.  As always comments / bug reports etc are always welcome :)

Thursday 23 July 2009

Happy blog day!

My little blog is officially one year old today! :)

Over the last year I've made many more posts than I was originally planning to make and judging by the number of subscribers and variety of institutions and people who visit it's not entirely all rubbish, or so I like to tell myself.

Ongoing projects that will hopefully get completely or improved in the coming year are:

Thermal monitor - networking coming soon

More raytracing (of course) - linking PhysX to it (thanks to Timothy's post here for inspiration) to improve on my rather primitive bouncing balls.

SPH and Lattice methods for fluid simulation

Sorting algorithms - I haven't posted much on this lately but some decent results so far

Massive Data set processing.

Anything else that grabs my interest

Tuesday 21 July 2009

Throughput Computing

Two weeks ago I attended a seminar at the Daresbury laboratory near Warrington. Unfortunately I only had the day off so couldn't attend the workshops on offer and buttercup, my trusty landy, had to drive there and back on the same day. Sorry to everyone on the m4, m5 and m6 that day :p

Jack Dongarra presented a rather nice summary of supercomputing / parallel computing or as he called it "Throughput Computing".  I had never heard this term before and now really like it. It implies that you should make the best of available resources in order to maximize throughput. It's very easy to get into a mindset where everything has to be in parallel but it's possible by doing so you neglect to notice that some parts work very well in serial.

Friday 3 July 2009

Numerical Precision

Numerical precision is an ongoing concern of mine especially in big / long running simulations and solvers.

I came across an article by Rob Farber on the site this morning that asks the question "How much is Enough?".  Although no definitive answers are presented the author summarizes the current and future concerns over accuracy.

Personally I don't believe floating point is the way forward. Floating point is fast to calculate in hardware but is not always an ideal way of representing numbers. Although the various branches of mathematics are largely base independent humans are most comfortable with base 10 while computers are of course most comfortable with base 2. This does result in some situations when a calculation in base 10 with only a few decimals of precision gives precise results whereas a calculation in base 2 is incapable of giving a precise result even given N bits of precision although the result is probably acceptable after n bits.

I'm not presenting any solution to the precision problem, but merely pointing out that sometimes the issue is caused by:   using base 2 for calculations  and/or  the floating point representation of these numbers.

Wednesday 1 July 2009

BV2 Thermal monitor v0.11

Here is a small update to the Thermal Monitor which implements the "always on top" feature. This can be enabled and disabled by right clicking on its little system try icon.

This is not the point release as planned as some of the additional features were appearing a bit unstable on my Win XP pro 64 machine. Until I find time to sort them out I thought I would release the most requested feature.

The source code is included. It is still in masm and does include routines for non-standard window shapes and transparent blts etc - could be worth a look if you are interested in masm32.

It is now available in the downloads section or here:

[download id="4"] - Please see Licenses Section for license details. By downloading you agree to be bound by the terms of the Creative Commons Attribution Share Alike license.

Thursday 25 June 2009

Glass half full or half empty?

Or as this is a hpc/cuda/parallel processing site:

Gustafson's Law or Amdahl's law?

Personally I prefer Gustafson's Law .... it seems more logical to me or is this just because I'm inherently an optimist?

I would be quite interested on hearing your views on this - so comments /forum posts most welcome.

In other news:  The thermal monitor downloads have gone over 80! :)  The updated version (v0.2) is ready after mucking around with subclassing a control.... I will release it soon...

Otherwise I have been extremely busy on debugging a sparse matrix solver - bugs in huge datasets can be hard to find! Even with the aid of Gold / Sivler / Bronze kernels mentioned in the last post they have been proving remarkably tricky to isolate. Rather surprising to me is the fact that the long long data type doesn't consume a lot more processing time than a normal unsigned int - so I have been using that wherever there is a risk of exceeding 2^32.

CFD code coming soon too - although I have unwound a lot of the optimizations in order to make it easier to understand and possibly be a good foundation for your own optimizations.

Right .... back to the grindstone!

Monday 22 June 2009

Gold, Silver and Bronze

Kernels of course! :)

Most of the readers of this blog should be familiar with a "Gold" kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.

Personally I like most of my gold kernels to be naive implementations of an algorithm. This causes them to be  easily verifiable and usually easy to debug if there is a problem.

If you currently don't implement a Gold kernel before writing your CUDA implementations and/or adapting you algorithm I strongly suggest you do.

The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.

A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible making sure we don't vary from the Gold kernels implementation too much - only unwinding the loops into grid/blocks is allowed where possible. This type of kernel I use when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel it becomes the new baseline kernel before later optimizations. This allows exact matching of later kernel outputs rather than using an "acceptable deviation" approach.

Thursday 18 June 2009

Compute Cube

In a previous post I mentioned fitting my Tesla C1060 onto my aging Asus motherboard. It has been working well but a combination of slow host<->device transfers speeds of less than 1GB/s,  2GB ram and a relatively slow processor encouraged me to upgrade.

Some of the prerequisites for my new personal super computer were:

a) Must be small - my desk area is limited and I don't like putting computers on the floor where they consume dust better than any known vacuum cleaner...

b) Must have at least 2x pci express 2 (gen 2) slots as for decent GPU computing you need to get data in and out of the device as quickly as possible.

c) As quiet and cool as possible.

As it turns out the last one was the most tricky and needed the C1060 to do a bit of CFD for the airflow in the case.

After a lot of research, measurement and two days of building here are some pictures of the final result. The case is only 11" x 11" x 14" - ok it's not Exactly a cube.... but close enough :) The tape measure in the photos is to give some sense of scale.

Many thanks to NVidia who very kindly sent me two NVidia Tesla logos for me to stick onto the case!


Tuesday 16 June 2009

nd Visualization

Last Friday I had an opportunity to meet the guys behind Curvaceous who have a suite of Geometric Process Control tools.

Although at first glance it appears to be a multi-variable plot it is actually a visualization of a n'th dimensional structure. This visualization technique enables engineers to quickly see the optimal settings for their process. The plot is made using nd->2d transformations and then they have a form of query language in order to filter / rearrange the ranges and axes in order to highlight and discover the relevant information. They have multiple patents on the innovative techniques involved. This enables site engineers to rapidly control and adjust the production process in order to maintain consistant output.

Good stuff!  Unfortunately their website doesn't give a full idea or demonstation of what they do but if you are in need of Process Control software that can handle thousands of sensor inputs, masses of historical data and present it in an easy to use fashion you won't go wrong with their Geometric Process Control Tool Suite.

Feeling inspired by their technique I came up with a very simplistic visualization of a ball bouncing (in-elastic case).  I made a slight modification of allowing an axis to be rotated into a 3rd dimension in order to aid visualization. In this case I rotated the time axis. See below for screenshots.  Although this is a very simplistic case it really demonstrates the power of the technique.

Wednesday 10 June 2009

CUDA Emulator Output

As many people have noticed the same code executed in Emulator mode gives different floating point results from the kernels run in Debug or Release mode.

Although I know what causes this I have never bothered to investigate the actual differences as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results on the CPU<->GPU and wrote some code to change the FPU settings. Firstly a quick explanation:

By default the CPU (FPU) is set to use 80 bit floating point internally. This means that when you load in an integer (fild) or a single / double float (fld) it gets converted to a 80 bit number inside the FPU stack. All operations are performed internally at 80 bits and when storing the result it converts back to the correct floating point width (single / double)  (fst / fstp). 

This method of operation is desirable as it reduces the effect of rounding / truncating on the intermediate results.  Of course while very useful for computing on the CPU this is not how the CUDA devices operate.

Forum Registrations

Regarding new forum registrations:

The forums are set to email me should a new person sign up. I usually check these emails two or three times a day but there is a large volume of what I suspect are spammers who are getting around the somewhat trivial captcha.

If you have tried to register legitimately and not received a response in 24hrs please drop me a comment here and I'll sort it out for you.

The Thermal Monitor hit 30 downloads today - ok not exactly a "killer app" - but so far not a single bug report :)    The updated one will be released soon, I've just been snowed under with work.

Thursday 4 June 2009

Thermal Monitor

As at the time of writing there have been 20 downloads of the BV2 Thermal Monitor and so far not a single complaint / bug report :)   I'm going to take this as a good sign and not that people are just not reporting issues. If you do have an issue please report it on the forums.

I will try to get an update with the "always on top" button and some minor bug fixes out some time this weekend.

Expect more screenshots from my Lattice Boltzmann Method - D3Q19 implementation too - although I had to roll back 3 evenings of work on it due to a myriad of introduced bugs :( Naughty me for not running it against the regression data every night....  I'm thinking about releasing the source code for this too as there is not much in the way of simple D3Q19 lattice source on the web at the moment.

SPIE Medical Imaging Conference - call for papers

Yesterday I received a call for papers brochure in the post. I am not a member of SPIE so was a bit surprised to receive this at my home address.  There was no covering letter in the envelope so I am assuming they may want a little promotion here:

The conference is in San Diego, CA, USA between 13-18th February 2010 and the abstract submission deadline is 20th July 2009.

Some of the suggested paper topics that could be of interest to readers of this blog are:

molecular imaging

tomographic image reconstruction

image processing and analysis

visual rendering of complex datasets

There are a lot more - so have a look at their site here and the brochure here


By all accounts these conferences are well presented and attended, although, as most of the visitors to this site are from Europe it could be a little far away.

Friday 29 May 2009

Thermal Monitor Installation

Last night I received an email asking how to install the BV2 Thermal Monitor application. The simple answer is:  you don't.   I can see that it may not be clear from the original post. As the application is entirely self contained and does not depend on any C runtime you only need to:

extract the zip file to a directory of your choosing

double click the .exe file


You do need to have the NVidia drivers installed. You can make desktop shortcuts / start menu shortcuts to it in the usual manner. Please post any more queries to the forums.

Thursday 28 May 2009

BV2 Forums - take 2

The forums are back online :)  I switched from BBpress to phpBB, not that there is anything wrong with bbpress but I am more familiar with phpBB and the various options etc in the setups. phpBB is a bit overkill for such a little site but all the features and options are rather fun.

The three users who registered on the old forums will need to re-register but there were no posts apart from mine - so no loss there :)

I've had the odd email go missing from the forums but I think I've sorted out the mail server properly now. Please let me know if your registration mail goes missing.

BV2 Forums

For some reason most vistors to the site tend to email me rather than leave a comment. I really don't mind and try and answer them as quickly as possible. If you don't get a reply within a day the chances are my spam filter has eaten them...

In order to promote more open discussions about the topics on this site I have installed some forum software from bbpress (the makers of wordpress). You will find the forums here or click on the link on the top right hand of every page. BBpress integrates into wordpress so your blog user accounts should work in the forums too.

I am moving the employment / contract pages to the forum so people can post things themselves. The forums are protected by spam filters etc but if something does get through and I don't notice - please let me know.  Although at the moment there is very little chance of me not noticing as there are almost no topics!

Wednesday 27 May 2009

BV2 Thermal Monitor

Finally v0.1 of the BV2 GPU Thermal Monitor is ready for download. It is available in the downloads section of the site. The source code is included! Please note that I only partially translated the nvapi.h file to an .inc file.

Please note that it is licensed differently from other content here and is under the Creative Commons "Attribution-Share Alike 2.0 UK: England & Wales" License. Basically this means you are free to copy, use and modify the code/application in any way you like as long as: you give the original author credit and if you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.

By downloading you agree to be bound by the license.

Ok legalities out the way :)

BV2 Thermal Monitor:

[caption id="attachment_861" align="aligncenter" width="268" caption="BV2 Thermal Monitor v0.1 Screenshot"]BV2 Thermal Monitor v0.1 Screenshot[/caption]



Currently it is version 0.1 and does basic temperature monitoring of up to two NVidia GPU's.

It updates approximately every 500ms.

It can be minimized to the system tray where hovering the mouse over the icon will display a tooltip with the temperatures detected.

Tuesday 26 May 2009

GPU Temperature Monitor

I was hoping to release the GPU temperature monitor to the downloads section sometime during this last bank holiday weekend. I had also planned to sort out my somewhat ailing / overheating computer and perform some upgrades. Unfortunately the repair / upgrades took almost 2 full days. My pc is now running a lot cooler and a bit quieter and I am now something of an expert with heat spreader / heat sink cleaning and thermal grease application.  By the way artic silver is really good! Even before the 200 hour break in period mentioned on their site I'm already seeing more than 10 degrees lower temperature on the CPU.

As to the upgrades:  more posts on this later :) But it did include me purchasing Windows XP Professional 64bit.

The longer than expected pc maintenance time has impacted the GPU Thermal Monitor application and it won't be ready for a few more days. A bit of good news about it though:  it works perfectly on Windows XP 64bit without any modifications. Don't you just love asm :)  Some of my other C/C++ applications were not quite so happy on the new OS.

Keep watching this space for release date :)

Bletchley Park

As part of our "honeymoon" Catriona and I did a series of day trips around the South and Midlands of our not-so-little island. We are both members of the National Trust but don't often have time to visit all the locations on offer. We saw some lovely homes and parks and with Catriona having studied history she made a very capable tour-guide :)

One place that is worth a special mention is Bletchley Park. Sadly, it isn't part of the National Trust and receives no Government Funding (well done to Milton Keynes Council for giving £600,000 for urgent restoration).

Friday 22 May 2009

GPU Thermal Monitor

(update 27/5/09 - I've now released v0.1 - see this post )

It's been a while since my last post as I've been rather busy with work and some other projects - so time for a quick update. 

While working on some long running kernels I wanted to keep track of the GPU's temperature and ran the monitor app that came on their installation disks. The problem with the monitoring apps that I have is that they are quite large and take up a lot of screen space and for some unknown reason they crash / stop working when viewed over VNC (or similar) remote control app.  Now as I mostly use my CUDA machines over the network this is not a good situation.

So decided to write my own :)  Here is a screenshot of the initial version. It still needs config and some cleaning up (can you spot the glitch by the minimize button?) and possibly some remote reporting over the network functionality. Currently it can display the temperature from two GPU's and updates every 500ms. When minimized it sits in the system tray and updates its tooltip with the reported temperatures. I will be releasing it to the downloads section of the website sometime over the weekend. The current .exe is 146k (mostly the skin) and doesn't require any installer.

[caption id="attachment_823" align="aligncenter" width="252" caption="GPU Thermal Monitoring Application"]GPU Thermal Monitoring Application[/caption]

Nothing wrong with a bit of self promotion sometimes :)

Saturday 16 May 2009

Tesla C1060 memory performance #2

In my post a few days ago I mentioned that some of the numbers reported by the Visual Profiler needed further investigation.

In particular the device to device memory bandwith reported by the profiler differed from the value reported by the bandwith test sample. This was easily tracked down:  according to the documentation, the profiler divides the bytes read/written by 10^9 whereas the bandwith test sample divides by ( time in milliseconds * 2^20).    So: 1000*2^20/10^9 = 1.048576    i.e. the bandwith sample reports a lower number.

Tuesday 12 May 2009

D3Q19 Lattice

I managed to get some time last night to convert my LBM implementation to CUDA. Its far from optimal at the moment. Here is a screen shot showing the lid-driven-cavity with two obstacles, one horizontal and one vertical in the box. The visualization is from the middle plane. The obstacles are not shaded but are easily seen in the picture as an area of black. This image was taken after 16000 timeslices and some nice stable vortices have developed.

Lid driven cavity with obstacles

Once again the importance of making a gold kernel cannot be understated as I had quite a number of bugs in my initial CUDA implementation. I use a "multi-tap" type approach when debugging where the kernels write intermediate results to device memory as they go along. This can be easily compared to data coming from the various stages of the gold kernel and makes it much easier to identify the source of the error. Keep in mind the CUDA floats will never be the same as the CPU's floats (CPU uses 80 bits to compute intermediate results).

Tesla C1060 memory performance

According to the CUDA programming guide the memory coalescing rules have been relaxed in devices with compute capability 1.2 or greater.  Chapter 5 has a subsection entitled "Coalescing on Devices with Compute Capability 1.2 and Higher" which gives more information.

As I now have a C1060 which has compute capability of 1.3 I thought I'd run my old coalescing tests on it to see how it has improved.  I modified the launch configuration to 32760 blocks in order to maximize utilization of all the Multiprocessors and increased the thread count to 256. These changes cause the kernels to report 100% utilization in profiler.  I expected the uncoalesced part of the old tests to be quite kind to the Tesla as they fall within the parameters given in the programming guide and indeed the memory transfer rates were similar to a pure device to device copy. I then modified the uncoalesced kernels to avoid the auto coalescing of memory accesses in a half-warp.

Thursday 7 May 2009

D3Q19 Lattice

After playing with a D2Q9 lattice mentioned in a previous post I felt I'd learned enough to progress to the wonderful 3D world :) The arrival of my Tesla has also given me more processing power to move into three dimensions.

So far, in order to get data I can compare the CUDA kernels against, I have come up with a cpu gold kernel. For the first test have constructed a Box with 5 sides and constantly moving flow in the top layer of the box. I am only simulating incompressible at the moment.

I have made a very simple OpenGL viewer that can either show me the entire box or a plane through it. The direction of the velocity vectors is indicated both by their colour and direction of line while the magnitude of the velocity is indicated by the length of the line (thats why some spill over the edges of the box). The below image shows a section through the middle of the box.

[caption id="attachment_763" align="aligncenter" width="292" caption="Visualization from the gold kernel"]Visualization from the gold kernel[/caption]

The next step is to construct the CUDA kernels and compare their outputs. I'm hoping for a massive increase in lattice update speed whilst maintaining numerical stability. The gold kernel uses doubles for reference purposes.

Tesla C1060

On Tuesday I received my brand new Tesla C1060 :) right on the day it was promised by Armari.

A quick word about Tesla suppliers... Although Armari was efficient to deal with, some of the others were absolutely terrible. Possibly because I am a private individual only buying one unit or maybe just because their overall service is terrible. Their names are not mentioned here as I have complained already and it's only fair to give them a chance...

For those not familiar with the specs it has 30 multiprocessors giving 240 stream cores with 4GB of 512bit wide DDR3 memory. It is a rather large double slot 10.5 inch long PCI card.

Wednesday 29 April 2009

Visual Studio 2008 and CUDA

As promised I received my Visual Studio 2008 on Friday from polyhedron and set about installing it on Saturday. This post will describe how to get VS 2008 to work with CUDA.  I use CUDA 2.2 which is still under NDA so I wont be making any comments about its performance improvements today. Rather I will describe how to set up syntax highlighting, building and Intellisense.

Dynamic Holographic Displays

Every few days I have a look and see who has been visiting my site. There are lots of companies and Universities doing very interesting things and as they have found this site its likely they are interested in the same things I am, so its worth having a look at what they are busy with.

Today I found Zebra Imaging who have designed and built (alpha prototype) a dynamic holographic display. Although its difficult to see how it works on a 2D screen I'm imagining something like what R2D2 produced in Star Wars. Truly stuff of science fiction and will be interesting to see how well it works / what type of takeup it will get in the various fields they are targeting. Good luck guys! :)

Also worth a visit is the University of Rostock who are doing interesting things with liquid modeling and object<->liquid interactions. 

On the subject of visitors, I get a lot of Universities having a look here but in particular a big "hello" to Stanford who seem to visit their bunny on a regular basis :)

Tuesday 28 April 2009

3D Gaussian - in sections

After installing VS2008 it took me a little while to get all my projects to compile nicely again. Quite a big jump from 2003->2008 :) and a lot of my paths were a bit out. I'll post later about getting CUDA 2.2 to work in VS2008 and the problems I had making it cross compile.

Just time this evening for a quick update: I finished my 3D Gaussian convolution on my generated data set - the attached image has clearly visible joins in it, these are to make sure my segmented approach of processing big data sets works correctly. The are easily removed by overlapping the non-boundary segments. The colours are based on my transfer function and don't reflect the underlying data accurately yet. It renders the volume at around 40fps.

3D Gaussian convolution (segmented) output

OpenGL Bindless Graphics

When picking up my rss feeds this morning from Google Reader (which is really good) I came across a very exciting article from NVidia claiming a massive speed increase in OpenGL. The article is here

In the past I mostly used PBO's but have been using more and more VBO's in the CFD visualizations so I will give these new extensions a try soon. Keep in mind it only applies to 185 and above drivers.

Wednesday 22 April 2009

CUDA Processing of Large Data Sets

As those who follow this blog will know, I have been playing with large generated volumetric data sets (around 16GB). The first operation I performed on them was a 3D Gaussian convolution in order to smooth a lot of the randomness out. My first implementation of the kernel was just the naive one - take a point and look at all the surrounding points. As I'm using a 5x5x5 kernel this means 125 memory reads per point - very poor!

Monday 20 April 2009

Qt Creator or Visual Studio 2008

I have been using Visual Studio 2003 for a long time now - since it first came out in fact. It has performed it's job well. The random crashing of Studio 6 seemed to have been resolved and the autocomplete worked (75% of the time) Intellisense was still a bit of a nightmare but overall I lived with it quite happily.

Sunday 19 April 2009

Volumetric Test Data

Here are two screenshots of my 512x512x512 generated volume.  The one on the left is formed by generating random points in rough bands along the y axis with a higher noise factor towards the top - hopefully representing less dense material. Hidden inside the volume is a cube of much lower density than the surrounding regions.

The one in the centre is the same volume after the 3D Gaussian kernel was applied although it is at a slightly different angle. The right hand side screen capture is using a higher transparency function to show the little hidden cube.


I did run into a problem with the watchdog timer generating a timeout when running the kernel on the full 4kx4kx4k data set. Strangely switching the device to my non-display card didn't help at first. It turns out that Win XP had decided that my second 8800GT had a monitor attached to it, which seems to have activated the watchdog timer for the video driver. Something worth looking out for if you get watchdog timeouts on secondary cards.

Friday 17 April 2009

Large Data Sets and 3D Gaussian

Firstly a small fix:

In the post about the 2D gaussian there was a typo:

\$f(2,2) = Ae^{-(\sfrac{2^2+2^2}{1.4^2})}\$

should read

\$f(2,2) = Ae^{-(\frac{2^2+2^2}{2(1.4^2)})}\$ 

this has now been fixed in the original article.

Last night I successfully made my 2048*2048*2048 * 2 bytes data set. The random data was generated using the mersenne twister algorithm with a fixed seed so I can recreate it later if needed. I also created a smaller set of 512x512x512 in order to interactively test my kernels.

Thursday 16 April 2009

Oil and Gas

Oil and Gas: What a fascinating industry! With all the different technologies and processes involved there is so much to learn.

Why the sudden interest you may ask? Well I may have an opportunity to work for a company involved in supplying software to the oil and gas sector in a part of the UK in which we would very much like to live and doing a job with such a wide variety of interwoven fields is rather appealing. As I have no oil industry experience I have been reading and googling as much as possible.

It seems like the software and technologies required revolve around:  3D visualization and analysis of volumetric data (seismic), construction of data based on field measurements, Geographic Information Systems, GPS, fluid / gas flow simulations in porous media or in hollow volumes, visualization of the strutures (rigs etc), drilling planning and plotting. And I'm sure there are even more things to discover.

Wednesday 15 April 2009

Playing with the Lattice Boltzmann Method

I've been learning more and more about the LBM (Lattice Boltzmann Method) recently and decided to implement a very basic 2D one last night. More to educate myself than for anything useful.

I decided on the D2Q9 Lattice as it should give nice results in 2D while being less complex to implement then the D3Q19 (3D) one.

My colour mapping isnt great and for some reason the simulation isnt responding as expected to different Reynolds Numbers. My initial conditions and boundary conditions need work too. For what its worth here are two screenshots of the velocity field in a pipe with 2 Cylinders projected through it. (Lighter colours indicate higher velocity)



At 2:30 on 4/4/2009 Catriona and I got married :)

We had a lovely weekend at Loch Ness Lodge with our families and some close friends, culminating with our wedding ceremony on Saturday followed by the wedding breakfast.

After three days of sunshine it was inevitable to get rain on Saturday, but as wearing a kilt/waistcoat/jacket can get rather warm having a cooler day was rather pleasant. By early evening the rain had cleared which gave us an opportunity to have some outdoor photographs. (will post some soon)


Scott and the team at Loch Ness Lodge do a fantastic job and really do make you feel at home, if "home" is a well appointed lodge with "fairies" whom you never see but do everything for you. The food they prepare from tea to sandwiches to five course wedding breakfast is superb using really fresh and high quality ingredients.

Friday 27 March 2009

2D Gaussian Function

In my last post I mentioned using 5x5 Gaussian kernel as a Point Spread Function to make an image blurry.

What occured to me last night is that these Gaussian kernels are demonstrated so many times in image processing without any explanation of the underlying function used. A Gaussian function is in fact continuous and not like the 5x5 matrix I showed at all.

So to explain a bit more lets derive our 5x5 Gaussian kernel in the last post from the 2D Gaussian function (I've installed LaTex plugin to wordpress to make it a bit more readable):

Thursday 26 March 2009

Canny Edge Detection Steps

As promised in my previous post and in response to a comment  , here is a step by step explanation of the Canny Edge Detector. I have some images from a previous post available here.

Firstly we need to apply a Gaussian Point Spread Function (PSF) to our image. This will make the image seem slightly blurry. Increasing the width of the filter will of course increase the blurriness of the image. A wider filter tends to cause the Canny Edge detector to detect longer smoother lines while a narrow filter allows it to find shorter lines.

Almost Wedding Time!

Regular readers of this blog will remember that I got engaged last year ( Time really does fly and we are getting married in less than two weeks time! :)    I might even post a few photos here :)

This news hopefully explains the lack of updates on any projects lately although I have done a bit of work on the raytracer.  Interestingly some of the ray tracer accelerations can also be used in the fluid modelling project with a few tweaks.  On the visualization interface front I've been playing a bit more with the new qt4.5 and it is pretty decent. I'm still loving the new IDE.  But....  in terms of quickly making an extensible, good looking and function interface Flex 3 is winning by a large margin.  Actually I should clarify that the flex isnt actually doing any calculations its merely displaying images it is retrieving from the rendering service. If the app itself was calculating and rendering then I would probably pick qt4.  Personally I like the idea of seperate bits doing things on their own, which is nice for redundancy / scalability but not everyone likes to work in this manner.

Monday 9 March 2009

Qt Revisited

I managed to find some time to play with Qt this weekend. Firstly, I am rather impressed by the new IDE: While not quite as good as eclipse in terms of editing it seems more stable and uses FAR less memory. It is definately more user friendly than my visual studio 2003 and has a nice auto-complete feature that seems to actually work.

The new widgets and libraries are great if a little large and its pretty quick to whip something together. A nice touch in the IDE is the ability to just drag and drop signals to slots right on the visual layout of the form.

Friday 6 March 2009

Qt 4.5

As my raytracer is getting more and more complex with lots of dynamic options and movement controls trying to remember keypresses from a command window to turn things off and on is getting increasingly difficult.

The time has come to build a proper interface for the rendering window. Now as I dont want to spend too much time developing this from scratch I thought I'd have another look at Qt (pronounced cute). Its been quite a while and a good few versions since I last used it, and am pleasently surprised at the improvements and additions. As it does offer openGL widget / rendering it is the perfect library to use for my raytracer.  Using openGL and Qt still gives me the option of going cross-platform on the raytracer and as the memory handling / page locked mem is faster on linux this could give a few more fps :)

In short, if you are looking at a C++ friendly interface designer you could do a lot worse than using the new Qt. (

Thursday 5 March 2009

GPU Radix Sort

In the ongoing search of finding efficient GPU sorting algorithms I came across the following paper:   by Nadathur Satish, Mark Harris, and Michael Garland.  It is a preprint for the 23rd IEEE Int’l Parallel & Distributed Processing Symposium, May 2009 (

I have not tested if it is faster than the GPU quicksort ( Normal quicksort does slow down on nearly sorted or already sorted data so a radix sort might exhibit more stable behaviour.

Thursday 26 February 2009

Voxels, Dynamic Grids, KD Trees

During the course of developing my raytracer I've investigated various acceleration structures and made comments on them in previous posts.

My primary focus has been ray tracing of dynamic scenes where you would need to re-calculate the acceleration structure for each frame. As a result of this I ruled out KD-Trees (even SAH partitioned ones) early on and concentrated on methods which could be implemented well in parallel.  The voxel method with fixed maximum object counts served me very well in early implementations.  It is very quick to compute in parallel and its memory accesses on traversal are very kind to a CUDA implementation.  However as the number of objects in my scenes have increased, it has caused problems with this fixed max element voxel structure.  It is possible for the number of element to exceed the maximum in the cell thereby resulting in visual artifacts / lost objects.  Increasing the maximum elements per cell slows down the raytracing stage too much even though the objects are no longer lost.  To overcome this limitation I increased the number of cells in the grid.  Again this works, but slows down the raytracing due to the higher number of voxels needing to be traversed. I was using a mesh of slightly overlapping spheres as they are much quicker to run intersection tests against rather than the standard cube approach.

Monday 16 February 2009

My computer is full of Stars!

And galaxies! On thursday my colleague Richard ( pointed me to  Apparently he studied cosmology before completing his Phd in the Semantic Web and still has an interest in it.

So far I've downloaded the small sets of data. The full set is around 100gb of compressed fits images so will take a while on my rather slow BT connection, not to mention exceed my 30gb cap...  I have not looked at the data yet and will need to construct or download a fits viewer.

Thursday 5 February 2009


This morning I woke up to discover even more snow had fallen than there had been on Monday. The weather people claim that this is rather unusual for this part of England.

Here is a picture of Buttercup, my Landrover, on our way to work this morning. The towrope and winch at the ready to help any lesser vehicles.

[caption id="attachment_450" align="alignnone" width="450" caption="On the way to work after heavy snow."]On the way to work after heavy snow.[/caption]


For once I have refrained from running any Gaussian PSF or Edge detector on the poor Buttercup :)

Monday 2 February 2009

CUDA Positions

Apart from the NVidia CUDA employment forum there seems to be a lack of sites where adverts may be placed specific to CUDA development.

I decided to add a page ( for companies to advertise their available permanent or contract roles involving CUDA (or OpenCL) or for individuals and groups to advertise their availability for full time employment or contract work. Development teams specializing in CUDA development are welcome to add their names here too. If you would like your free advert to appear on the page just add a comment or drop me an email. If we get enough I’ll soon update the page to a decent view.

Friday 30 January 2009

Eurographics Symposium on Rendering

The 20th annual Eurographics Symposium on Rendering will take place in Girona, Spain, from June 29th to July 1st, 2009.  (

The call for papers has some interesting topics on the list. I think this could be well worth attending to catch up with the latest developments. I am very tempted to write a paper for it but with the upcoming wedding and huge work load I don't think its feasable.

Coalesced write vs Coalesced read counters in Profiler

In my last post I mentioned that I was still investigating why the CUDA profiler was counting 4x as many Coalesced writes as it did reads. While the profiler's manual does mention that the counters are merely a measure of performance and not absolute it still did not make sense. 

As always the helpful people at NVidia responded to my request almost immediately.

Apparently the GLD gets incremented by 1 when a 32B/64B/128B gld request is sent but the GST gets incremented by 2 for 32B, 4 for 64B and 8 for 128B requests.

This explains perfectly why my counters for GST are 4x higher than the GLD counters.  The time differentials between uncoalesced reads/writes seem to remain which is worth remembering for the reasons mentioned in the last article.

Tuesday 27 January 2009

Investigation of Uncoalesced vs Coalesced Read and Write Speeds in CUDA

While working on my sorting algorithms I came across an interesting scenario. At the moment my still incomplete sort uses two kernels. At the end of kernel A I have the choice of writing out to global memory in an uncoalesced or coalesced manner.  If kernel A writes out in a coalesced way then kernel B's initial read of global memory will have to be uncoalesced. Likewise if kernel A's write is uncoalesced then kernel B's read will be coalesced.

The CUDA documentation does not mention if uncoalesced writes are the same speed wise as uncoalesced reads so in order to get the peak performance of my sorting I needed to run a few tests with different kernels in the CUDA profiler.

I made four kernels, each doing a very simple task:  read in from global memory and store in shared mem then write from shared mem into a different area of global memory. Each kernel reads or writes to or from global memory in a different fashion to ensure coalesced / uncoalesced access for my 8800GT.  Note that on newer cards the coalescing rules are a bit different (see the documentation for details)

Friday 23 January 2009

Sorting Algorithms

In most programs there comes a time when you need to sort your data into some sort of order. Most recently I have needed to do this in both the "collapsing universe" simulation and my real time ray tracing acceleration. Ideally if you can get away without sorting data you should, but often you need to in order to accelerate a later stage of the algorithm thereby making a nett saving in time.

The popular sorting algorithms such as bubble sort, insertion sort, quicksort, radix sort, heapsort etc  range in worst case order times from  O(n log n) to O(n^2) .  The order here is based on the number of element comparisons. And for memory requirements they range from O(1) to O(log n)   (quicksort)  although some not quite so popular algorithms use higher memory requirements.

Thursday 15 January 2009

CFD and our collapsing galactic centre

While doing some more research on CFD and the Lattice Boltzman method I came across the following paper (December 19 - 2008) which covers the basics of the method and an efficient GPU (CUDA) implementation.

Unfortunately I have done little in the way of my own implementation as I have been playing with collapsing our galactic centre. So far I have modified some of my code which usually looks for letters in a scanned document to finding stars in the bitmap. The next stage is to assign a z coordinate to the point based on its size / whiteness value and an x,y based on the original image coordinates and then the collapsing can begin :p   Hardly scientific but particle simulations are always so much fun.

Data Sets

After complaining in my last post about not being able to find freely available real world data sets for tomography I managed to find the following sites:

I've only downloaded one set so far and haven't had a chance to look at it but from the thumbnails on the sites they look good. The Christmas tree model in particular looks like a lot of fun to work on and as a bonus it is huge.

Wednesday 7 January 2009

CFD / Seismic Analysis / Tomography

I'm still very busy playing with ray tracing as there are just so many opportunities to optimize the performance. It also involves a lot of converting seemingly simple serial algorithms to parallel versions - sorts / searches etc. Couple that with a full time job and sometimes I just need a break. A break in my book is defined as "a different project" :p   

Not that I'm stopping work on raytracing and rendering - far from it, in fact the alternatives I'm looking at include a lot of similar processes / algorithms. The three topics in the title are what I'm looking at being largely inspired by the great work done by the guys who attended the Tesla Personal Supercomputer launch last month.

Galactic Centre Photograph

I found this on the Hubble site

[caption id="attachment_339" align="aligncenter" width="400" caption="Galactic Centre"]Galactic Centre[/caption]

They also have much bigger versions available up to 6149 X 2902. (according to it seems that I can post it here) In itself it is pretty remarkable but also provides a rather nice large and complex image for image processing. I have been playing with trying to extract the individual stars and using their intensity / connected region size to specify a Z co-ordinate. With brighter / larger being closer. If this seems odd it is because I dont have the stars real positions and its just for fun anyway :)   Then using these x,y,z coordinates and running a gravity simulation on them.  I know this has absolutely no scientific merit as there is no initial velocity / mass / proper coords / space is probably warped there / etc  but it could look cool...

Monday 5 January 2009

Happy new year!

Happy 2009 to everyone, I hope its a really good one and you get lots of new hardware to play with :)

This year I've got a lot of ideas to play with, and having been given a whole lot of old physics books (mostly particle physics) I may just get around to making some simulations.

Nothing new on the raytracing front yet, although once again I have been playing with some huge models and working out better ways to build (or rather not build) acceleration structures.

More soon...