Amdahl’s law

October 19th, 2009 by Barrett

A few months ago I made a post mentioning how I don’t conform to the Amdahl’s law way of thinking but never went into any details.

The law describes the speedup that can be obtained if you can parallelize a section of your problem. The speedup that can be obtained is described by the following equation:

{\frac{1}{(1-P)+\frac{P}{S}}}

Where P is the proportion of the problem that can be parallelized / sped up and S is the speedup amount.

Assuming that S->infinity  then P/S -> 0  this leave us with {\frac{1}{(1-P)}}

This implies that no matter how many processors / speed improvements we make to the P portion of the problem we can never do better than  {\frac{1}{(1-P)}}   And the biggest % improvement from the baseline comes with low values of S (or relatively low numbers of parallel processors). This result is observed in the field time and again. Very seldom does throwing more than 4 or 8 processors at a problem speed it up any more than the large gains you get from the first 2 or 4 processors.

This equation does expand with multiple P and associated S terms in order to describe a more complex / lengthly problem: (P1+P2+P3 = 100%)

{\frac{1}{(1-P1)+\frac{P1}{S1}}}+{\frac{1}{(1-P2)+\frac{P2}{S2}}}+{\frac{1}{(1-P3)+\frac{P3}{S3}}}

Certain problems where P is large do respond well to the increase in processors these are known as “embarrassingly parallel”, ray tracing is rather a good example of this.

 

So why do I not agree with this if the equation makes sense?

The assumption that only P areas can be accelerated by S and strung together in a serial fashion is rather simplistic.

Why do we have to finish P1 before beginning P2?  Even if the P2 area has dependancies on P1 its rare to have the entire section of P2 to depend on a single result (of course there are cases - reduction kernels etc)

Maybe P3 can overlap P1 and P2, some may benefit by having more processors while others may reach an optimal at two. Why not overlap the sections and supply them with their optimal processing power? This is easy to achieve with Directed Acyclic Graphs (DAG’s) and can even be computed on the “fly” although they do get rather large!

Quoting Amdahl’s law as a reason why no further speed benefits are available in a system is really just showing that thinking is still stuck in serial mode with little bursts of parallelism thrown in.  Lets starting thinking parallel in all areas and make the most of all available compute resources.

Fermi

October 1st, 2009 by Barrett

I’ve just completed reading the white paper released by nvidia which you can find here.

Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.

Concurrent kernel execution and faster context switching will allow, with the use of DAG’s, the optimization of execution on the devices rather than just working out the most efficient order of kernels to execute sequentially.

Also tucked away in the white paper is the mention of have predication at the instruction level which should give greater control of divergent paths in your kernels.

The inclusion of C++ support will appeal to a lot of people but am I rather unconvinced this is the correct way to go for throughput computing as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.

There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.

All in all its a wonderful development and I can’t help feeling that computing took a substantial leap forward today.

Work and Spam…

September 25th, 2009 by Barrett

Yes I am still alive… :)    I’ve been exceptionally busy at work lately, not complaining though as work pays the bills and gpu / compute things are just a hobby.

What I do find rather frustrating is in the limited time I have to maintain the blog, it all gets eaten up by wading through mountains of spam. Akismet does a fantastic job, but on the forums side the bot detection captcha is pathetic and I have around 100 signups / 100’s of messages to check every day. So until a later date when I can fix / upgrade it I will be suspending the forums from today.

On a rather more positive note:  Nvidia has announced OptiX ray tracing engine. Some examples are out already so worth having a look at.  Nvidia has also announced Nexus which looks incredible. I have heard that they may charge for it but with the features it apparently has I would be willing to pay.

For some of the latest conferences papers and pre-prints have a look at Ke-Sen Huang’s page which he keeps updated with some really nice stuff.

As for me - no new cuda / compute development for a while. When I get a few mins free I’m tending to work on the maths for the stuff I want to do but never get enough time to implement anymore. Hopefully this will change in 2 or 3 months once work settles down a bit. On the subject of work there have been one or two questions over email and on the blog of the steps taken after an edge detector has been applied - unfortunately I cannot answer this as it gets a bit close to what I do for work (OCR and document processing). Rather interesting one of my ray tracing acceleration algorithms which turned out pathetically for its purpose has been rather good at identifying features in 2d/3d images but more on this at a later date once its a bit more complete.

3D Gaussian Convolution

July 30th, 2009 by Barrett

There hasn’t been much in the way of posts here lately as I’ve been really busy at work getting some new components built into the systems I work on. Not really hard but it’s frustrating things like trying to get various components and libraries written in different languages to work together. So lately I’ve not had the energy to do much work on the computer once I get home…

There has been a bit of interest in my 3D Gaussian convolution kernels. Although I explained the technique mathematically in an earlier post I never actually posted the code. As it is rather quick and quite a novel way of calculating the convolution for a xy plane I decided to post it so everyone can benefit from / improve upon the technique.  As always comments / bug reports etc are always welcome :)

Read the rest of this entry »

Happy blog day!

July 23rd, 2009 by Barrett

My little blog is officially one year old today! :)

Over the last year I’ve made many more posts than I was originally planning to make and judging by the number of subscribers and variety of institutions and people who visit it’s not entirely all rubbish, or so I like to tell myself.

Ongoing projects that will hopefully get completely or improved in the coming year are:

Thermal monitor - networking coming soon

More raytracing (of course) - linking PhysX to it (thanks to Timothy’s post here for inspiration) to improve on my rather primitive bouncing balls.

SPH and Lattice methods for fluid simulation

Sorting algorithms - I haven’t posted much on this lately but some decent results so far

Massive Data set processing.

Anything else that grabs my interest

Throughput Computing

July 21st, 2009 by Barrett

Two weeks ago I attended a seminar at the Daresbury laboratory near Warrington. Unfortunately I only had the day off so couldn’t attend the workshops on offer and buttercup, my trusty landy, had to drive there and back on the same day. Sorry to everyone on the m4, m5 and m6 that day :p

Jack Dongarra presented a rather nice summary of supercomputing / parallel computing or as he called it “Throughput Computing”.  I had never heard this term before and now really like it. It implies that you should make the best of available resources in order to maximize throughput. It’s very easy to get into a mindset where everything has to be in parallel but it’s possible by doing so you neglect to notice that some parts work very well in serial.

Read the rest of this entry »

Numerical Precision

July 3rd, 2009 by Barrett

Numerical precision is an ongoing concern of mine especially in big / long running simulations and solvers.

I came across an article by Rob Farber on the scientificcomputing.com site this morning that asks the question “How much is Enough?”.  Although no definitive answers are presented the author summarizes the current and future concerns over accuracy.

Personally I don’t believe floating point is the way forward. Floating point is fast to calculate in hardware but is not always an ideal way of representing numbers. Although the various branches of mathematics are largely base independent humans are most comfortable with base 10 while computers are of course most comfortable with base 2. This does result in some situations when a calculation in base 10 with only a few decimals of precision gives precise results whereas a calculation in base 2 is incapable of giving a precise result even given N bits of precision although the result is probably acceptable after n bits.

I’m not presenting any solution to the precision problem, but merely pointing out that sometimes the issue is caused by:   using base 2 for calculations  and/or  the floating point representation of these numbers.

BV2 Thermal monitor v0.11

July 1st, 2009 by Barrett

Here is a small update to the Thermal Monitor which implements the “always on top” feature. This can be enabled and disabled by right clicking on its little system try icon.

This is not the point release as planned as some of the additional features were appearing a bit unstable on my Win XP pro 64 machine. Until I find time to sort them out I thought I would release the most requested feature.

The source code is included. It is still in masm and does include routines for non-standard window shapes and transparent blts etc - could be worth a look if you are interested in masm32.

It is now available in the downloads section or here:

[Download not found] - Please see Licenses Section for license details. By downloading you agree to be bound by the terms of the Creative Commons Attribution Share Alike license.

Glass half full or half empty?

June 25th, 2009 by Barrett

Or as this is a hpc/cuda/parallel processing site:

Gustafson’s Law or Amdahl’s law?

Personally I prefer Gustafson’s Law …. it seems more logical to me or is this just because I’m inherently an optimist?

I would be quite interested on hearing your views on this - so comments /forum posts most welcome.

In other news:  The thermal monitor downloads have gone over 80! :)  The updated version (v0.2) is ready after mucking around with subclassing a control…. I will release it soon…

Otherwise I have been extremely busy on debugging a sparse matrix solver - bugs in huge datasets can be hard to find! Even with the aid of Gold / Sivler / Bronze kernels mentioned in the last post they have been proving remarkably tricky to isolate. Rather surprising to me is the fact that the long long data type doesn’t consume a lot more processing time than a normal unsigned int - so I have been using that wherever there is a risk of exceeding 2^32.

CFD code coming soon too - although I have unwound a lot of the optimizations in order to make it easier to understand and possibly be a good foundation for your own optimizations.

Right …. back to the grindstone!

Gold, Silver and Bronze

June 22nd, 2009 by Barrett

Kernels of course! :)

Most of the readers of this blog should be familiar with a “Gold” kernel in which your data is processed on the CPU (usually) and the output is carefully checked. This kernel and its associated outputs form the basis of the regression testing of subsequent implementations on the GPU including algorithmic optimizations.

Personally I like most of my gold kernels to be naive implementations of an algorithm. This causes them to be  easily verifiable and usually easy to debug if there is a problem.

If you currently don’t implement a Gold kernel before writing your CUDA implementations and/or adapting you algorithm I strongly suggest you do.

The purpose of this post is to suggest two other debugging techniques I use when needed and where possible. I call them my Silver and Bronze kernels.

A Silver kernel is implemented on the GPU without any optimizations or algorithmic enhancements. The grid / block structure is as simple as possible making sure we don’t vary from the Gold kernels implementation too much - only unwinding the loops into grid/blocks is allowed where possible. This type of kernel I use when I am writing something that depends on numerical precision. Once written and verified within acceptable numerical limits against the Gold kernel it becomes the new baseline kernel before later optimizations. This allows exact matching of later kernel outputs rather than using an “acceptable deviation” approach.

Read the rest of this entry »