ComputeCube

3D Gaussian CUDA Source

2017-06-17T14:30:00.000-07:00

The below code is meant to accompany the 3D Gaussian Convolution post. Note that the code is not generic and only calculates a 5x5x5 Gaussian on a set of 256x256 planes. It is very easy to modify to support other sizes and could even be templated. The Z convolution is not included in the code, but remember in this step you would need to do the final division.

As blogger doesn't seem to like keeping the format of code, please find it on GitHub https://github.com/Repmov/3D-Gaussian

ReMarkable Contract Negotiation

2017-06-16T08:32:00.000-07:00

I mentioned in a previous post that I've pre-ordered a ReMarkable, mostly for scribbling down ideas and reading technical documentation.

It occurred to me that these pads could be ideal for contract negotiation. Many years ago we looked at the possibility of using iPads for this purpose but the supporting technology / cloud infrastructure didn't really exist to the extent it does now.

ReMarkable don't currently offer handwriting recognition but they do have wifi connections and hopefully some form of SDK. In the insurance markets a lot of contracts are still negotiated on paper in a face-to-face manner. For example, we see all sorts of documents where the risk has been apportioned by simply adding handwritten percentages to a document as the parties have negotiated in person.

Picture one of these ReMarkables with the Exari technology stack: In a simple scenario custom contracts could be created on the screen in real time and signed by the parties right on the ReMarkable pad with its pen then sent off to the cloud for capture into our Universal Contract Model.

A more complex use case could be two parties, each with their own ReMarkable, negotiating on their own version of a contract. Exari capture, match, and analysis technologies would operate in the background, highlighting changes, showing areas of risk and populating new clauses until the eventual agreement, signing and capture of data into the Universal Contract Model.

I really can't wait to get one of these devices now and hopefully an SDK to go with it!

XMG Walker

2017-06-15T06:02:00.000-07:00

Finally a "solution" for the VR tethering issue.

http://walker.xmg.gg/en/

Going to need some sort of proximity sensor with that else there are going to be a lot of bruises!

reMarkable

2017-06-09T03:55:00.001-07:00

After weeks of procrastination I pre-ordered a reMarkable. With any luck it will live up to the hype and pre-production reviews.

What eventually convinced me? Well, once again leaving some notes I was working on at home...

I'm a little concerned about the latency (55ms) which seems a bit high, but the convenience of not having to manage stacks of notepads whilst also incorporating a PDF reader will hopefully mitigate the potential latency issue.

Will review once I get it in October sometime - long wait!

https://getremarkable.com/

Exari acquires Adsensa

2017-06-08T15:28:00.000-07:00

Adsensa, the company I joined when it was a startup over ten years ago, has been acquired by Exari Systems. As this is a personal blog I rarely, if ever, comment on my job. I'm breaking from tradition here as it is a very exciting and complementary deal between the two companies.

Here is the official press release: Exari acquires Adsensa

We are now part of a truly global company with a substantial increase in engineering resources and due to the geographic diversity we can offer improved support to our clients.

The mature Exari workflow should make an immediate improvement to the Adsensa products once integrated. Our industry leading capture, match and analysis tools will enable Exari clients to process legacy documents or documents they receive during the various phases of contract negotiation.

From a development perspective we are very excited about Exari's Universal Contract Model. This is something rather groundbreaking and, speaking from an Adsensa perspective, we are looking forward to integrating our technology to populate the contract models.

It is rather pleasing to see something we have worked so hard on for so many years becoming part of something even bigger and making a real difference to the operations of our clients.

2D Gaussian Derivation

2017-06-07T02:23:00.000-07:00

I'm currently working on some image manipulation that requires a Gaussian Point Spread function that isn't uniform in the x and y directions, so I thought it's worth revisiting the derivation from an older blog post along with some thoughts on optimization:

In one dimension the Gaussian function looks like:

\$f(x) = Ae^{- \frac{(x-b)^2} {2\sigma^2} } \$

where \$e\approx2.718281828 \$ is Euler's Number, and b is the point over which the bell curve will be centred. You should recognize the \$(x-b)^2\$ as the first step in calculating the distance between two points. A is the amplitude of the function. The bigger A is, the higher the peak produced.

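As can be seen from this equation, \$\sigma\$ controls the spread of the bell shaped curve produced. If it's not immediately obvious, keep in mind that as you divide by a bigger number the fraction gets smaller, so the exponent moves closer to zero, and \$anything^0 = 1 \$. A larger \$\sigma\$ therefore keeps the curve closer to its peak value A for longer as you move away from b.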

Now that we understand the function in one dimension let's extend it to 2 dimensions, after all that's what I am interested in for my image manipulation.

It is very simple to extend the function to 2 dimensions as we are really looking at the distance of a point from a centre location. For now let's assume our \$\sigma\$ (curve spread) is the same in each direction and that our centre point is \$b_x,b_y\$

\$f(x,y) = Ae^{- \frac{(x-b_x)^2+(y-b_y)^2} {2\sigma^2} } \$

Again you should see the \$(x-b_x)^2+(y-b_y)^2\$ as the distance of the point from the centre point - we are just missing the square root.

This is the equation we used to calculate our Gaussian PSF kernel, for this example we are going to use the following parameters:

\$A=15\$ and \$\sigma=1.4\$

We then take the points from -2 to 2 (5 points in each direction, with the centre at 0,0) and plug them into our equation as x,y values. The resultant value is rounded and stored in our matrix.

For Example: x=2, y=2

\$f(2,2) = Ae^{-(\frac{2^2+2^2}{2(1.4^2)})}\$

\$f(2,2) = Ae^{-(\frac{8}{3.92})}\$

\$f(2,2) = Ae^{-2.0408}\$

\$f(2,2) = 15*0.1299\$

\$f(2,2) = 1.9488\$ which can be rounded up to 2 and this is the value we store at 2,2 in our kernel.

Note that as we are squaring our differences from the position to the centre in each direction the values at: -2,-2 ; -2,2 ; 2,-2 are all the same as the 2,2 one calculated above. We can use this as an optimization in calculating our kernel coefficients.
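A minimal C++ sketch of that symmetry optimization, assuming the centre sits at 0,0 and the coefficients are rounded to the nearest integer. The names gaussian2d, makeKernel and Kernel5 are my own for illustration, not taken from the original code. Only the 3x3 quadrant with x >= 0 and y >= 0 is evaluated and each value is mirrored into the other three quadrants:

```cpp
#include <array>
#include <cmath>

// 5x5 kernel of rounded coefficients; index [y + 2][x + 2] holds f(x, y).
using Kernel5 = std::array<std::array<int, 5>, 5>;

// Isotropic 2D Gaussian centred on the origin (b_x = b_y = 0 is assumed).
double gaussian2d(double x, double y, double amplitude, double sigma) {
    return amplitude * std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
}

// Evaluate only the quadrant x >= 0, y >= 0 and mirror it into the other three.
Kernel5 makeKernel(double amplitude, double sigma) {
    Kernel5 k{};
    for (int y = 0; y <= 2; ++y) {
        for (int x = 0; x <= 2; ++x) {
            int v = static_cast<int>(std::lround(gaussian2d(x, y, amplitude, sigma)));
            k[2 + y][2 + x] = v;  // +x, +y
            k[2 + y][2 - x] = v;  // -x, +y
            k[2 - y][2 + x] = v;  // +x, -y
            k[2 - y][2 - x] = v;  // -x, -y
        }
    }
    return k;
}
```

Calling makeKernel(15.0, 1.4) evaluates the exponential 9 times instead of 25. The four corners come out as 2, matching the worked example above, and the centre is 15.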

As mentioned above the \$\sigma\$ value can vary for x and y directions. This causes our 2D curve to be stretched/compressed in the x or y direction.

Our equation then becomes:

\$f(x,y) = Ae^{-( \frac{(x-b_x)^2}{2\sigma_x^2}+\frac{(y-b_y)^2} {2\sigma_y^2} )} \$

Knowing about our one dimensional Gaussian function we can clearly see how the above function works: the two component directions are calculated first, added together, then the negated result is used as the power to which we are raising e.
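Written out as code, the equation might look like the sketch below. The function and parameter names are hypothetical, chosen only to mirror the symbols in the formula:

```cpp
#include <cmath>

// Anisotropic 2D Gaussian: separate spreads along x and y, centred at (bx, by).
double gaussian2dAniso(double x, double y, double amplitude,
                       double bx, double by, double sigmaX, double sigmaY) {
    double dx = x - bx;
    double dy = y - by;
    double ex = (dx * dx) / (2.0 * sigmaX * sigmaX);  // x component
    double ey = (dy * dy) / (2.0 * sigmaY * sigmaY);  // y component
    return amplitude * std::exp(-(ex + ey));          // add, negate, raise e
}
```

With sigmaX equal to sigmaY this reduces to the earlier equation. Making sigmaX larger than sigmaY stretches the curve along x, so a point two steps away along x keeps a higher value than a point two steps away along y.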

Once we have calculated our kernel coefficients we can apply them to our image. Remember that this is a separable convolution, so don't implement it in the trivial manner by reading in a 5x5 block around your pixel of interest, multiplying, adding and finally dividing. Rather, apply the x and y convolutions separately: for a 5x5 kernel that is 5 + 5 = 10 reads per pixel instead of 25 (2.5 times fewer), and these reads will be in a more cache friendly manner.
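A rough CPU sketch of the two-pass approach in C++. It makes several assumptions of my own: a row-major float image, normalised floating point 1D weights rather than the rounded integer kernel, and clamping at the image borders. All names are hypothetical:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Normalised 1D Gaussian weights for taps -radius..radius.
std::vector<float> gaussianTaps(int radius, float sigma) {
    std::vector<float> w(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        w[i + radius] = std::exp(-(i * i) / (2.0f * sigma * sigma));
        sum += w[i + radius];
    }
    for (float& v : w) v /= sum;  // the division, done once up front
    return w;
}

// One 1D pass over a row-major image; (stepX, stepY) selects the direction.
std::vector<float> convolve1d(const std::vector<float>& src, int width, int height,
                              const std::vector<float>& taps, int stepX, int stepY) {
    int radius = static_cast<int>(taps.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int t = -radius; t <= radius; ++t) {
                // Clamp at the borders (an assumption; other edge policies work too).
                int sx = std::min(std::max(x + t * stepX, 0), width - 1);
                int sy = std::min(std::max(y + t * stepY, 0), height - 1);
                acc += taps[t + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = acc;
        }
    }
    return dst;
}

// x pass followed by y pass; each direction can have its own sigma.
std::vector<float> gaussianBlur(const std::vector<float>& img, int width, int height,
                                float sigmaX, float sigmaY, int radius = 2) {
    std::vector<float> rows =
        convolve1d(img, width, height, gaussianTaps(radius, sigmaX), 1, 0);
    return convolve1d(rows, width, height, gaussianTaps(radius, sigmaY), 0, 1);
}
```

Because the weights are normalised before use, no final division is needed here. With integer coefficients you would instead divide by the sum of the weights after the last pass.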

CUDA 9

2017-05-31T01:20:00.001-07:00

NVidia announced CUDA 9 a few weeks ago. I've been using CUDA since v1.1 and compute capability 1, and things have matured significantly over the years.

The new CUDA adds support for the new Volta architecture, C++14, faster libraries and Tensor core matrix multiply, which is clearly targeting deep learning applications. But, for me, there is one stand out feature: Cooperative Groups.

The release says that it is a new programming model for managing groups of communicating threads. What does that really mean?

Previously you could synchronize threads across a thread block with the __syncthreads() function. Cooperative groups allow you to define groups of threads at the sub-block and multi-block levels and to synchronize across the entire grid.

The grid sync means you no longer have to have multiple kernels operating in successive launches in order to complete a complex task on a data set. A single kernel can now operate on the data using something like:

grid_group grid = this_grid(); // both live in the cooperative_groups namespace

//do something here

grid.sync();

//do something else here

//etc

You also get a this_multi_grid() variant which will synchronize the kernel across all GPUs it's been launched on!

You no longer have to wait for the sync at the end of a kernel launch and launch another kernel from the CPU code. Presumably you will still be restricted by the timeout on the driver for your primary device.

This coupled with the pinned memory / zero copy means you can have long running kernels running all sorts of operations on memory that can be streamed into the device.

Exciting stuff!

The blog springs back to life.

2017-05-22T07:50:00.001-07:00

This last weekend I was going through some old hard drives whilst cleaning up / getting rid of old hardware and discovered a backup of the old blog.

Not everything was there but most of the images and some of the old code. So I've been updating the old posts and changing the html so that blogger renders the equations.

Nothing new yet but I've been busy in the last few years and have quite a bit of stuff I'll release here.

Anyone still interested in the GPU Thermal Monitor, which unbelievably still works on Windows 10, can find it on GitHub here:

https://github.com/Repmov/GPUHeatMonitor

There is a slight issue with the background bitmap not always displaying on the new version of Windows but the functionality is still there.

3rd UK GPU Computing Conference

2011-12-13T08:21:00.000-08:00

Quick reminder that the 3rd UK GPU computing conference is tomorrow (14th Dec 2011). From their web site it looks like there are a few spots available in case you haven't already booked.

I'm looking forward to seeing what everyone is up to in the GPU space here in the UK. Feel free to come say "Hi" tomorrow, always fun to chat and network.

3rd UK GPU Computing Conference

2011-10-28T06:10:00.000-07:00

I keep forgetting to post this one:

3rd UK GPU Computing Conference

I went to the 2009 one in Oxford but missed the 2010 in Cambridge as I just couldn't face the journey in my Landy.

Well worth attending as you get a glimpse of what everyone else is working on in the GPU field. A great source of inspiration for new projects.

The deadline for abstract submission is 18th November.

64-bit ARM server processor

2011-10-28T01:23:00.000-07:00

This is very exciting news. A 64bit quad core chip that supports out-of-order execution will, in my opinion, turn the server market on its head.

When these get released I wouldn't even consider using anything else for my server needs. Coupled with the new stuff NVIDIA is doing with ARM or even ARM's own Mali GPU stuff and you will have a very powerful, low power consumption server.

From what I recall from an interview with the UK's Intel boss on BBC breakfast TV a month or two ago (I can't seem to find the link), he all but admitted they had already lost the mobile market and were going to concentrate on Ultrabooks but said that they still pretty much controlled the server market.

This news should really shake them up a lot. Intel in many respects has been a victim of its own success and has got locked into an ancient architecture (x86). Still, don't count them out yet: in the past they did come up with some good RISC chips but the market just wasn't there at the time. We may see them popping out some new architectures in the coming months. They certainly have the manufacturing processes in place and lots of very bright people on board. And as I've written before their compilers are superb.

Read more here at: eetimes.com

EvoPar 2012 Call for papers

2011-10-25T02:15:00.000-07:00

I received an email this morning which may be of interest to some of you in the GPU and GP space:
Just posting an extract here, you can see more on their site.

Part of Evo* 2012, the main European events on Evolutionary Computation:
EuroGP, EvoCop, EvoBio, EvoMusArt and EvoApplications -

11-13 April 2012 - Malaga, Spain
http://www.evostar.org/
EVOPAR: Track on Parallel and Distributed Infrastructures

Submissions are invited on (but not limited to) the following topics:

- Optimization of parallel architectures by means of Evolutionary Algorithms.
- Hardware implementation of EAs, including Field Programmable Gate Arrays
(FPGA), GPU, games consols, mobile devices.
- GPGPU optimisation (CUDA, AMD, ARM, OpenCL, etc., etc.).
- Improving scheduling techniques for peer-to-peer (P2P) and
grid systems or for running distributed EAs and GAs.
- Improving fault tolerance techniques for distributed systems and
distributed EAs capabilities for coping with failures.
- Analytical modelling and performance evaluation of parallel and
distributed infrastructures when running EAs.
- Improvement in system performance through optimisation and tuning.
- Case studies showing the role of parallel and distributed
infrastructures in conjunction with distributed EAs when solving
hard real-life problems.
- Parallel and distributed implementation of genetic algorithms.

IMPORTANT DATES

Submission deadline: 30 November 2011
Notification of authors: 14 January 2012
Camera-ready deadline: 5 February 2012

Track Organisers

F. Fernandez de Vega, University of Extremadura, Spain
W. B. Langdon, University College London, UK

New blog home

2011-10-25T01:57:00.000-07:00

BV2 has a new home on blogger (google) as I just couldn't be bothered to run my own backups, email hosting etc etc anymore and the expense of hosting my own server was getting a bit excessive.

The blog has been renamed to ComputeCube as the idea is to merge the two separate blogs into one, both links should point here.

The automated wordpress->blogger import tool did a good job but some of the image links have broken. I will be working on fixing these. Just to be clear my move to blogger has nothing to do with wordpress. I still think it's a really good blogging system, just at the moment my needs are better served here.

Google now pretty much hosts everything of mine. Email, blog and google+ for the social networking thingies. In my push to move with the times I even have a twitter account now (ComputeCube) which contains personal, blog and GPU related stuff.

On the GPU front: I'm still running 2x 8800GT's and a Tesla C1060 but am considering getting a 560Ti as they seem to offer the best price/performance ratio and allow me to use the new CUDA features.

The server crash did knock the wind out of my blogging sails somewhat but since my last post I've not been idle and have implemented a lot of my ideas in CUDA, watch this space for more details.

Server Crash

2010-07-24T17:31:00.000-07:00

Unfortunately about 36 hours ago the hard drive in my web server decided to end its little spinning life. Although the site has been restored from a couple of backups some of the file download links are not working.

The files are still available but I need to fix the download links - please be patient :)

For anyone running download monitor out there, the newer update changed the DB table names. Don't do what I did or rather didn't.... and forget to change the SQL backup scripts to reflect these changes.

PC Design Lab - new case

2010-07-07T03:35:00.000-07:00

I just got an email update from PC Design Lab regarding their new case. Those of you who follow the blog will know my ComputeCube machine is built into their QMicro2 case, which has been really good with only one or two tiny niggles. Although in fairness they are caused by the amount of power cables and the heat emitted by the Asus Rampage 2 Gene northbridge arrangement and the Tesla C1060.

The new case looks good and they have adopted the suggestions from their clients. You may have a look at the new pre-order case here. Even though they have raised the cage I would have liked it to be slightly taller to help with airflow over the GPUs and power cable routing. Strangely they mention it can now support 750w power supplies, but I have been running a 1250w one in the older case for a while now with no problems.

The radiator bracket is a really good idea and would have helped sort out the Rampage 2 Gene's overheating northbridge rather nicely.

If you are looking for a SFF case in my opinion there is nothing better out there and this new model raises the bar even further. Now if I could only get my hands on the new one and a watercooling kit :)

Windows 7

2010-07-06T08:29:00.000-07:00

While ordering a replacement power supply for one of my machines I decided to add a Windows 7 to the shopping cart.

This was in no small part prompted by the NVidia parallel nsight addin for visual studio which would not work on my XP machines. An additional contributing factor was the lack of SP3 for Windows XP Pro x64. Of course the new office 2010 will only run on XP SP3, Vista and Windows 7.

Having had a brief and scary encounter with Windows Vista I am pleasantly surprised with the new version of Windows. Microsoft have done a good job with this one.

The installation process had to be a "clean" one as there is no upgrade path from winXP x64. I took a full backup of my "ComputeCube" machine and without bothering to format the C drive just popped the DVD in and rebooted.

The installation process was completely painless and apart from one reboot occurring too quickly for me to remove the DVD from the tray it was all installed within 15 to 20 mins. The windows 7 installer actually makes a backup of your old windows and program files directories so there was no need to go and fetch stuff back from my backups.

I did lose my dual boot option for my Linux but lately I've been running them all inside VM's so I'm not concerned by that at all. If you are looking for good VM software: Sun VirtualBox seems to tick all the boxes and even has an API.

After the first login I was happy to see it had picked up all my hardware including the Tesla C1060. The only thing it had got rather wrong was the IP address of the gateway - it had missed by one.... weird.

I've been using it since Saturday now with visual studio and office 2010 and apart from one frozen file copy dialog (which rather surprisingly could be end-tasked without crashing explorer...) it has performed flawlessly.

Windows 7 also allows you to use DirectCompute on your GPUs. I've not quite got to grips with it yet but it seems quite functional. I'll probably stick to CUDA and OpenCL for now, much like I prefer OpenGL over DirectX - I just don't have the time to learn all these technologies and think it makes a bit more sense to stick to the cross platform ones for now.

Tip: When trying to register com components via the command prompt, make sure you have selected "run as administrator" even if you are logged in with admin rights.

In summary Windows 7 installation is easy and thereafter it does what it says on the tin. What more can you ask from an OS?

Site Changes

2010-06-30T04:09:00.000-07:00

Yes..... I sold out. There is now a google adwords space on the top right on the sidebar and at the bottom of individual posts. I went for what I think is the most unobtrusive design they offered.

Google have emailed me adwords related stuff for ages and I have resolutely resisted as this site is in no way a marketing site. However, in the last few months the site and related equipment failures (power supply, hard drive, mail server etc) have cost me quite a bit and hopefully this will help me recoup some of that.

Looking at what they seem to be placing as ads on the pages I have been starting to wonder how good their contextual advertising is... still it's early days.

In related site news the forums are now permanently removed, although there are still some links left over as I write this. With the amount of spam / hacking etc it's just not worth maintaining a forum on a small site like this.

Catch up

2010-06-22T04:02:00.000-07:00

No apologies for the long delays between posts, or even checking the blog. It just has to fit in with life at the moment and there is so much going on!

Those of you who have left comments on the blog and emails for me should have got an answer last night or this morning. A bit of a delay for some of you. I do apologise to those of you who sent an email and haven't got a reply back. My computer that used to handle all my email died and I haven't had a chance to fix it yet. I think it's just the power supply.... hopefully! So I will get the emails back in a few weeks when I eventually get around to fixing it.

I've subsequently upgraded my email server - so any forthcoming mails will get to me.

In development work I've been working on my SPH simulations, and some GP stuff whenever I get a chance. GP stuff is traditionally recursive - well, the equation trees are anyway - and these have needed a substantial amount of reworking to run efficiently on the GPU.

Speaking of recursive.... in order to be Turing Complete (assuming infinite memory for now) do you need to support / include recursion? Some posters on certain forums seem to think it is needed, but personally I can't see why. Most recursion with a bit of effort can be made iterative - although possibly not very pretty or efficient.

For example, a GPU doesn't really support recursion*, but I would consider the CUDA / GPU combination as Turing complete. Admittedly not very efficient in certain cases - single thread for example. And again ignoring the infinite memory issue. *You can if you implement your own stack type system in global memory...

I'd be interested in knowing others' views on this - email the normal place or comment here :)

To all the regular readers of the blog - is anyone else amazed by the absolute explosion of GPU / CUDA related code / products / hardware? Very exciting indeed!
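As a small illustration of recursion turned into iteration, here is a C++ sketch that evaluates an equation tree which has been flattened into postfix order, using an explicit fixed-size stack instead of recursive calls. The Token encoding, the operator set and the 64 entry stack limit are all assumptions of mine for the example, and the program is assumed to be well formed:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical encoding of a GP equation tree flattened to postfix order.
enum class Op { Const, Var, Add, Sub, Mul };

struct Token {
    Op op;
    float value;  // used only when op == Op::Const
};

// Evaluate without recursion: operands go on an explicit stack,
// each operator pops two values and pushes one result.
float evalPostfix(const std::vector<Token>& program, float x) {
    float stack[64];  // fixed size, as a per-thread stack on a GPU would be
    std::size_t top = 0;
    for (const Token& t : program) {
        switch (t.op) {
            case Op::Const: stack[top++] = t.value; break;
            case Op::Var:   stack[top++] = x;       break;
            default: {
                float b = stack[--top];  // right operand was pushed last
                float a = stack[--top];
                float r = (t.op == Op::Add) ? a + b
                        : (t.op == Op::Sub) ? a - b
                                            : a * b;
                stack[top++] = r;
                break;
            }
        }
    }
    return stack[top - 1];
}
```

The expression (x + 2) * (x - 1) becomes the token sequence x, 2, Add, x, 1, Sub, Mul, and the loop walks it once from left to right. The same idea carries over to a kernel if the stack lives in local or global memory.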

SPH Screenshot

2010-03-19T05:45:00.000-07:00

Finally the promised screenshot :)

SPH with symmetry

It's not all that impressive to look at as I've restricted all the particles to 2d although it does use 3d calculations. I do this to help look for any issues in the code as I find it hard to spot errors in a 3d particle rendering.

This particular screenshot has 64000 particles that have been dropped into the box in a column formation and are now starting to slosh around at the bottom.

The unusual thing with regards to a CUDA implementation is that it is using symmetry in the interactions thereby decreasing the memory/processing load. I've still got more work to do but it's showing a lot of promise in running superfast particle interaction simulations.

I've also been doing a bit of work on the second version of my raytracer. I've once again stepped away from KD-trees and Octrees and am using a type of BVH, ray marching system. Screenshots once I have a decent scene rendered :)

In other news I'm now compiling all my new C++/CUDA code in 64bit with the CUDA 3.0 beta. Although I think putting in c++ object support into CUDA was a mistake the new version does produce decent code.

Poor neglected blog...

2010-02-26T09:39:00.000-08:00

Nearly 3 months since my last post :(

Work has been exceptionally busy: In the last two months on top of my normal product maintenance and improvement duties I have prepared and filed a patent application, architected and largely completed a distributed, resilient document processing framework and found a bit of time to eat and sleep!

I've noticed other blogs in the raytracing / graphics / visualization space have been very quiet lately - maybe everyone else is also working like crazy?

Not a huge amount has happened in my raytracer and SPH projects although got some interesting effects running with a non-uniform mass particle system when I had time over Christmas. Screenshots soon.

I do have the beta release of Nexus (the NVidia Visual Studio plugin) but sadly it only runs on Windows Vista or Windows 7 which leads nicely on to my next point:

I am a bit irritated with Microsoft for two reasons: Even though I purchased a 64 bit Windows XP professional about 6 or 8 months ago there is no upgrade path to Windows 7... Secondly even though visual studio 2008 standard has a switch for openMP it doesn't contain the openmp headers. Only the more expensive professional version does. Not something that was immediately obvious from the documentation before I purchased...

Although I also run Linux (centos) I prefer to develop on a Windows GUI - less buggy and more responsive than gnome / kde in my opinion. For running code the Linux os does usually win though! I would really like to run Nexus so am a bit stuck about what to do.... Succumb and buy Windows 7 and get Nexus on Visual Studio? or just forget entirely about Windows development / environment and use Linux / gcc / Intel compilers instead? While the Intel compilers are great (if a bit expensive) for an IDE I really do like Visual Studio. Most of my code is cross platform and for graphics I mostly use openGL so could switch without too much trouble... But direct compute is so tempting.....

Arrrrgh what to do!

Mandelbulb

2009-12-03T02:37:00.000-08:00

I noticed this linked from both Atom and Real-time Rendering blogs. Cyril Crassin, the guy behind the amazing gigavoxels raytracing, has got a 3d Mandelbrot fractal rendering in real time.

As we know there isn't actually a 3rd dimension to the imaginary plane so some manipulation is required. The chap who discovered a good way of transforming it to 3d ( the Mandelbulb ) has a website which you can find here. Well worth reading and some pretty amazing images!

As a side note: I've owned the book: "Real-time Rendering" for a number of years now and it is an invaluable resource. The Real-time Rendering blog mentioned above is the blog by the authors of the book.

C / C++ and STL

2009-10-28T06:09:00.000-07:00

Before everyone gets really upset with the rest of this post, as is the trend in the OO community... I thought I'd start, rather than end, with a disclaimer: I use C++ and STL on a daily basis in my job, although I don't use all of what stl has to offer it does make coding in c++ much easier. C++ in itself does allow fairly elegant code (if constructed carefully) whilst providing a decent level of code performance. So I do actually like C++ and stl and they make my life at work much better :)

But this blog isn't about my day job.... It's about my tinkering with the wonderful world of parallel algorithms and CUDA code.

What a lot of people don't realize is that you *can* use stl, c++ classes and templates in a .cu file. As long as it's client side (host) code you should be fine. I've had a few compiler crashes when using stl, especially the sort. To sort this out I overloaded the < operator in the class being sorted; don't try to define a custom < method and pass it in, as it will crash the compiler.
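For reference, this is roughly what the overloaded operator approach looks like in plain host side C++. The CellEntry record and its fields are hypothetical, and the sketch only shows the pattern, not the compiler behaviour described above:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record; the name and fields are for illustration only.
struct CellEntry {
    int cell;      // grid cell index
    int triangle;  // triangle index within the model

    // Ordering defined on the class itself, so std::sort needs no comparator.
    bool operator<(const CellEntry& other) const {
        if (cell != other.cell) return cell < other.cell;
        return triangle < other.triangle;
    }
};

// Sorts by cell first, then by triangle index within each cell.
void sortEntries(std::vector<CellEntry>& entries) {
    std::sort(entries.begin(), entries.end());
}
```

Because std::sort falls back to operator< when no comparison function is supplied, the call site stays a plain two argument sort.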

I was lucky enough to have Monday off so managed to find a bit of time over my extended weekend to do a bit of coding on the GPU Thermal Monitor and my ray-tracer. I've had drawings and code snippets written for my Proof of Concept ray tracer for a while now and just not had the time to implement them. For simplicity of debugging (come on, release Nexus!! ) I decided to implement my idea on the CPU and for speed of coding decided to use c++ and stl - after all it has served me well in the past.

My idea is highly parallel, after all ray tracing is a trivially parallel problem, and will eventually use persistent kernels like my cross-bridging thing did.

I got stuck into coding and made the first in a long string of poor coding decisions. I decided to make my Rays into a class along with a vector and point class. After all there is a fairly limited set of operations on a ray and I could overload an operator to move t units along the ray thereby keeping the code nice and simple.

I read in all the triangles from my Stanford bunny model into a triangle class and assigned them into my modified grid data structure (also a class) using push_back on stl vectors (I prefer to use a .resize and index into them rather than using a 3d vector). From there a stl::sort got them in the order I wanted within the grid cells.

I now generated all the rays (rayclass) in the traditional manner from the eye through the viewport and ... yes... assigned them to a stl vector.

After a bit of care making sure the threads behaved in accessing the data structures I was done.

Good programming practice so far? In a purely OO / readability sense then yes. Nicely overloaded operators and hierarchy of classes including a few more support classes I have not mentioned. And it worked first try - apart from having to adjust the bunny position.

Success? Proof of concept working?

Er.. no :( In deciding to make the rays and other things a class I had inadvertently scuppered my whole idea. What I was wanting to do is group and process rays in packets based on position and direction in the grid. But by using a class for the rays I'd started down a serial path. Read in a ray, assign ray to optimal grid traversal thread based on pos/dir, intersect ray with grid, intersect ray with grid cells contents (if any), move ray onwards in grid.

What I'd originally wanted to do was: optimal grid thread(s) fetches chunk of rays from pool based on pos/dir, the whole chunk gets intersected with grid then with objects (if any) and moves on

This seems like a trivial change and in fact it's perfectly possible to do it with stl / c++. I could define a method for each ray class that would return a boolean to the calling grid thread indicating if it should be assigned to it. This again is inefficient as each ray object would have to be queried in turn - exposing the ray pos/dir as a public would be slightly more efficient (although bad coding practice) and does not solve the problem of looking up a pointer to each ray class. That said, it's still possible to work out a quick way to traverse the pool of ray classes in the stl vector to determine which ones the grid thread should process.

The point here is not that c++ / stl worked perfectly BUT the way OO tends to force you into a particular way of thinking / implementation path.

Although OO can and does work well in a multithreading environment it does come from an era in software development where things were largely serial in nature and most of the design patterns etc tend to steer you away from an optimal solution in a "throughput computing" environment. OO has the added disadvantage of encouraging you to code for the single case and not for the group.

From now on I'm going to be very careful to avoid using "multithreaded" or "massively multithreaded" and "throughput computing" interchangeably as they are not the same thing at all. Although not mutually exclusive, multithreaded implies lots of things running together, each doing its own job serially and sometimes talking to the others via a variety of synchronization / sharing methods. Throughput computing is more about getting the job done efficiently: in general the higher the degree of sustained parallelism the higher the throughput.

So, how would I change my implementation?

Beware design patterns! Yes, great to use to get your work done - but efficient? Think carefully.

Rays would be generated and stored in a pool with only origin, direction, last grid intersection and tri intersection. The grid threads can then easily operate on chunks of this data and store results quickly and efficiently for the next kernel/grid (if traversing) to pick up.

The triangles could still be stored as classes, but as we are only interested in the colour (I'm not using textures), apex, 2 sides and a normal, it is much more efficient to store them in a flat structure and have each grid block store the indices of the triangles that are within its bounds.
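A minimal sketch of that flat layout, under some assumptions: the struct and function names (`Triangle`, `GridBlock`, `boundsOf`, `binTriangles`) are made up for illustration, the colour is assumed to be a packed 32-bit value, and triangles are binned conservatively by bounding box rather than by an exact triangle/box test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 min, max; };

// Flat triangle record: apex, two sides from the apex, a normal and a colour.
struct Triangle {
    Vec3 apex;
    Vec3 side1;
    Vec3 side2;
    Vec3 normal;
    std::uint32_t colour;  // packed RGBA (assumed encoding)
};

// A grid block keeps only indices into the shared triangle array.
struct GridBlock {
    Aabb bounds;
    std::vector<std::uint32_t> triangleIndices;
};

inline Vec3 add(const Vec3& a, const Vec3& b) {
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Bounding box of the three vertices: apex, apex + side1, apex + side2.
inline Aabb boundsOf(const Triangle& t) {
    const Vec3 v1 = add(t.apex, t.side1);
    const Vec3 v2 = add(t.apex, t.side2);
    return Aabb{
        Vec3{std::min({t.apex.x, v1.x, v2.x}),
             std::min({t.apex.y, v1.y, v2.y}),
             std::min({t.apex.z, v1.z, v2.z})},
        Vec3{std::max({t.apex.x, v1.x, v2.x}),
             std::max({t.apex.y, v1.y, v2.y}),
             std::max({t.apex.z, v1.z, v2.z})}};
}

inline bool overlaps(const Aabb& a, const Aabb& b) {
    return a.min.x <= b.max.x && a.max.x >= b.min.x &&
           a.min.y <= b.max.y && a.max.y >= b.min.y &&
           a.min.z <= b.max.z && a.max.z >= b.min.z;
}

// Conservative binning: a triangle is listed in every block its bounding box touches.
void binTriangles(const std::vector<Triangle>& triangles, std::vector<GridBlock>& blocks) {
    assert(triangles.size() <= 0xFFFFFFFFu);  // indices must fit in 32 bits
    for (GridBlock& block : blocks) block.triangleIndices.clear();
    for (std::size_t i = 0; i < triangles.size(); ++i) {
        const Aabb box = boundsOf(triangles[i]);
        for (GridBlock& block : blocks) {
            if (overlaps(box, block.bounds)) {
                block.triangleIndices.push_back(static_cast<std::uint32_t>(i));
            }
        }
    }
}
```

Storing indices rather than copies keeps the triangle data in one array, which is also the array a device kernel would read from.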

Arranging the data in blocks also allows us to re-arrange it to be in a format that is more friendly to the memory access patterns of the device / cpu.

Ultimately I see Objects starting to take more of a back seat in development especially in server side and throughput computing code. They still have an important role to play in many things - UI design is a good example. I can see some sort of DAG entity being the new "object" probably stored in pools of similar ones all needing the same sort of processing or dependencies. We will probably get a whole new bunch of design patterns to go along with them too - exciting stuff! Now who wants to write the new language / compiler??

So think carefully and make sure your implementation hasn't changed your way of thinking. The code is meant to describe your algorithm, not dictate its direction!

Now just to find some time to re-do my ray tracing code.... :)

GPU Temperature Monitor

2009-10-27T03:28:00.000-07:00

As of writing the combined download count of the GPU Thermal Monitor has hit 520 :)

So far I've yet to receive any major feedback on bugs etc., which leads me to believe that either a) it works perfectly or b) no-one is bothering to report issues. As I'm an optimist I'm going with option a :)

I've had more requests for remote monitoring of the GPU temperature via a simple HTTP request. This is something I need myself in order to keep track of temperatures in remote machines. It is now built in and undergoing testing and bug fixing, hopefully to be released soon. I've not used completion ports as they seemed like overkill for what should be a light-traffic application, but as the source is included under a Creative Commons license, please feel free to add them if needed. Having it open source also allows for some code review, which is important for security reasons now that it accepts remote connections.

If you have found a bug or would like another feature added please drop me a comment or email.

Amdahl's law

2009-10-19T03:50:00.000-07:00

A few months ago I made a post mentioning how I don't conform to the Amdahl's law way of thinking but never went into any details.

The law describes the speedup that can be obtained if you can parallelize a section of your problem. The speedup that can be obtained is described by the following equation:

$\frac{1}{(1-P)+\frac{P}{S}}$

Where P is the proportion of the problem that can be parallelized / sped up and S is the speedup amount.

Assuming that S -> infinity, then P/S -> 0, which leaves us with $\frac{1}{(1-P)}$.

This implies that no matter how many processors we add or how much we speed up the P portion of the problem, we can never do better than $\frac{1}{(1-P)}$. The biggest percentage improvement over the baseline comes at low values of S (or relatively low numbers of parallel processors). This result is observed in the field time and again: throwing more than 4 or 8 processors at a problem very seldom adds much beyond the large gains you get from the first 2 or 4 processors.
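To put numbers on this, here is a small sketch of the formula as code. The function names `amdahlSpeedup` and `amdahlLimit` are just illustrative.

```cpp
#include <cassert>
#include <cmath>

// Speedup predicted by Amdahl's law when a fraction p (0..1) of the work
// is sped up by a factor s and the remaining (1 - p) is left unchanged.
double amdahlSpeedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

// Upper bound on the speedup as s grows without limit.
double amdahlLimit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With P = 0.75, for example, this gives a speedup of 1.6 for S = 2, about 2.29 for S = 4 and about 2.91 for S = 8, against a limit of 4. Most of the available gain has already arrived by the first few processors.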

This equation can be expanded with multiple P and associated S terms in order to describe a more complex / lengthy problem (P1+P2+P3 = 100%):

$\frac{1}{\frac{P_1}{S_1}+\frac{P_2}{S_2}+\frac{P_3}{S_3}}$
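A sketch of the multi-section case, assuming the usual generalisation in which the sections run one after another, their time fractions sum to 1, and the overall speedup is 1 divided by the sum of each fraction over its own speedup. The `Section` struct and `multiSectionSpeedup` function are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One section of the problem: its share of the original run time and the
// factor by which that share is sped up (1.0 means not accelerated).
struct Section {
    double fraction;
    double speedup;
};

// Overall speedup when the sections run strictly one after another.
double multiSectionSpeedup(const std::vector<Section>& sections) {
    double totalFraction = 0.0;
    double newTime = 0.0;
    for (const Section& s : sections) {
        assert(s.fraction >= 0.0 && s.speedup > 0.0);
        totalFraction += s.fraction;
        newTime += s.fraction / s.speedup;
    }
    assert(std::fabs(totalFraction - 1.0) < 1e-9);  // fractions must cover 100%
    return 1.0 / newTime;
}
```

For instance, 50% of the work left as is, 25% sped up 2x and 25% sped up 5x gives 1 / (0.5 + 0.125 + 0.05), or roughly 1.48 overall.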

Certain problems where P is large do respond well to an increase in processors; these are known as "embarrassingly parallel". Ray tracing is rather a good example of this.

So why do I not agree with this if the equation makes sense?

The assumption that only P areas can be accelerated by S and strung together in a serial fashion is rather simplistic.

Why do we have to finish P1 before beginning P2? Even if the P2 area has dependencies on P1, it's rare for the entire P2 section to depend on a single result (of course there are cases, such as reduction kernels).

Maybe P3 can overlap P1 and P2; some sections may benefit from having more processors while others may reach an optimum at two. Why not overlap the sections and supply each with its optimal processing power? This is easy to achieve with Directed Acyclic Graphs (DAGs), which can even be computed on the fly, although they do get rather large!
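As a toy illustration of the overlap idea, the sketch below compares running the sections back to back with running them as a DAG where each section starts as soon as its dependencies finish. It assumes enough processors for every ready section, whole-section dependencies (coarser than the partial dependencies suggested above), and tasks listed in dependency order. `Task`, `serialTime` and `overlappedTime` are invented names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One section of the problem, already running at its own optimal processor count.
struct Task {
    double duration;
    std::vector<std::size_t> deps;  // indices of tasks that must finish first
};

// Time taken if the sections are simply strung together one after another.
double serialTime(const std::vector<Task>& tasks) {
    double total = 0.0;
    for (const Task& t : tasks) total += t.duration;
    return total;
}

// Time taken if every task starts as soon as its dependencies are done.
// Tasks must be listed so that each dependency has a lower index.
double overlappedTime(const std::vector<Task>& tasks) {
    std::vector<double> finish(tasks.size(), 0.0);
    double makespan = 0.0;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        double start = 0.0;
        for (std::size_t d : tasks[i].deps) {
            assert(d < i);
            start = std::max(start, finish[d]);
        }
        finish[i] = start + tasks[i].duration;
        makespan = std::max(makespan, finish[i]);
    }
    return makespan;
}
```

With P1 taking 4 units, P2 taking 3 units and depending on P1, and an independent P3 taking 5 units, the serial time is 12 while the overlapped time is 7.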

Quoting Amdahl's law as a reason why no further speed benefits are available in a system really just shows that thinking is still stuck in serial mode with little bursts of parallelism thrown in. Let's start thinking parallel in all areas and make the most of all available compute resources.

Fermi

2009-10-01T03:31:00.000-07:00

I've just finished reading the white paper released by NVIDIA, which you can find here.

Rather interestingly no mention of graphics performance has been made which, in a way, is really exciting. This has clearly been aimed at the high performance or throughput computing markets with the notable inclusion of ECC memory and increased double precision throughput along with the updated IEEE 754-2008 floating point support.

Concurrent kernel execution and faster context switching will allow, with the use of DAGs, the optimization of execution on the devices rather than just working out the most efficient order of kernels to execute sequentially.

Also tucked away in the white paper is a mention of predication at the instruction level, which should give greater control of divergent paths in your kernels.

The inclusion of C++ support will appeal to a lot of people, but I am rather unconvinced this is the correct way to go for throughput computing, as it will encourage the use of all the old patterns that may work well in serial cases but are often rather poor for enabling maximum throughput.

There is a lot more in the paper and already an announcement by Oak Ridge that they will be using it in a new supercomputer.

All in all it's a wonderful development and I can't help feeling that computing took a substantial leap forward today.