Wednesday 10 June 2009

CUDA Emulator Output

As many people have noticed, the same code executed in Emulator mode gives different floating-point results from the same kernels run in Debug or Release mode.

Although I know what causes this, I had never bothered to investigate the actual differences, as most of the stuff I write runs entirely on the GPU. Recently I have had to compare results between the CPU and the GPU and wrote some code to change the FPU settings. Firstly, a quick explanation:

By default the CPU (FPU) is set to use 80-bit floating point internally. This means that when you load an integer (fild) or a single/double float (fld) it gets converted to an 80-bit number on the FPU stack. All operations are performed internally at 80 bits, and when storing the result it is converted back to the correct floating-point width (single/double) (fst / fstp).

This method of operation is desirable as it reduces the effect of rounding / truncation on the intermediate results. Of course, while very useful for computing on the CPU, this is not how CUDA devices operate.



In CUDA all operations on a float occur at 32 bits (64 bits for a double), which means your intermediate operations will sometimes lose precision. In Emulator mode your code actually runs on the CPU, using the FPU's default precision and rounding settings. This causes the difference in output.
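To make that concrete, here is a tiny standalone illustration (not taken from the SDK sample) of how the width of the intermediate changes the answer; double is used below purely as a stand-in for the FPU's wider internal format:

    #include <stdio.h>

    int main(void)
    {
        float a = 16777216.0f;   /* 2^24: the next representable float above is 2^24 + 2 */
        float b = 1.0f;

        /* Intermediate forced down to 32 bits, as on a CUDA device:
           a + b rounds back to 2^24, so the subtraction gives 0. */
        volatile float t = a + b;
        float narrow = t - a;

        /* Intermediate kept wider (double standing in for the 80-bit
           temporaries the x87 unit keeps by default): the 1.0 survives. */
        double wide = ((double)a + (double)b) - (double)a;

        printf("32-bit intermediate: %g\n", narrow);  /* prints 0 */
        printf("wide intermediate:   %g\n", wide);    /* prints 1 */
        return 0;
    }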

For my testing I modified the Matrix Mul sample in the CUDA SDK to include code to change the CPU settings before running the Gold Kernel. (Code link follows below)

I turned the CPU's internal precision down to 32 bits in order to match the 32-bit floats the CUDA kernel uses. For Emulator mode I made sure the CPU was turned down to the same precision before running the kernel. As expected, the Gold and CUDA kernels' outputs match perfectly.

Next I ran in Debug mode (the kernel now executes on the GPU). As both the Gold kernel and the CUDA kernel are now working at 32 bits I expected the results to be the same. Rather interestingly, it turned out that they are slightly different. I then tried changing the CPU rounding settings, hoping to get the results to match up.

After trying all the rounding settings I discovered that the default setting (round to nearest even) gave the closest results to the Gold kernel, BUT they are still slightly out. I suspect this is down to differences in the internal workings of the floating-point units on the GPU.
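For reference, the change itself is only a few lines. Here is a minimal sketch of the idea using the MSVC CRT helper _controlfp_s (the actual download uses inline asm instead, and the function name below is purely for illustration; note the precision-control bits only affect the x87 unit, not SSE2 or x64 code):

    #include <float.h>

    /* Turn the FPU's internal precision down to a 24-bit mantissa
       (32-bit floats) and select round-to-nearest. */
    static void set_fpu_to_match_cuda(void)
    {
        unsigned int cw;
        _controlfp_s(&cw, _PC_24,   _MCW_PC);   /* precision control */
        _controlfp_s(&cw, _RC_NEAR, _MCW_RC);   /* rounding control  */
    }

Call something like this before running the Gold kernel (and, for Emulator mode, before launching the kernel).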

So in summary: if you are trying to compare kernel results between Emulator and Release mode you will never get exactly the same results, but the differences can be mitigated somewhat by changing the CPU/FPU's internal precision settings.

Now the code - released under the Creative Commons - Attribution-Share Alike 2.0 UK: England & Wales license - please download from here [download id="3"] or visit the downloads section.

As always please report any bugs.

This post's forum topic is here.

* I've used CPU/FPU interchangeably in this post. Back in ye olde days they used to be separate chips, so please excuse me :)

* The code will NOT compile under GCC compilers as I've used the Microsoft inline asm syntax.

11 comments:

  1. Thanks. Great information.
    I was also wondering why the results are different. This clarifies some things.

  2. Just in case you haven't already seen this info,

    You can enable SSE2 code generation in both cl (MSVC) and gcc and get it so that the old 80-bit x86 float stack isn't used (guessing this is what you meant by "turned down"). In fact, if you are on (and compiling for) a 64-bit OS, this is likely on by default. Also be careful with MSVC: simply setting SSE on but not SSE2 will result in the compiler still using the 80-bit x86 float stack, because SSE 32-bit float wasn't faster on early SSE1-only hardware. If I remember right, it will actually mix 80-bit and SSE 32-bit float operations in some cases.

    BTW you might also want to set nvcc to keep intermediate output files. In one of those intermediate files you can actually see the functions used to emulate the GPU (it seems all the CUDA emulation stuff gets tossed into one file). Other things you might want to check are issues with the compiler re-ordering FPU operations and issues with fused multiply-add vs x86 doing the multiply and add separately. For MSVC you might want to look at the /fp: options...

  3. Hi Timothy,

    Thanks for the tips. By "turned down" I meant turning the FPU down to 32 bits to match the CUDA floats internally.

    I have actually never used SSE(2) in MSVC or any other C compiler for that matter - I'd rather implement them in a pure asm module (dinosaur....) as then I know exactly what's going on and not what a compiler has decided for me.

    My build rules are set with the -keep option already :) There is a lot of information in those files which otherwise is hard to obtain - lmem usage for example.

    That said, I did not check for the compiler re-ordering the FPU instructions.... op ordering often causes rounding / truncation errors. Thanks for the advice :) I'll have a look at the code this evening.

    By the way - you mentioned gcc - have you ever managed to get nvcc to use gcc as the foreign compiler on a Windows machine?

  4. I am learning LBM. I need LBM code for lid driven cavity. Thank you.

    I've been using nvcc with gcc on a 64-bit Linux box, and don't have CUDA on a Windows machine (at home) to play with. If you do end up attempting gcc on a Windows machine I'd suggest going with MinGW. I've got no idea about nvcc support for gcc on Windows machines however. BTW, /arch:SSE2 can be used with MSVC to get the compiler to generate SSE2 floating-point code. If you are using an MSVC project the option can be found at Configuration Properties->C/C++->Enable Enhanced Instruction Set.

    I've got nvcc to play nicely with gcc on my Linux box but it doesn't want to use my MinGW gcc on the Windows box at all. Strange really, as the command line is the same. Fairly irritating, as I really wanted the MinGW gcc object file since it links nicely with gfortran...

    Thanks for the compiler / config details - I'll give that a try later tonight on some image processing code I'm busy with.

    By the way I see your hobbies are listed as: Fighting and Road Racing - so I'll make sure I approve your comments extra quickly from now on! :)

    The reason you don't get exactly the same results is simple: in floating-point operations a+b+c != c+b+a, meaning that due to floating-point error and rounding a different order of operations will probably produce a different answer. As part of the matrix multiply is a vector collapse (a reduction), the ordering of the operations is different - even between runs on the GPU, I believe.

  8. Hi Eri,

    You are correct, operation ordering is important. Although I did make sure the order of operations in the ptx was the "same", I could not be sure exactly what order they end up in once compiled without using a disassembler.
    Quite a good exercise is to rearrange your commutative operations in n! ways and then compare the results.
    For example: x+y-z*(a+c-d)

    gives you: (partial list)
    x - z*(a+c-d) + y
    y + x - z*(-d+c+a)
    etc.

    It is also possible to calculate your estimated max/min error based on your precision and order of operations.
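    A tiny standalone illustration of the ordering effect (the values are picked purely for demonstration):

        #include <stdio.h>

        int main(void)
        {
            volatile float big_first   = 1.0e8f;  /* start from the large term  */
            volatile float small_first = 0.0f;    /* start from the small terms */
            int i;

            /* At 1e8 the spacing between adjacent floats is 8, so each +1.0f
               added to the large running total rounds straight back to 1e8. */
            for (i = 0; i < 1000000; ++i) big_first   += 1.0f;
            for (i = 0; i < 1000000; ++i) small_first += 1.0f;
            small_first += 1.0e8f;

            printf("large term first:  %.1f\n", big_first);    /* 100000000.0 */
            printf("small terms first: %.1f\n", small_first);  /* 101000000.0 */
            return 0;
        }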

  9. Very interesting. Thx for the info!
