<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: CUDA Emulator Output</title>
	<atom:link href="http://www.bv2.co.uk/?feed=rss2&#038;p=910" rel="self" type="application/rss+xml" />
	<link>http://www.bv2.co.uk/?p=910</link>
	<description>the same, maybe better?</description>
	<pubDate>Tue, 07 Sep 2010 05:44:18 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>By: Fortran &#38; CUDA &#187; Phần 4: Fortran + CUDA + Accelerator (0) install + compile</title>
		<link>http://www.bv2.co.uk/?p=910#comment-2946</link>
		<dc:creator>Fortran &#38; CUDA &#187; Phần 4: Fortran + CUDA + Accelerator (0) install + compile</dc:creator>
		<pubDate>Wed, 16 Dec 2009 03:43:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-2946</guid>
		<description>[...] http://www.bv2.co.uk/?p=910 Categories: Computational Modelling, Bioinformatics. Tags: CUDA, Fortran, supercomputing with CUDA [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://www.bv2.co.uk/?p=910" rel="nofollow">http://www.bv2.co.uk/?p=910</a> Categories: Computational Modelling, Bioinformatics. Tags: CUDA, Fortran, supercomputing with CUDA [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: bert</title>
		<link>http://www.bv2.co.uk/?p=910#comment-2226</link>
		<dc:creator>bert</dc:creator>
		<pubDate>Fri, 16 Oct 2009 11:16:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-2226</guid>
		<description>Very interesting. Thx for the info!</description>
		<content:encoded><![CDATA[<p>Very interesting. Thx for the info!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phần 4: Fortran + CUDA (0) install + compile &#171; Vietnamen&#8217;s Weblog</title>
		<link>http://www.bv2.co.uk/?p=910#comment-2184</link>
		<dc:creator>Phần 4: Fortran + CUDA (0) install + compile &#171; Vietnamen&#8217;s Weblog</dc:creator>
		<pubDate>Thu, 08 Oct 2009 20:23:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-2184</guid>
		<description>[...] http://www.bv2.co.uk/?p=910 [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://www.bv2.co.uk/?p=910" rel="nofollow">http://www.bv2.co.uk/?p=910</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Barrett</title>
		<link>http://www.bv2.co.uk/?p=910#comment-1029</link>
		<dc:creator>Barrett</dc:creator>
		<pubDate>Tue, 16 Jun 2009 08:48:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-1029</guid>
		<description>Hi Eri,

You are correct, operation ordering is important. Although I did make sure the order of operations in the PTX was the "same", I could not be sure exactly what order they end up in once compiled without using a disassembler.
Quite a good exercise is to rearrange your commutative operations in n! ways and then compare the results.
For example:  x+y-z*(a+c-d)

gives you: (partial list)
x-z*(a+c-d)+y
y+x-z*(-d+c+a)
etc.

It is also possible to calculate your estimated max/min error based on your precision and order of operations.</description>
		<content:encoded><![CDATA[<p>Hi Eri,</p>
<p>You are correct, operation ordering is important. Although I did make sure the order of operations in the PTX was the &#8220;same&#8221;, I could not be sure exactly what order they end up in once compiled without using a disassembler.<br />
Quite a good exercise is to rearrange your commutative operations in n! ways and then compare the results.<br />
For example:  x+y-z*(a+c-d)</p>
<p>gives you: (partial list)<br />
x-z*(a+c-d)+y<br />
y+x-z*(-d+c+a)<br />
etc.</p>
<p>It is also possible to calculate your estimated max/min error based on your precision and order of operations.</p>
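<p>A minimal sketch of the reordering exercise in Python. The values and the helper name <code>distinct_sums</code> are assumptions for illustration only, not taken from the original code. It adds the same terms left to right in every possible order and collects the distinct results:</p>

```python
from itertools import permutations


def distinct_sums(terms):
    """Sum the terms left to right in every order; return the distinct results."""
    results = set()
    for order in permutations(terms):
        total = 0.0
        for term in order:
            total += term  # each addition rounds to the nearest double
        results.add(total)
    return results


# Hypothetical values chosen so the rounding is easy to follow:
# 1e16 + 1.0 rounds back to 1e16 in double precision.
terms = [1e16, 1.0, -1e16]
print(sorted(distinct_sums(terms)))
```

<p>With these values the exact sum is 1.0, but any order that adds the 1.0 before the two large terms have cancelled loses it and gives 0.0.</p>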
]]></content:encoded>
	</item>
	<item>
		<title>By: Eri</title>
		<link>http://www.bv2.co.uk/?p=910#comment-1003</link>
		<dc:creator>Eri</dc:creator>
		<pubDate>Thu, 11 Jun 2009 13:49:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-1003</guid>
		<description>The reason you don't get exactly the same results is simple. In floating point operations a+b+c != c+b+a, meaning that due to floating point error and rounding, a different order of operations will probably result in a different answer. Because part of the vector-matrix multiply is a vector collapse, the ordering of the operations is different, even between runs on the GPU, I believe.</description>
		<content:encoded><![CDATA[<p>The reason you don&#8217;t get exactly the same results is simple. In floating point operations a+b+c != c+b+a, meaning that due to floating point error and rounding, a different order of operations will probably result in a different answer. Because part of the vector-matrix multiply is a vector collapse, the ordering of the operations is different, even between runs on the GPU, I believe.</p>
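<p>A rough sketch of the point about the vector collapse, in Python. It emulates single precision on the CPU by rounding through <code>struct</code>; the helper names, the data, and the pairwise grouping (one common way a parallel reduction groups its additions) are all assumptions for illustration, not a description of what the GPU actually does here.</p>

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sequential_sum(values):
    """Add left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def tree_sum(values):
    """Add in pairs, level by level, rounding to float32 after every addition."""
    level = [f32(v) for v in values]
    while len(level) > 1:
        paired = [f32(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired last element to the next level
            paired.append(level[-1])
        level = paired
    return level[0]


# Hypothetical data: the exact sum is 2.0, but float32 values near 1e8 are 8 apart.
data = [1e8, 1.0, -1e8, 1.0]
print(sequential_sum(data), tree_sum(data))
```

<p>Here the sequential order gives 1.0 and the pairwise order gives 0.0, although both add exactly the same four numbers.</p>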
]]></content:encoded>
	</item>
	<item>
		<title>By: Barrett</title>
		<link>http://www.bv2.co.uk/?p=910#comment-1001</link>
		<dc:creator>Barrett</dc:creator>
		<pubDate>Wed, 10 Jun 2009 18:01:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-1001</guid>
		<description>I've got nvcc to play nicely with gcc on my linux box but it doesn't want to use my mingw gcc on the windows box at all.  Strange really as the command line is the same. Fairly irritating as I really wanted the mingw gcc object file as it links nicely with gfortran...

Thanks for the compiler / config details - I'll give that a try later tonight on some image processing code I'm busy with.

By the way I see your hobbies are listed as:  Fighting and Road Racing - so I'll make sure I approve your comments extra quickly from now on! :)</description>
		<content:encoded><![CDATA[<p>I&#8217;ve got nvcc to play nicely with gcc on my linux box but it doesn&#8217;t want to use my mingw gcc on the windows box at all.  Strange really as the command line is the same. Fairly irritating as I really wanted the mingw gcc object file as it links nicely with gfortran&#8230;</p>
<p>Thanks for the compiler / config details - I&#8217;ll give that a try later tonight on some image processing code I&#8217;m busy with.</p>
<p>By the way I see your hobbies are listed as:  Fighting and Road Racing - so I&#8217;ll make sure I approve your comments extra quickly from now on! <img src='http://www.bv2.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Timothy Farrar</title>
		<link>http://www.bv2.co.uk/?p=910#comment-1000</link>
		<dc:creator>Timothy Farrar</dc:creator>
		<pubDate>Wed, 10 Jun 2009 17:42:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-1000</guid>
		<description>I've been using nvcc with gcc on a 64-bit Linux box, and don't have CUDA on a Windows machine (at home) to play with. If you do end up attempting gcc on a Windows machine I'd suggest going with MinGW. I've got no idea about nvcc support for gcc on Windows machines, however. BTW, /arch:SSE2 can be used with MSVC to get the compiler to generate SSE2 floating point code. If you are using an MSVC project the option can be found at Configuration Properties-&#62;C/C++-&#62;Enable Enhanced Instruction Set.</description>
		<content:encoded><![CDATA[<p>I&#8217;ve been using nvcc with gcc on a 64-bit Linux box, and don&#8217;t have CUDA on a Windows machine (at home) to play with. If you do end up attempting gcc on a Windows machine I&#8217;d suggest going with MinGW. I&#8217;ve got no idea about nvcc support for gcc on Windows machines, however. BTW, /arch:SSE2 can be used with MSVC to get the compiler to generate SSE2 floating point code. If you are using an MSVC project the option can be found at Configuration Properties-&gt;C/C++-&gt;Enable Enhanced Instruction Set.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Siva</title>
		<link>http://www.bv2.co.uk/?p=910#comment-999</link>
		<dc:creator>Siva</dc:creator>
		<pubDate>Wed, 10 Jun 2009 17:29:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-999</guid>
		<description>I am learning LBM. I need LBM code for lid driven cavity. Thank you.</description>
		<content:encoded><![CDATA[<p>I am learning LBM. I need LBM code for lid driven cavity. Thank you.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Barrett</title>
		<link>http://www.bv2.co.uk/?p=910#comment-998</link>
		<dc:creator>Barrett</dc:creator>
		<pubDate>Wed, 10 Jun 2009 15:11:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-998</guid>
		<description>Hi Timothy,

Thanks for the tips.  By "turned down" I meant turning the FPU precision down to 32 bits internally to match the CUDA floats.

I have actually never used SSE(2) in MSVC or any other C compiler for that matter - I would rather implement it in a pure asm module (dinosaur....) as then I know exactly what's going on and not what a compiler has decided for me.

My build rules are set with the -keep option already :) There is a lot of information in those files which otherwise is hard to obtain - lmem usage for example.

That said: I did not check the compiler re-ordering the FPU instructions....  op ordering often causes rounding / trunc errors. Thanks for the advice :) I'll have a look at the code this evening.

By the way - you mentioned gcc - have you ever managed to get nvcc to use gcc as the foreign compiler on a Windows machine?</description>
		<content:encoded><![CDATA[<p>Hi Timothy,</p>
<p>Thanks for the tips.  By &#8220;turned down&#8221; I meant turning the FPU precision down to 32 bits internally to match the CUDA floats.</p>
<p>I have actually never used SSE(2) in MSVC or any other C compiler for that matter - I would rather implement it in a pure asm module (dinosaur&#8230;.) as then I know exactly what&#8217;s going on and not what a compiler has decided for me.</p>
<p>My build rules are set with the -keep option already <img src='http://www.bv2.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> There is a lot of information in those files which otherwise is hard to obtain - lmem usage for example. </p>
<p>That said: I did not check the compiler re-ordering the FPU instructions&#8230;.  op ordering often causes rounding / trunc errors. Thanks for the advice <img src='http://www.bv2.co.uk/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> I&#8217;ll have a look at the code this evening.</p>
<p>By the way - you mentioned gcc - have you ever managed to get nvcc to use gcc as the foreign compiler on a Windows machine?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Timothy Farrar</title>
		<link>http://www.bv2.co.uk/?p=910#comment-997</link>
		<dc:creator>Timothy Farrar</dc:creator>
		<pubDate>Wed, 10 Jun 2009 14:54:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.bv2.co.uk/?p=910#comment-997</guid>
		<description>Just in case you haven't already seen this info,

You can enable SSE2 code generation in both cl (MSVC) and gcc so that the old 80-bit x87 float stack isn't used (guessing this is what you meant by "turned down"). In fact, if you are on (and compiling for) a 64-bit OS, this is likely on by default. Also be careful with MSVC. Simply setting SSE on but not SSE2 will result in the compiler still using the 80-bit x87 float stack, because SSE 32-bit float wasn't faster on early SSE1-only hardware. If I remember right, it will actually mix 80-bit and SSE 32-bit float operations in some cases.

BTW you might also want to set nvcc to keep intermediate output files. In one of those intermediate files you can actually see the functions used to emulate the GPU (seems like all the CUDA emulation stuff gets tossed into one file). Other things you might want to check out are the compiler re-ordering FPU operations, and fused multiply+add on the GPU versus separate multiply and add on x86. For MSVC you might want to look at the /fp: options...</description>
		<content:encoded><![CDATA[<p>Just in case you haven&#8217;t already seen this info,</p>
<p>You can enable SSE2 code generation in both cl (MSVC) and gcc so that the old 80-bit x87 float stack isn&#8217;t used (guessing this is what you meant by &#8220;turned down&#8221;). In fact, if you are on (and compiling for) a 64-bit OS, this is likely on by default. Also be careful with MSVC. Simply setting SSE on but not SSE2 will result in the compiler still using the 80-bit x87 float stack, because SSE 32-bit float wasn&#8217;t faster on early SSE1-only hardware. If I remember right, it will actually mix 80-bit and SSE 32-bit float operations in some cases.</p>
<p>BTW you might also want to set nvcc to keep intermediate output files. In one of those intermediate files you can actually see the functions used to emulate the GPU (seems like all the CUDA emulation stuff gets tossed into one file). Other things you might want to check out are the compiler re-ordering FPU operations, and fused multiply+add on the GPU versus separate multiply and add on x86. For MSVC you might want to look at the /fp: options&#8230;</p>
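<p>A small sketch of the fused multiply+add issue, in Python. It assumes a fused operation that rounds only once at the end, and emulates single precision on the CPU; the helper names and input values are assumptions for illustration and say nothing about how a particular GPU or compiler behaves.</p>

```python
import struct
from fractions import Fraction


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded to float32 before the add."""
    product = f32(a * b)  # the double product of two float32 values is exact
    return f32(product + c)


def fused_mul_add(a, b, c):
    """Emulated fused multiply-add: a*b + c is formed exactly, then rounded."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Simplification: rounds via double first, which is harmless for these inputs.
    return f32(float(exact))


# Hypothetical inputs, all exactly representable in float32.
a = b = 1.0 + 2.0 ** -12
c = -(1.0 + 2.0 ** -11)
print(mul_then_add(a, b, c), fused_mul_add(a, b, c))
```

<p>With these inputs the separate version returns 0.0, because the low bit of the product is rounded away before the add, while the emulated fused version keeps it and returns 2^-24.</p>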
]]></content:encoded>
	</item>
</channel>
</rss>
