Tuesday 16 December 2008

Bingo and CUDA

When I worked at a bingo software supplier I used to design and implement the server-side systems. To most developers, solving a bingo card against a random draw seems trivial, and so it is. Factor in the following: tens of thousands of cards, a very short time span to calculate the results, who won, the 3, 2, 1 to go counts, bonus shapes - and it soon becomes quite a challenging real-time problem. Thereafter you need to produce output which the (typically Flash) client is able to read and process quickly, which is a really big string-handling problem (very often 10 million string pieces).



Of course all of this is possible in C# or C++. I've seen systems written in these take up to 30 seconds to calculate the results (and you thought the purchase "settle" time was just to let late purchases caused by latency etc. through...). While even 10 seconds seems acceptable, it causes a big loss of profit for the bingo operators. The bingo sites make money by taking a small rake from every card purchased: the more cards purchased and the higher the turnaround of games, the more money they make.

I solved this by writing the card / draw / string system with SIMD-style instructions in asm, which worked rather nicely and could solve the above problem in under 500ms. This was still only for a single bingo hall and room. To do a whole bingo network at once (say a system-wide game with a jackpot) could take quite a bit longer and use rather a large chunk of memory.
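The original was hand-written asm, but the core trick can be sketched in plain C++ (this is a scalar illustration, not the real code, and it assumes 90-ball bingo with 15 numbers per card): hold each card as a pair of 64-bit masks, so checking a card against the balls drawn so far is just an AND and a population count - exactly the sort of operation SIMD instructions let you run across many cards at a time.

#include <cstdint>

// One 90-ball card as a pair of bitmasks: bit (n-1) of 'lo' is set if the card
// contains ball n for n = 1..64, and bit (n-65) of 'hi' for n = 65..90.
struct CardMask {
    uint64_t lo;
    uint64_t hi;
};

// How many of the card's numbers appear in the drawn-ball mask so far.
// (__builtin_popcountll is the GCC/Clang builtin; CUDA device code would use __popcll.)
inline int hitsOnCard(const CardMask& card, const CardMask& drawn)
{
    return __builtin_popcountll(card.lo & drawn.lo) +
           __builtin_popcountll(card.hi & drawn.hi);
}

// A full house needs all 15 numbers, so "to go" is simply what's left.
inline int numbersToGo(const CardMask& card, const CardMask& drawn)
{
    return 15 - hitsOnCard(card, drawn);
}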

I'm a firm believer in parallelism: most (if not all) problems can be solved in a parallel way with a bit of careful thought. Bingo systems would benefit hugely from a CUDA-style implementation. (Un)fortunately? I left the bingo business before GPU systems became more mainstream. A 1U Tesla box would easily pay for itself in a gaming environment, so cost wouldn't be an issue. Casino / gaming hosting is extremely expensive, so getting 4 teraflops out of a single 1U would make fantastic commercial sense.

My proposed implementation would be something like this (a rough sketch of the per-card kernel follows the list):

- Generate the ball draws using the standard RNG (very often connected to a radioactive source).
- Transfer the ball draws to the GPU's constant memory (there are not many of them).
- Transfer all the cards to the GPU and bind them as textures.
- Compare the ball draw order against each card and against the bonus shapes (using logical operators), one thread per card.
- Write out the hit order for each card to another section of memory, along with the 3, 2, 1 to go counts.
- Using another kernel, construct all the output strings.
- Copy the results back from the GPU devices.
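A minimal sketch of the per-card kernel, assuming 90-ball bingo with 15 numbers per card and the full 90-ball draw order sitting in constant memory (the sketch uses plain global memory for the cards rather than textures, and the names are illustrative, not a production design):

// d_drawOrder[i] = the ball (1..90) drawn at position i.
__constant__ unsigned char d_drawOrder[90];

struct CardResult {
    int fullHouseAt;   // draw position (1-based) at which the card completes
    int toGoAfter;     // numbers still needed after 'ballsCalled' balls
};

__global__ void solveCards(const unsigned char* cards,   // 15 numbers per card, packed
                           CardResult* results,
                           int numCards,
                           int ballsCalled)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCards) return;

    const unsigned char* myCard = cards + c * 15;

    int lastHit = 0;      // latest draw position among the card's numbers
    int hitsSoFar = 0;    // card numbers appearing in the first 'ballsCalled' balls

    for (int i = 0; i < 15; ++i) {
        for (int pos = 0; pos < 90; ++pos) {
            if (d_drawOrder[pos] == myCard[i]) {
                if (pos + 1 > lastHit) lastHit = pos + 1;
                if (pos < ballsCalled) ++hitsSoFar;
                break;
            }
        }
    }

    results[c].fullHouseAt = lastHit;          // smallest draw length giving a full house
    results[c].toGoAfter   = 15 - hitsSoFar;   // e.g. the 3, 2, 1 to go counts
}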

 

If you fire that off with 10k or more threads, you can solve 10k cards in a few milliseconds. The real benefit would be solving 100k or more cards at once - 100k threads is still trivial for a CUDA device.
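For scale, a rough host-side launch for the solveCards sketch above (the h_* / d_* buffers are placeholders assumed to be allocated and filled elsewhere):

int numCards        = 100000;
int ballsCalled     = 45;     // however many balls have been called so far
int threadsPerBlock = 256;
int blocks          = (numCards + threadsPerBlock - 1) / threadsPerBlock;   // ~400 blocks for 100k cards

// Ball draw into constant memory, cards into global memory.
cudaMemcpyToSymbol(d_drawOrder, h_drawOrder, 90 * sizeof(unsigned char));
cudaMemcpy(d_cards, h_cards, numCards * 15, cudaMemcpyHostToDevice);

solveCards<<<blocks, threadsPerBlock>>>(d_cards, d_results, numCards, ballsCalled);

// A second kernel would build the client strings here; then copy the results back.
cudaMemcpy(h_results, d_results, numCards * sizeof(CardResult), cudaMemcpyDeviceToHost);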

The above may seem like a very silly example of parallel programming, but I think it serves to demonstrate that anything can be converted to use CUDA AND you will get performance benefits and make real monetary savings.

The old regime of thinking serially and looking for chances to use parallelism needs to be changed to one where you immediately think parallel and only go serial where you absolutely have to. Moore's law is basically finished; the only way we can get continued increases in speed is to use multiple cores working on a problem together.

Now to implement this and make millions....  or perhaps not :p

3 comments:

  1. You and Bingo Mr Barrett. You'd have thought you'd had enough of it by now :)

    Though you're right, a seemingly simple task made tricky by lots of permutations of patterns / bonuses etc...

    ... and middle tier / front end devs expecting responses encoded in convoluted formats ;)

  2. "Moores law is dead" indeed - we'll see!

    I say ditch silicon, graphene is the way for electron source / sinks - we'll squeeze a few more (or should I say moore) iterations of that law yet :)

  3. I wonder who the front end dev was? :p

    Indeed I probably have had enough of bingo - hence the above is more of a thought experiment than an actual implementation.

    "Now to implement this and make millions…. or perhaps not :p" was there for a reason!

    Even if graphene does provide a few more iterations (whatever happened to GaAs?), it will eventually stop providing any more speedups. Although it would be fantastic if quantum computing had taken off by then (more than a few gates would even be nice!), it's currently not looking like it will happen any time soon. So the best thing, in my opinion, is to start developing a parallel-algorithm mindset.

    Again congratulations on getting your code published Matt :)
