When I worked at a bingo software supplier I used to design and implement the server side systems. To most developers solving a bingo card based on a random draw seems trivial, and so it is. Factor in the following: tens of thousands of cards, very short time span to calculate results, who won, 3,2,1 to go, bonus shapes - and it soon becomes quite a challenging realtime problem. Thereafter you need to produce output which the (typically flash) client is able to read and process quickly which is a really big string handling problem (very often 10 million string pieces).
Of course all of this is possible in C# or C++. I've seen systems written in these take up to 30 seconds to calculate the results (and you thought the purchase "settle" time was just to let late purchases caused by latency etc though ....) While even 10 seconds seems acceptable it causes a big loss of profit for the bingo operators. The bingo sites make money by taking a small rake from every card purchased. The more cards purchased and the higher the turnaround of games the more money they make.
I solved this by writing the card / draw / string system with SIMD style instructions in asm which worked rather nicely and could solve the above problem in under 500ms. This was still only for a single bingo hall and room. To do a whole bingo network at once (say a system wide game with a jackpot) could take quite a bit longer and use rather a large chunk of memory.
I'm a firm believer in parallelism and most (if not all) problems can be resolved with a bit of careful thought in a parallel way. Bingo systems would benefit hugely from a CUDA style implementation. (Un)fortunately? I left the bingo business before GPU systems became more mainstream. A 1u tesla box would easily pay for itself in a gaming environment so cost wouldn't be an issue. Actually casino / gaming hosting is extremely expensive so only using 1u for 4 teraflops would make fantastic commercial sense.
My proposed implementation would be something like this:
Generate the ball draws using the standard RNG (very often connected to a radioactive source)
Transfer the ball draws to the GPU's constant memory (there are not many of them)
Transfer all the cards to the GPU and bind as textures
Compare the ball draw order to the cards and compare (using logical operators) to bonus shapes using 1 thread per card
Write out the hit order for the card to another section of memory along with 3,2,1 to go
using another kernel construct all the strings
Copy back from the gpu devices
If you fire that off with 10k or more threads you can solve 10k cards in a few milliseconds. The benefit here would be using it to solve 100k or more cards at once - 100k threads is still trivial for a CUDA device.
The above may seem like a very silly example of parallel programming but I think it serves to demonstrate that anything can be converted to use CUDA AND you will get performance benefits and make real monetary savings.
The old regime of thinking serially and looking for chances to use parallelism needs to be changed to one where you immediately think parallel and only use serial where you absolutely have to. Moores law is basically finished the only way we can get continued increases in speed are using multiple cores working on a problem together.
Now to implement this and make millions.... or perhaps not :p