D'oh! After all my mucking around with performance calculations for register, shared memory and global memory usage, I discovered "CUDA_Occupancy_calculator.xls" lurking in the tools directory, which does it all for you. It's even mentioned in the docs ... another D'oh!
I don't feel it was a complete waste of time as I now understand the inner workings of the multiprocessors a lot better.
If you do want to use the spreadsheet, don't forget to compile with the -cubin option, which will tell you the register usage, shared memory usage, etc. that you need to plug into it.
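For anyone curious about the kind of arithmetic involved, here is a rough sketch in Python. It is not the spreadsheet's actual formula: it ignores allocation granularity, and the per-multiprocessor limits (8192 registers, 16 KB shared memory, 24 warps, 8 blocks) are my assumption for first-generation, compute capability 1.0 hardware. The names `SMLimits` and `estimate_occupancy` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-multiprocessor limits. ASSUMED values for compute capability 1.0;
    check the programming guide for your own device."""
    warp_size: int = 32
    max_warps: int = 24            # i.e. 768 resident threads
    max_blocks: int = 8
    registers: int = 8192
    shared_mem_bytes: int = 16384


def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                       limits=SMLimits()):
    """Return (resident blocks per multiprocessor, occupancy as a fraction).

    Simplified: no register/shared-memory allocation granularity, so the
    result can be slightly optimistic compared with the spreadsheet.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")

    # Threads are scheduled in whole warps, so round up.
    warps_per_block = -(-threads_per_block // limits.warp_size)

    # How many blocks fit under each individual resource limit.
    by_warps = limits.max_warps // warps_per_block
    regs_per_block = regs_per_thread * threads_per_block
    by_regs = (limits.registers // regs_per_block
               if regs_per_block > 0 else limits.max_blocks)
    by_smem = (limits.shared_mem_bytes // smem_per_block
               if smem_per_block > 0 else limits.max_blocks)

    # The tightest limit decides how many blocks are resident at once.
    blocks = min(limits.max_blocks, by_warps, by_regs, by_smem)
    occupancy = blocks * warps_per_block / limits.max_warps
    return blocks, occupancy


# 256 threads/block, 10 registers/thread, 4 KB shared memory per block
print(estimate_occupancy(256, 10, 4096))  # (3, 1.0)
```

The register and shared memory numbers you feed in are the ones reported when compiling with -cubin. Bumping the example to 16 registers per thread drops it to 2 resident blocks, which is the sort of cliff the spreadsheet makes easy to spot.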