NVIDIA announced CUDA 9 a few weeks ago. I've been using CUDA since v1.1 and compute capability 1.0, and things have matured significantly over the years.
The new CUDA adds support for the new Volta architecture, C++14, faster libraries, and Tensor Core matrix multiplies, which are clearly targeting deep learning applications. But, for me, there is one standout feature: Cooperative Groups.
The release notes describe it as a new programming model for managing groups of communicating threads. What does that really mean?
Previously you could only synchronize threads across a single thread block, with the __syncthreads() function. Cooperative groups let you define groups of threads at the sub-block and multi-block levels, and even synchronize across the entire grid.
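As a rough sketch of the sub-block side (names taken from the CUDA 9 cooperative groups API in the `cooperative_groups` namespace; the kernel itself is a hypothetical example):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void sub_block_example(int *data) {
    // The whole thread block as a group -- syncing it is
    // equivalent to the old __syncthreads()
    cg::thread_block block = cg::this_thread_block();
    block.sync();

    // Partition the block into warp-sized, 32-thread tiles
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);
    tile.sync();  // synchronize only the 32 threads in this tile
}
```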
The grid sync means you no longer have to split a complex task over a data set into multiple successive kernel launches. A single kernel can now operate on the data using something like:
grid_group grid = this_grid();
// do something here
grid.sync();  // wait for every thread in the entire grid
// do something else here
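A grid-wide sync only works when the kernel is started with a cooperative launch via `cudaLaunchCooperativeKernel`. A minimal sketch, assuming hypothetical kernel and argument names, and a block count small enough for all blocks to be resident on the device at once:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void step_kernel(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    // phase 1: each thread works on its slice of data
    // ...
    grid.sync();  // all threads in the entire grid reach this point
    // phase 2: now safe to read results written by any other block
}

// Host side: grid sync requires a cooperative launch
void launch(float *d_data, int n) {
    int blocks = 32;  // must all fit co-resident on the device
    void *args[] = { &d_data, &n };
    cudaLaunchCooperativeKernel((void *)step_kernel,
                                dim3(blocks), dim3(256), args);
}
```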
You also get a this_multi_grid() variant, which will synchronize the kernel across all the GPUs it has been launched on!
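Inside the kernel, the multi-GPU case looks much the same (a hedged sketch; on the host side the multi-device variant is launched with `cudaLaunchCooperativeKernelMultiDevice`):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void multi_gpu_kernel(float *data) {
    cg::multi_grid_group multi = cg::this_multi_grid();
    // ... work on this GPU's portion of the data ...
    multi.sync();  // synchronizes across every GPU in the launch
}
```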
You no longer have to wait for the implicit sync at the end of a kernel launch and then launch another kernel from the CPU code. Presumably you will still be restricted by the driver timeout on your primary device.
This, coupled with pinned memory / zero-copy, means you can have long-running kernels performing all sorts of operations on memory that is streamed into the device.
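A minimal sketch of that persistent-kernel pattern, assuming a mapped pinned buffer and a hypothetical host-written quit flag (the names `quit` and `buf` are my own):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Long-running kernel polling a flag the host writes into
// zero-copy (mapped, pinned) memory.
__global__ void persistent_kernel(volatile int *quit, float *buf, int n) {
    cg::grid_group grid = cg::this_grid();
    while (*quit == 0) {
        // process the chunk the host has streamed into buf
        // ...
        grid.sync();  // every block finishes this chunk before the next
    }
}
```

On the host, allocating the flag and buffer with cudaHostAlloc and the cudaHostAllocMapped flag gives the GPU a pointer it can read directly while the kernel stays resident.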