How to Improve CUDA Kernel Performance with Shared Memory Register Spilling