Boosting Matrix Multiplication Speed and Flexibility with NVIDIA cuBLAS 12.9