1.13. Symmetric Matrix Multiplication
Computes the product of two symmetric matrices. The operation first enforces symmetry by averaging each input with its own transpose, then performs a standard matrix multiplication to produce the result.
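As a point of reference, the operation can be sketched in a few lines of PyTorch. The function name, matrix size, and explicit symmetrization step below are illustrative assumptions based on the description above, not the benchmark's actual reference code.

import torch

def symmetric_matmul_reference(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Enforce symmetry by averaging each matrix with its own transpose.
    A_sym = 0.5 * (A + A.T)
    B_sym = 0.5 * (B + B.T)
    # Standard matrix multiplication of the symmetrized inputs.
    return A_sym @ B_sym

# Hypothetical example with N x N float32 inputs on the GPU.
N = 4096
A = torch.randn(N, N, device="cuda")
B = torch.randn(N, N, device="cuda")
C = symmetric_matmul_reference(A, B)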
Optimization Timeline
Step 5 Details
Reasoning
While our previous custom operator wrapping cuBLAS achieved a speedup factor of about 1.014 over the native implementation, one way to optimize it further is to remove the repeated handle creation and destruction overhead by using a persistent cuBLAS handle. Creating and destroying the cuBLAS handle on every call adds extra overhead, especially when the operator is invoked repeatedly.

Instead, we create a static persistent handle that is initialized on the first call and then reused on subsequent calls. The code now retrieves the persistent handle (assumed per device) and sets its stream each time the operator is called. This modification should remove the minor overhead of the handle's lifecycle and further improve performance.

Before finishing, I ask: have my improvements genuinely helped? Yes: by reusing the handle, we avoid the repeated startup cost. This is a genuine optimization rather than a workaround, leveraging cuBLAS more efficiently without altering the correctness of our multiplication. Below is the updated Python and CUDA code.
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>
// Helper macro for cuBLAS error checking.
#define CHECK_CUBLAS(status) { \
    if ((status) != CUBLAS_STATUS_SUCCESS) { \
        printf("cuBLAS error: %d\n", (int)(status)); \
        exit(EXIT_FAILURE); \
    } \
}
// Function to retrieve a persistent cuBLAS handle.
// The static local ensures the handle is created only once and reused across calls.
static cublasHandle_t getCublasHandle() {
    static cublasHandle_t handle = nullptr;
    if (handle == nullptr) {
        CHECK_CUBLAS(cublasCreate(&handle));
    }
    return handle;
}
torch::Tensor matmul(torch::Tensor A, torch::Tensor B) {
    // Basic input checks: square, contiguous CUDA tensors of matching size.
    TORCH_CHECK(A.is_cuda() && B.is_cuda(), "A and B must be CUDA tensors");
    TORCH_CHECK(A.is_contiguous() && B.is_contiguous(), "A and B must be contiguous");
    TORCH_CHECK(A.size(0) == A.size(1) && A.sizes() == B.sizes(),
                "A and B must be square matrices of the same size");
    // Ensure the operation is performed on the correct CUDA device.
    c10::cuda::CUDAGuard device_guard(A.device());
    int N = A.size(0);
    // Allocate the output tensor C.
    auto C = torch::empty({N, N}, A.options());
    float alpha = 1.0f;
    float beta = 0.0f;
    // Retrieve the persistent cuBLAS handle and bind it to the current CUDA stream.
    cublasHandle_t handle = getCublasHandle();
    cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
    CHECK_CUBLAS(cublasSetStream(handle, stream));
    // cuBLAS assumes column-major storage, while PyTorch tensors are row-major.
    // Viewed column-major, the row-major A and B appear transposed, so computing
    // C_col = B_col * A_col yields (A * B)^T in column-major order, which is
    // exactly the row-major A * B we want in C. All matrices are N x N with
    // leading dimension N.
    cublasStatus_t stat = cublasSgemm(
        handle,
        CUBLAS_OP_N,          // No transpose on B.
        CUBLAS_OP_N,          // No transpose on A.
        N,                    // Rows of op(B) and of C.
        N,                    // Columns of op(A) and of C.
        N,                    // Inner (shared) dimension.
        &alpha,
        B.data_ptr<float>(),  // Matrix B.
        N,                    // Leading dimension of B.
        A.data_ptr<float>(),  // Matrix A.
        N,                    // Leading dimension of A.
        &beta,
        C.data_ptr<float>(),  // Matrix C.
        N);                   // Leading dimension of C.
    CHECK_CUBLAS(stat);
    // The handle is persistent, so it is intentionally not destroyed here.
    return C;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("matmul", &matmul,
          "High-performance symmetric matrix multiplication using a persistent cuBLAS handle");
}