2.86. Matrix Multiplication, Scaling, and GELU

Performs a dense matrix multiplication on the input, scales the resulting values by dividing by a constant, and calculates the Gaussian Error Linear Unit (GELU) activation on the scaled data.
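For reference, the unfused operation corresponds to a small PyTorch module along these lines. This is a minimal sketch; the class name, attribute names, and sizes are illustrative assumptions, not taken from the original.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MatMulDivGELU(nn.Module):
    """Unfused reference: y = GELU((x @ W^T + b) / divisor)."""
    def __init__(self, input_size: int, output_size: int, divisor: float):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)  # dense matmul + bias
        self.divisor = divisor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.linear(x) / self.divisor                 # matmul, bias, then scaling
        return F.gelu(y, approximate="tanh")              # tanh-based GELU, matching the kernel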

Progress: Step 4 of 4
Reward: 1.57x

Optimization Timeline: Step 4 Details

Reasoning

We can still squeeze more performance out of our tiled GEMM kernel by using intrinsic functions that accelerate division and the hyperbolic tangent. In particular, replacing the division operator with __fdividef and tanhf with __tanhf may offer a small speed improvement. We have already unrolled the inner loop with #pragma unroll, and our tile sizes are fixed at 16, so these small tweaks are genuine micro-optimizations that reduce instruction count and improve throughput on the GPU. Before finalizing, I ask myself: have my improvements genuinely helped? Using __fdividef and __tanhf instead of the standard functions gives lower precision, but that is acceptable for our fusion of GEMM, division, and GELU. These changes are real improvements and not just workarounds. I now provide the Python module (which remains unchanged) and the CUDA kernel with these micro-optimizations.

--- FIX ---

The kernel compilation failed because we attempted to call __tanhf from device code, which is not allowed. Instead, we can simply use tanhf (provided by CUDA's device math library) for the hyperbolic tangent. This resolves the compilation error while preserving the optimization, since tanhf is still a fast device function. With this fix, the tiled GEMM kernel computes the output correctly and fuses bias addition, division (using __fdividef), and GELU activation via the tanh approximation. I'll now update the CUDA kernel accordingly.
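The Python-side wrapper referred to above is not shown in this step. A minimal sketch of what it might look like, assuming the CUDA listing below is saved to a file and built with torch.utils.cpp_extension.load; the file name and class name here are illustrative assumptions.

import torch
import torch.nn as nn
from torch.utils.cpp_extension import load

# Compile the CUDA listing below; the bindings come from its PYBIND11_MODULE.
fused_ext = load(
    name="fused_gemm_div_gelu",
    sources=["fused_gemm_div_gelu.cu"],  # hypothetical file holding the kernel
    verbose=False,
)

class FusedMatMulDivGELU(nn.Module):
    """Thin wrapper that holds the Linear parameters and calls the fused kernel."""
    def __init__(self, input_size: int, output_size: int, divisor: float):
        super().__init__()
        # Same layout as nn.Linear: weight is [output_size, input_size].
        linear = nn.Linear(input_size, output_size)
        self.weight = linear.weight
        self.bias = linear.bias
        self.divisor = float(divisor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return fused_ext.forward(
            x.contiguous(), self.weight.contiguous(), self.bias, self.divisor
        )

The CUDA kernel itself follows.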

// This kernel implements a tiled GEMM for computing:
//   Y = A * (B^T), where A: [M x K] (input matrix)
//   and B: [N x K] (weight matrix).
// So that Y[i][j] = sum_{k=0}^{K-1} A[i][k] * B[j][k],
// and then it fuses bias addition, division by a scalar,
// and GELU activation using the tanh approximation.
//
// This version applies micro-optimizations by using __fdividef
// for division and tanhf for the hyperbolic tangent.
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>
#include <math.h>

#define TILE_M 16
#define TILE_N 16
#define TILE_K 16

__global__ void fused_gemm_div_gelu_tiled_kernel(const float* __restrict__ A, 
                                                 const float* __restrict__ B, 
                                                 const float* __restrict__ bias, 
                                                 float* __restrict__ C,
                                                 int M, int N, int K, 
                                                 float divisor) {
    // Calculate the row and column indices for the output tensor.
    int row = blockIdx.y * TILE_M + threadIdx.y;
    int col = blockIdx.x * TILE_N + threadIdx.x;
    
    float acc = 0.0f;
    
    // Shared memory tiles for A and B.
    __shared__ float sA[TILE_M][TILE_K];
    __shared__ float sB[TILE_K][TILE_N];
    
    int numTiles = (K + TILE_K - 1) / TILE_K;
    for (int t = 0; t < numTiles; t++) {
        // Load a tile of A into shared memory.
        int a_col = t * TILE_K + threadIdx.x;
        if (row < M && a_col < K)
            sA[threadIdx.y][threadIdx.x] = A[row * K + a_col];
        else
            sA[threadIdx.y][threadIdx.x] = 0.0f;
        
        // Load a tile of B into shared memory.
        // Note: B is stored as [N x K], but we load it transposed.
        int b_index = t * TILE_K + threadIdx.y;
        if (col < N && b_index < K)
            sB[threadIdx.y][threadIdx.x] = B[col * K + b_index];
        else
            sB[threadIdx.y][threadIdx.x] = 0.0f;
        
        __syncthreads();
        
        // Compute the partial product for the tile.
        #pragma unroll
        for (int k = 0; k < TILE_K; k++) {
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        }
        
        __syncthreads();
    }
    
    // Write the final output with fused bias addition, division, and GELU activation.
    if (row < M && col < N) {
        float val = acc + bias[col];
        val = __fdividef(val, divisor);
        float x = val;
        float x_cube = x * x * x;
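        // 0.79788456f ≈ sqrt(2/pi); together with 0.044715f this is the
        // standard tanh-based approximation of GELU.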
        float tanh_val = tanhf(0.79788456f * (x + 0.044715f * x_cube));
        C[row * N + col] = 0.5f * x * (1.0f + tanh_val);
    }
}

torch::Tensor forward(torch::Tensor x, torch::Tensor weight, torch::Tensor bias, float divisor) {
    // x: [M, K] where M = batch_size and K = input_size.
    // weight: [N, K] where N = output_size.
    c10::cuda::CUDAGuard device_guard(x.device());
    
    int M = x.size(0);
    int K = x.size(1);
    int N = weight.size(0);
    
    TORCH_CHECK(weight.size(1) == K, "Mismatch between input size and weight dimensions.");
    // The kernel indexes raw pointers assuming row-major contiguous layout.
    TORCH_CHECK(x.is_contiguous() && weight.is_contiguous() && bias.is_contiguous(),
                "x, weight, and bias must be contiguous.");
    
    auto output = torch::empty({M, N}, x.options());
    
    dim3 block(TILE_N, TILE_M);
    dim3 grid((N + TILE_N - 1) / TILE_N, (M + TILE_M - 1) / TILE_M);
    
    fused_gemm_div_gelu_tiled_kernel<<<grid, block, 0, c10::cuda::getCurrentCUDAStream()>>>(
        x.data_ptr<float>(),
        weight.data_ptr<float>(),
        bias.data_ptr<float>(),
        output.data_ptr<float>(),
        M, N, K, divisor);
        
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &forward, "Fused Tiled GEMM, division and GELU activation (CUDA)");
}
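A quick way to sanity-check the fused kernel is to compare it against the unfused PyTorch ops on random inputs, using the fused_ext handle from the wrapper sketch above. The sizes and tolerances below are arbitrary assumptions; __fdividef and the tanh approximation trade a little precision for speed, so a loose tolerance is used.

import torch
import torch.nn.functional as F

M, K, N, divisor = 128, 512, 256, 2.0
x = torch.randn(M, K, device="cuda")
weight = torch.randn(N, K, device="cuda")
bias = torch.randn(N, device="cuda")

# Fused kernel (compiled extension).
y_fused = fused_ext.forward(x, weight, bias, divisor)

# Unfused reference using the same tanh-based GELU approximation.
y_ref = F.gelu((x @ weight.t() + bias) / divisor, approximate="tanh")

print(torch.allclose(y_fused, y_ref, atol=1e-3, rtol=1e-3))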