1.49. Max Reduction Operation

Calculates the maximum value over a specified dimension of the input data, reducing the tensor along that dimension.

Progress

Step 1 of 1

Reward

1.86x

Optimization Timeline

Step 1 Details

Reasoning

We need to create a fused operator for performing a max reduction over a specified dimension. In this problem, the model performs a max reduction over a single dimension (passed as an argument), and get_init_inputs returns 1 indicating the reduction is over dimension 1. For an input tensor x of shape [batch, dim1, dim2], reducing over dimension 1 yields an output of shape [batch, dim2]. We implement a CUDA kernel that, for each output element at index (b, c), iterates over the reduction dimension (dim1) and computes the maximum of x[b, r, c]. For simplicity and given the relatively small size (e.g., 256 elements along the reduction dimension), we assign one thread per output element and perform the reduction sequentially. We use c10::cuda::CUDAGuard to ensure that operations happen on the correct device and launch the kernel on the current CUDA stream. The Python module loads the CUDA extension with extra flags "-O3" and "--use_fast_math" and defines a class CUDAModel whose forward method calls our custom operator.

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Kernel for max reduction over dimension 1.
// Input tensor x is assumed to have shape [B, R, C].
// The output tensor will have shape [B, C], where for each output element:
//   output[b, c] = max_{r=0}^{R-1} x[b, r, c].
__global__ void max_reduce_dim1_kernel(const float* __restrict__ x,
                                       float* __restrict__ output,
                                       int B, int R, int C) {
    // Compute the index in the output tensor.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int total = B * C;
    if (index < total) {
        int b = index / C;
        int c = index % C;
        // Initialize with the first element in the reduction dimension.
        int base = b * R * C;
        float max_val = x[base + c]; // element at r=0.
        // Loop over r from 1 to R-1.
        for (int r = 1; r < R; r++) {
            float val = x[base + r * C + c];
            if (val > max_val) {
                max_val = val;
            }
        }
        output[index] = max_val;
    }
}

torch::Tensor max_reduce(torch::Tensor x, int reduce_dim) {
    // Currently, only support reduction over dimension 1.
    if (reduce_dim != 1) {
        throw std::runtime_error("Only reduction over dimension 1 is supported.");
    }
    
    // Use CUDAGuard to enforce correct device.
    c10::cuda::CUDAGuard device_guard(x.device());
    
    // Expect x to be 3D: [B, R, C]
    TORCH_CHECK(x.dim() == 3, "Input tensor must be 3-dimensional");
    int B = x.size(0);
    int R = x.size(1);
    int C = x.size(2);
    
    // Output tensor shape: [B, C]
    auto output = torch::empty({B, C}, x.options());
    
    int total = B * C;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    
    max_reduce_dim1_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
        x.data_ptr<float>(),
        output.data_ptr<float>(),
        B, R, C);
    
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("max_reduce", &max_reduce, "Max reduction over dimension 1 using a custom CUDA kernel");
}