1.41. 1D Max Pooling Operation
Performs a 1D max pooling operation on input data by computing the maximum value over sliding windows. The operation supports configurable kernel size, stride, padding, dilation, and an option to return indices.
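For the given parameters, the output length follows the standard pooling formula used in the code below: output_length = (input_length + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1, with integer (floor) division.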
Step 3 Details
Reasoning
We can further optimize the custom MaxPool1d CUDA kernel by reducing branch overhead inside the pooling loop. In the previous version, each iteration over the pooling window performed a bounds check with an if statement. Since the window size (kernel_size) is typically small, even a small per-iteration cost adds up. Instead, we precompute the valid range of k values (the iteration variable over the pooling window) so that the loop only visits in-bounds indices and needs no per-iteration bounds check. In particular, we compute:
• k_start, the smallest k such that start + k*dilation ≥ 0: if start < 0, k_start = ceil(-start/dilation) = (-start + dilation - 1) / dilation; otherwise k_start = 0.
• k_end, the number of valid k values: floor((input_length - 1 - start) / dilation) + 1, capped at kernel_size.
We then loop from k_start to k_end (exclusive) and update the maximum with fmaxf, loading through __ldg for read-only cached access. These changes remove the per-iteration branch on input bounds and should yield a further speedup over the native implementation. Before finalizing, we ask: have these improvements genuinely helped? Yes; by reducing branch overhead and leveraging precomputation, we obtain a leaner kernel while preserving correctness.
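To make the arithmetic concrete, here is a minimal host-side sketch of the window-limit computation; the sample values (input_length = 10, kernel_size = 3, stride = 2, padding = 1, dilation = 1) are illustrative and not part of the kernel.
#include <algorithm>
#include <cstdio>
int main() {
    // Illustrative parameters (assumed sample values, not taken from any launch).
    int input_length = 10, kernel_size = 3, stride = 2, padding = 1, dilation = 1;
    int out_index = 0;                          // first output element
    int start = out_index * stride - padding;   // = -1, window starts left of the input
    int k_start = (start < 0) ? (-start + dilation - 1) / dilation : 0;              // = 1
    int k_end = std::min(((input_length - 1 - start) / dilation) + 1, kernel_size);  // = 3
    // The pooling loop visits k = 1, 2 -> input positions 0 and 1; position -1 is skipped without a branch.
    std::printf("k_start = %d, k_end = %d\n", k_start, k_end);
    return 0;
}
The kernel below performs this computation per thread, fused with the max reduction over the window.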
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>
#include <float.h>
// Optimized CUDA kernel for 1D Max Pooling.
// Input shape: [batch, channels, input_length].
// Each thread computes one output element using precomputed valid window limits.
__global__ void maxpool1d_kernel(const float* __restrict__ input,
                                 float* __restrict__ output,
                                 int batch,
                                 int channels,
                                 int input_length,
                                 int output_length,
                                 int kernel_size,
                                 int stride,
                                 int padding,
                                 int dilation) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int total = batch * channels * output_length;
    if (index < total) {
        // Decompose the flat index into (batch, channel, output position).
        int out_index = index % output_length;
        int tmp = index / output_length;
        int channel = tmp % channels;
        int b = tmp / channels;
        // Starting position in the input corresponding to this pooling window.
        int start = out_index * stride - padding;
        // Precompute the base offset for the (b, channel) row of the input.
        int base = b * channels * input_length + channel * input_length;
        // Precompute valid k indices to avoid per-iteration bounds checks.
        int k_start = 0;
        if (start < 0) {
            k_start = (-start + dilation - 1) / dilation;  // ceil(-start / dilation)
        }
        int potential_k_end = ((input_length - 1 - start) / dilation) + 1;  // valid k count if uncapped
        int k_end = (potential_k_end < kernel_size) ? potential_k_end : kernel_size;
        // Initialize the running maximum.
        float max_val = -FLT_MAX;
        #pragma unroll
        for (int k = k_start; k < k_end; k++) {
            int pos = start + k * dilation;
            // No bounds check needed: k runs only over valid indices.
            float val = __ldg(&input[base + pos]);
            max_val = fmaxf(max_val, val);
        }
        output[index] = max_val;
    }
}
torch::Tensor max_pool1d_forward(torch::Tensor input,
                                 int kernel_size,
                                 int stride,
                                 int padding,
                                 int dilation,
                                 bool return_indices) {
    if (return_indices) {
        throw std::runtime_error("Custom CUDA MaxPool1d does not support return_indices=True.");
    }
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.dim() == 3, "input must have shape [batch, channels, input_length]");
    TORCH_CHECK(input.scalar_type() == torch::kFloat32, "input must be float32");
    // The kernel indexes the input as a dense row-major buffer.
    input = input.contiguous();
    // Ensure the kernel is launched on the input's CUDA device.
    c10::cuda::CUDAGuard device_guard(input.device());
    // Input shape: [batch, channels, input_length].
    int batch = input.size(0);
    int channels = input.size(1);
    int input_length = input.size(2);
    // Compute the output length using the pooling formula.
    int output_length = (input_length + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1;
    // Allocate the output tensor.
    auto output = torch::empty({batch, channels, output_length}, input.options());
    int total = batch * channels * output_length;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    maxpool1d_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
        input.data_ptr<float>(),
        output.data_ptr<float>(),
        batch,
        channels,
        input_length,
        output_length,
        kernel_size,
        stride,
        padding,
        dilation
    );
    return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("max_pool1d", &max_pool1d_forward, "Optimized Custom CUDA MaxPool1d forward");
}
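A quick way to sanity-check the extension is to compare it against ATen's reference operator from C++. This is a minimal sketch assuming the translation unit above; the tensor shape and pooling parameters are illustrative.
// Hypothetical sanity check (not part of the extension): compares the custom
// kernel against torch::max_pool1d on random CUDA data.
void check_against_reference() {
    torch::Tensor x = torch::randn({4, 8, 128}, torch::TensorOptions().device(torch::kCUDA));
    torch::Tensor ref = torch::max_pool1d(x, /*kernel_size=*/{3}, /*stride=*/{2},
                                          /*padding=*/{1}, /*dilation=*/{1}, /*ceil_mode=*/false);
    torch::Tensor out = max_pool1d_forward(x, 3, 2, 1, 1, /*return_indices=*/false);
    TORCH_CHECK(torch::allclose(out, ref), "custom MaxPool1d does not match the reference");
}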