2.20. 3D Transposed Convolution with Residual Arithmetic

Performs a 3D transposed convolution on an input tensor, then applies a chain of element-wise operations involving a learned per-channel bias: the convolution output is summed with the bias and a residual copy of itself, the result is multiplied element-wise by the residual, and the residual copy is added once more.
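
For reference, here is a minimal sketch of that element-wise chain for a single convolution output value y and per-channel bias b (the helper name fused_elementwise is illustrative, not part of the extension below). The algebraic simplification is what lets the kernels fuse the whole chain into a single expression.

// Unfused element-wise chain, written out for one value:
//   t   = y + b + y;   // convolution output + bias + residual copy
//   t   = t * y;       // multiply element-wise by the residual
//   out = t + y;       // add the residual copy once more
// which simplifies to out = 2*y*y + y*(b + 1), the expression both fused
// kernels below evaluate directly.
__device__ __forceinline__ float fused_elementwise(float y, float b) {
    return 2.0f * y * y + y * (b + 1.0f);
}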

Progress: Step 3 of 3
Reward: 2.32x

Optimization Timeline

Step 3 Details

Reasoning

Our current fused kernel is already well optimized: it uses vectorized float4 loads when possible and falls back to scalar processing otherwise. We can push the vectorized path further by having each thread process multiple float4 elements when the workload is large enough, which reduces the number of blocks launched and the per-element index arithmetic, and gives each thread more independent memory operations to help hide latency. In many deep learning applications the spatial dimension is large, so letting each thread handle several float4 elements via a grid-stride loop should improve throughput. Concretely, we add an inner loop to the vectorized kernel controlled by a fixed load factor (VEC_LOAD_FACTOR), so each thread processes up to that many float4 elements per grid-stride iteration, and we adjust the host dispatch so the launch geometry accounts for the load factor; the scalar kernel remains the fallback when the spatial size is not divisible by 4. The Python module does not need changes, because it simply passes the tensor dimensions and the dispatch will use the revised vectorized kernel when possible. This genuinely helps by amortizing indexing and loop overhead over more elements per thread when the number of vectorized elements is high, and it is not "cheating" because the kernel still performs exactly the same fused element-wise operations.
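
To make the launch arithmetic concrete, here is a small sketch (assuming the defaults used below, threads = 256 and VEC_LOAD_FACTOR = 2; variable names are illustrative): each block now covers threads * VEC_LOAD_FACTOR float4 elements, so the grid shrinks by the load factor compared with a one-float4-per-thread launch.

// Illustrative launch-geometry arithmetic for the vectorized path.
int total_vec       = total / 4;                    // number of float4 elements
int elems_per_block = threads * VEC_LOAD_FACTOR;    // 512 float4 per block with the defaults
int blocks          = (total_vec + elems_per_block - 1) / elems_per_block;
// Each thread then walks a grid-stride loop, touching up to VEC_LOAD_FACTOR
// float4 elements per outer iteration.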

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda.h>
#include <cuda_runtime.h>

// Vectorization load factor: each thread processes up to VEC_LOAD_FACTOR float4
// elements per grid-stride iteration when the vectorized kernel is used.
#ifndef VEC_LOAD_FACTOR
#define VEC_LOAD_FACTOR 2  // Each thread processes 2 float4 elements in a loop.
#endif

// Scalar kernel: processes one element at a time.
__global__ void fused_kernel(const float* __restrict__ input,
                             const float* __restrict__ bias,
                             float* __restrict__ output,
                             int total, int C, int spatial) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < total) {
        int channel = (idx / spatial) % C;
        float y = input[idx];
        // (y + b + y) * y + y simplifies to 2*y*y + y*(b + 1).
        output[idx] = 2.0f * y * y + y * (bias[channel] + 1.0f);
    }
}

// Enhanced vectorized kernel using float4 with an inner loop to process multiple elements per thread.
__global__ void fused_vec_kernel(const float* __restrict__ input,
                                 const float* __restrict__ bias,
                                 float* __restrict__ output,
                                 int total_vec, int C, int vec_per_channel) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Grid-stride loop: each thread processes up to VEC_LOAD_FACTOR float4 elements per iteration.
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < total_vec; i += stride * VEC_LOAD_FACTOR) {
        #pragma unroll
        for (int l = 0; l < VEC_LOAD_FACTOR; l++) {
            int index = i + l * stride;
            if (index < total_vec) {
                // Each channel contributes vec_per_channel float4 elements.
                int channel = (index / vec_per_channel) % C;
                float4 in_vec = ((const float4*)input)[index];
                float b_val = __ldg(&bias[channel]);
                float4 out_vec;
                out_vec.x = 2.0f * in_vec.x * in_vec.x + in_vec.x * (b_val + 1.0f);
                out_vec.y = 2.0f * in_vec.y * in_vec.y + in_vec.y * (b_val + 1.0f);
                out_vec.z = 2.0f * in_vec.z * in_vec.z + in_vec.z * (b_val + 1.0f);
                out_vec.w = 2.0f * in_vec.w * in_vec.w + in_vec.w * (b_val + 1.0f);
                ((float4*)output)[index] = out_vec;
            }
        }
    }
}

torch::Tensor forward(torch::Tensor input, torch::Tensor bias, int total, int C, int spatial) {
    c10::cuda::CUDAGuard device_guard(input.device());
    auto output = torch::empty_like(input);
    int threads = 256;
    
    // If spatial is divisible by 4, use vectorized kernel.
    if (spatial % 4 == 0) {
        int total_vec = total / 4;          // Number of float4 elements.
        int vec_per_channel = spatial / 4;    // Number of float4 per channel.
        // Size the grid so each thread covers up to VEC_LOAD_FACTOR float4 elements.
        int elems_per_block = threads * VEC_LOAD_FACTOR;
        int blocks = (total_vec + elems_per_block - 1) / elems_per_block;
        fused_vec_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
            input.data_ptr<float>(),
            bias.data_ptr<float>(),
            output.data_ptr<float>(),
            total_vec, C, vec_per_channel
        );
    } else {
        int blocks = (total + threads - 1) / threads;
        fused_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
            input.data_ptr<float>(),
            bias.data_ptr<float>(),
            output.data_ptr<float>(),
            total, C, spatial
        );
    }
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &forward, "Fused element-wise operations after ConvTranspose3d with enhanced vectorization");
}