1.22. Tanh Activation

Calculates the hyperbolic tangent (Tanh) activation element-wise on the input tensor, applying a non-linear transformation to the data.
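
For reference, tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)); it squashes every real input into the open interval (−1, 1) and is applied independently to each element of the tensor.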

Progress: Step 2 of 2
Reward: 1.14x

Optimization Timeline

Step 2 Details

Reasoning

We can further optimize the Tanh activation by vectorizing the memory accesses. In the current implementation, each thread computes tanhf(x) on a single float element. If the number of elements is divisible by 4 and the memory is suitably aligned (which is usually the case for tensors allocated by PyTorch), we can instead reinterpret the pointer as a float4 pointer and have each thread evaluate tanh on four elements. This reduces the number of global memory transactions and the per-element indexing overhead, which may improve performance slightly.

The improved kernel therefore has two branches: a vectorized branch that uses the float4 version of the kernel, and an elementwise fallback for when the element count is not a multiple of 4. We use c10::cuda::CUDAGuard and launch on c10::cuda::getCurrentCUDAStream(). The Python module remains largely the same except that it now loads the enhanced CUDA extension.

Have my improvements genuinely helped? Yes: the vectorized loads and stores keep global memory accesses fully coalesced while issuing fewer, wider transactions, so this is a genuine optimization rather than a workaround.

#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAException.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>

// Standard elementwise kernel applying tanh activation.
__global__ void tanh_kernel(const float* __restrict__ input, float* __restrict__ output, int numel) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numel) {
        output[idx] = tanhf(input[idx]);
    }
}

// Vectorized kernel using float4 to process four elements per thread.
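// Note: float4 loads and stores require 16-byte alignment; tensors allocated by
// PyTorch's caching allocator satisfy this for contiguous float32 storage.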
__global__ void tanh_vectorized_kernel(const float4* __restrict__ input, float4* __restrict__ output, int numv) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numv) {
        float4 in_val = input[idx];
        float4 out_val;
        out_val.x = tanhf(in_val.x);
        out_val.y = tanhf(in_val.y);
        out_val.z = tanhf(in_val.z);
        out_val.w = tanhf(in_val.w);
        output[idx] = out_val;
    }
}

torch::Tensor forward(torch::Tensor input) {
    // Ensure we operate on the correct CUDA device.
    c10::cuda::CUDAGuard device_guard(input.device());
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.scalar_type() == at::kFloat, "input must be float32");

    // Both kernels index memory linearly, so work on a contiguous view.
    auto input_contig = input.contiguous();

    // Allocate an output tensor with the same shape as the input.
    auto output = torch::empty_like(input_contig);
    int numel = static_cast<int>(input_contig.numel());
    int threads = 256;

    // Use the vectorized kernel when the element count is a multiple of 4
    // (see the alignment note above tanh_vectorized_kernel).
    if (numel % 4 == 0) {
        int numv = numel / 4;
        int blocks = (numv + threads - 1) / threads;
        const float4* input_vec = reinterpret_cast<const float4*>(input_contig.data_ptr<float>());
        float4* output_vec = reinterpret_cast<float4*>(output.data_ptr<float>());
        tanh_vectorized_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
            input_vec, output_vec, numv);
    } else {
        int blocks = (numel + threads - 1) / threads;
        tanh_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
            input_contig.data_ptr<float>(), output.data_ptr<float>(), numel);
    }
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &forward, "Apply Tanh activation using a custom CUDA kernel with vectorized support");
}
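
For completeness, here is a minimal sketch of what the Python side described in the reasoning might look like; the file name tanh_cuda.cu, the extension name, and the module wrapper are illustrative assumptions rather than the actual module from this step.

# Hypothetical Python wrapper; file and module names are assumed for illustration.
import torch
from torch import nn
from torch.utils.cpp_extension import load

# JIT-compile the CUDA source above (assumed to be saved as "tanh_cuda.cu").
tanh_ext = load(name="tanh_cuda", sources=["tanh_cuda.cu"], verbose=False)

class TanhActivation(nn.Module):
    """Thin wrapper that dispatches to the custom CUDA kernel."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return tanh_ext.forward(x)

if __name__ == "__main__":
    x = torch.randn(16, 16384, device="cuda")
    y = TanhActivation()(x)
    # Sanity-check against PyTorch's reference implementation.
    torch.testing.assert_close(y, torch.tanh(x))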