2.5. Transposed Convolution with Bias Subtraction and Tanh Activation
Performs a transposed convolution on input data with configurable upsampling parameters (stride, padding, output padding), subtracts a learnable bias tensor from the convolution result, and applies the hyperbolic tangent activation function.
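For reference, the unfused computation can be written directly in PyTorch. The sketch below is illustrative only: the shapes and hyperparameters (in_ch, out_ch, kernel size, stride, padding, output_padding) are assumptions, not the benchmark's exact configuration.
import torch
import torch.nn as nn

class RefModel(nn.Module):
    """Unfused reference: transposed convolution, per-channel bias subtraction, tanh."""
    def __init__(self, in_ch=32, out_ch=16, k=4, stride=2, padding=1, output_padding=0):
        super().__init__()
        self.conv_transpose = nn.ConvTranspose2d(
            in_ch, out_ch, k, stride=stride, padding=padding,
            output_padding=output_padding)
        # Learnable bias of shape [C, 1, 1], broadcast over H and W.
        self.bias = nn.Parameter(torch.randn(out_ch, 1, 1) * 0.02)

    def forward(self, x):
        x = self.conv_transpose(x)   # [N, C, H, W]
        x = x - self.bias            # separate elementwise kernel
        return torch.tanh(x)         # another elementwise kernel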
Optimization Timeline
Step 1 Details
Reasoning
We want to speed up a model that performs a transposed convolution, subtracts a bias, and applies a tanh activation. Instead of letting the subtraction and the tanh run as separate elementwise PyTorch kernels, we fuse them into one custom CUDA kernel. The ConvTranspose2d operator itself is already highly optimized (typically via cuDNN), so we leave it as is.
The fused CUDA operator takes the output of the transposed convolution (shape [N, C, H, W]), subtracts a per-channel bias (provided with shape [C, 1, 1], which can be viewed as a vector of length C), and applies tanhf elementwise. The kernel launches one thread per element and computes y = tanhf(x - bias[c]), where the channel index c is recovered from the usual 1D indexing over the contiguous tensor. We use c10::cuda::CUDAGuard and c10::cuda::getCurrentCUDAStream() so the kernel runs on the correct device and stream. On the Python side, the extension is compiled with the -O3 and --use_fast_math flags; in the CUDAModel's forward method, we first apply conv_transpose to the input and then pass the result, together with the bias, to the fused CUDA function.
This fusion removes one intermediate tensor read/write and one kernel launch, so it reduces memory round trips and launch overhead. Before finishing, I ask: have my improvements genuinely helped? Yes: fusing the pointwise operations into a single kernel improves memory throughput without resorting to a "cheat" approach.
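The write-up describes but does not show the Python side. Below is a minimal sketch of it, assuming the CUDA source listed afterwards is saved to a hypothetical file fused_bias_tanh.cu and that the ConvTranspose2d constructor arguments are placeholders for the benchmark's actual hyperparameters.
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load

# Build the extension from the CUDA source listed below.
# "fused_bias_tanh.cu" is a hypothetical filename; the flags mirror the ones mentioned above.
fused = load(
    name="fused_bias_tanh_ext",
    sources=["fused_bias_tanh.cu"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["-O3", "--use_fast_math"],
)

class CUDAModel(nn.Module):
    # Placeholder hyperparameters; the benchmark supplies the real values.
    def __init__(self, in_ch, out_ch, k, stride, padding, output_padding):
        super().__init__()
        self.conv_transpose = nn.ConvTranspose2d(
            in_ch, out_ch, k, stride=stride, padding=padding,
            output_padding=output_padding)
        # Learnable per-channel bias, broadcast over H and W.
        self.bias = nn.Parameter(torch.randn(out_ch, 1, 1) * 0.02)

    def forward(self, x):
        y = self.conv_transpose(x)  # cuDNN-backed transposed convolution
        # The extension expects a contiguous float32 tensor and a 1D bias of length C.
        return fused.fused_bias_tanh(y.contiguous(), self.bias.view(-1))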
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDAException.h>
#include <cuda_runtime.h>
#include <math.h>
// Kernel that fuses bias subtraction and tanh activation.
// Input: input tensor of shape [N, C, H, W] (contiguous)
// Bias: bias vector of length C (each bias value is broadcast over H x W)
// Output: output tensor of the same shape.
__global__ void fused_bias_tanh_kernel(const float* __restrict__ input,
                                       const float* __restrict__ bias,
                                       float* __restrict__ output,
                                       int N, int C, int H, int W) {
    int total = N * C * H * W;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < total) {
        // Recover the channel index from the flat 1D index over the contiguous
        // [N, C, H, W] tensor (width is the fastest-changing dimension).
        int c = (idx / (H * W)) % C;
        // Subtract the corresponding bias element and apply tanh.
        float val = input[idx] - __ldg(&bias[c]);
        output[idx] = tanhf(val);
    }
}
// Fused forward function called from Python.
// It takes the input tensor [N, C, H, W] and a bias vector of length C.
// It returns a new tensor where each element is tanhf(input - bias[channel]).
torch::Tensor fused_bias_tanh(torch::Tensor input, torch::Tensor bias) {
    // Make sure we operate on the input's device and its current stream.
    c10::cuda::CUDAGuard device_guard(input.device());
    // Validate the layout assumptions the kernel relies on.
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.dim() == 4, "input must be a 4D tensor [N, C, H, W]");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
    TORCH_CHECK(input.scalar_type() == torch::kFloat32, "input must be float32");
    TORCH_CHECK(bias.dim() == 1, "bias must be a 1D tensor of length C");
    TORCH_CHECK(bias.size(0) == input.size(1), "bias length must match the channel dimension");
    int N = input.size(0);
    int C = input.size(1);
    int H = input.size(2);
    int W = input.size(3);
    auto output = torch::empty_like(input);
    int total = N * C * H * W;
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    // Launch the fused kernel on the current CUDA stream.
    fused_bias_tanh_kernel<<<blocks, threads, 0, c10::cuda::getCurrentCUDAStream()>>>(
        input.data_ptr<float>(),
        bias.data_ptr<float>(),
        output.data_ptr<float>(),
        N, C, H, W);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("fused_bias_tanh", &fused_bias_tanh, "Fused bias subtraction and tanh activation");
}
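As a quick sanity check (assuming the extension was built as fused and the module defined as CUDAModel in the sketch above, with the same placeholder hyperparameters), the fused result can be compared against the unfused PyTorch expression:
model = CUDAModel(32, 16, 4, stride=2, padding=1, output_padding=0).cuda()
x = torch.randn(2, 32, 8, 8, device="cuda")
with torch.no_grad():
    y_conv = model.conv_transpose(x)
    y_fused = fused.fused_bias_tanh(y_conv.contiguous(), model.bias.view(-1))
    y_ref = torch.tanh(y_conv - model.bias)
# Expect True, up to small rounding differences from --use_fast_math.
print(torch.allclose(y_fused, y_ref, atol=1e-5))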