3.19. MobileNetV1 Forward Computation
Computes image features through a sequence of operations: an initial standard convolution, depthwise separable convolutions with batch normalization and ReLU activations, spatial downsampling via strided convolutions, global average pooling, and a final fully connected layer that produces the classification scores.
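For reference, one depthwise separable block of the kind described above can be sketched in PyTorch as follows. This is a minimal illustration, not the benchmarked module; the class name and hyperparameters are assumptions.

import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Depthwise 3x3 conv + BN + ReLU, then pointwise 1x1 conv + BN + ReLU (sketch)."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                                   padding=1, groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))   # depthwise stage
        x = self.relu(self.bn2(self.pointwise(x)))   # pointwise stage
        return x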
Step 3 Details
Reasoning
We observed that our fused kernel for the MobileNetV1 classifier (average pooling followed by the fully connected layer) can be further optimized by vectorizing the dot-product calculation when the number of channels C is divisible by 4. In the final MobileNetV1 feature map, C is typically 1024, which satisfies this condition. Therefore, in phase 2 of the kernel we use a float4-based loop to compute the dot product between the pooled features (stored in shared memory) and the corresponding FC weight row. This replaces a scalar loop over 1024 elements and reduces the iteration count by a factor of 4; the bias is then added to produce each output. In phase 1, each thread computes the average over the spatial dimensions for its assigned channels. This stage is lightweight because the spatial size H×W is small (typically 7×7 = 49), so a plain scalar loop is acceptable. These changes should reduce the kernel's overall runtime relative to the previous version. The updated Python module and CUDA kernel follow.
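As a reference for correctness, the fused pooling-plus-FC operation is equivalent to the following PyTorch computation (a sketch that only restates the math, not the shipped module):

import torch
import torch.nn.functional as F

def avgpool_fc_reference(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # x: [batch, C, H, W], weight: [num_classes, C], bias: [num_classes]
    pooled = x.mean(dim=(2, 3))             # [batch, C] -- global average pooling
    return F.linear(pooled, weight, bias)   # [batch, num_classes] -- pooled @ weight.T + bias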
#include <torch/extension.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include <cuda_runtime.h>
// Fused kernel that combines average pooling and fully connected (FC) computation.
// Input x has shape: [batch, C, H, W].
// FC weight has shape: [num_classes, C] and bias has shape: [num_classes].
// For each sample b and each class j, we compute:
//
// P[b, c] = (Σ_{i=0}^{H*W-1} x[b, c, i]) / (H*W)
// output[b, j] = bias[j] + Σ_{c=0}^{C-1} P[b, c] * weight[j, c]
//
// Each block processes one sample. Shared memory is used to store the pooled features.
__global__ void fused_avgpool_fc_kernel(const float* __restrict__ x,
                                        const float* __restrict__ weight,
                                        const float* __restrict__ bias,
                                        float* __restrict__ output,
                                        int C, int H, int W, int num_classes) {
    int b = blockIdx.x;                // one block per sample
    extern __shared__ float pooled[];  // shared memory for pooled features; size = C floats
    int spatial = H * W;

    // Phase 1: compute pooled features for sample b.
    for (int c = threadIdx.x; c < C; c += blockDim.x) {
        float sum = 0.0f;
        int base = b * C * spatial + c * spatial;
        for (int i = 0; i < spatial; i++) {
            sum += x[base + i];
        }
        pooled[c] = sum / spatial;
    }
    __syncthreads();

    // Phase 2: compute the FC layer for sample b.
    // If C is divisible by 4, use a vectorized (float4) dot product.
    bool use_vec = ((C & 3) == 0);
    for (int j = threadIdx.x; j < num_classes; j += blockDim.x) {
        float dot = 0.0f;
        if (use_vec) {
            int vecC = C >> 2;  // C / 4
            const float4* pooled_vec = reinterpret_cast<const float4*>(pooled);
            const float4* weight_vec = reinterpret_cast<const float4*>(weight + j * C);
            for (int k = 0; k < vecC; k++) {
                float4 p = pooled_vec[k];
                float4 w = weight_vec[k];
                dot += p.x * w.x + p.y * w.y + p.z * w.z + p.w * w.w;
            }
        } else {
            for (int c = 0; c < C; c++) {
                dot += pooled[c] * weight[j * C + c];
            }
        }
        output[b * num_classes + j] = dot + bias[j];
    }
}
torch::Tensor forward(torch::Tensor x, torch::Tensor weight, torch::Tensor bias) {
    TORCH_CHECK(x.is_cuda() && weight.is_cuda() && bias.is_cuda(), "all inputs must be CUDA tensors");
    TORCH_CHECK(x.dtype() == torch::kFloat32, "x must be float32");
    TORCH_CHECK(x.is_contiguous() && weight.is_contiguous(), "x and weight must be contiguous");

    c10::cuda::CUDAGuard device_guard(x.device());

    // x shape: [batch, C, H, W]
    int batch = x.size(0);
    int C = x.size(1);
    int H = x.size(2);
    int W = x.size(3);
    int num_classes = weight.size(0);

    auto output = torch::empty({batch, num_classes}, x.options());

    int threads = 256;
    int blocks = batch;                     // one block per sample
    size_t shared_mem = C * sizeof(float);  // pooled features for one sample

    fused_avgpool_fc_kernel<<<blocks, threads, shared_mem, c10::cuda::getCurrentCUDAStream()>>>(
        x.data_ptr<float>(),
        weight.data_ptr<float>(),
        bias.data_ptr<float>(),
        output.data_ptr<float>(),
        C, H, W, num_classes);

    return output;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &forward,
          "Fused average pooling and fully connected operator with vectorized dot product");
}
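The extension can be built and checked against a PyTorch reference roughly as follows. This is a usage sketch only; the extension name fused_avgpool_fc and the source file name are assumptions.

import torch
from torch.utils.cpp_extension import load

# Build the extension from the CUDA source above (file name is an assumption).
fused = load(name="fused_avgpool_fc", sources=["fused_avgpool_fc.cu"], verbose=True)

x = torch.randn(8, 1024, 7, 7, device="cuda")
weight = torch.randn(1000, 1024, device="cuda")
bias = torch.randn(1000, device="cuda")

out = fused.forward(x, weight, bias)
ref = torch.nn.functional.linear(x.mean(dim=(2, 3)), weight, bias)
print(torch.allclose(out, ref, atol=1e-4))  # expect True up to floating-point reduction-order differences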