Evolutionary KernelBench Solutions (o3-mini)

Optimized CUDA kernels for a variety of tasks from KernelBench.

Level 1

70 tasks with speedups
  • 1. Square Matrix Multiplication

    Calculates the product of two square matrices of equal dimensions (N x N), producing a new N x N matrix as the result (see the kernel sketch after this list).

  • 2. Standard Matrix Multiplication

    Calculates the product of an input matrix A of shape (M, K) and a matrix B of shape (K, N), returning an output matrix of shape (M, N).

  • 4. Matrix-Vector Multiplication

    Performs matrix-vector multiplication of an (M, K) matrix with a (K, 1) vector, producing an (M, 1) output vector.

  • And 67 more problems...
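
These Level 1 tasks are dense linear-algebra primitives such as the matrix products above. As a point of reference, the sketch below shows the general shape of a tiled square-matrix-multiplication kernel; the kernel name, tile width, and launch configuration are illustrative assumptions, not the tuned solution recorded for task 1.

    // Tiled square matrix multiplication: C = A * B for row-major N x N matrices.
    // Illustrative baseline pattern only, not the benchmark's optimized kernel.
    #include <cuda_runtime.h>

    #define TILE 16  // illustrative tile width

    __global__ void matmul_square(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // Stage one TILE x TILE block of A and B in shared memory per step over K.
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < N && col < N)
            C[row * N + col] = acc;
    }

    // Host-side launch (sketch):
    //   dim3 block(TILE, TILE);
    //   dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    //   matmul_square<<<grid, block>>>(dA, dB, dC, N);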

Level 2

96 tasks with speedups
  • 1. 2D Convolution with ReLU and Bias Addition

    Calculates a 2D convolution over input data, applies a ReLU activation to the result, and then adds a bias term via element-wise addition (see the fused-kernel sketch after this list).

  • 2. Transposed Convolution with Bias, Clamping, and Scaling

    Performs a transposed convolution on 2D input data, then adds a bias. The result is clamped within a fixed range, scaled by a specified factor, clamped again, and finally normalized by dividing by the same scaling factor.

  • 3. 3D Transposed Convolution with Sum, LayerNorm, AvgPool, and GELU

    Performs a 3D transposed convolution to upsample the input using customizable kernel, stride, padding, and output padding; adds a learnable scalar weight to the convolutional output; applies layer normalization to standardize the activations; reduces spatial dimensions via 3D average pooling; and finally transforms the data using the Gaussian Error Linear Unit (GELU) activation.

  • And 93 more problems...
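
Many Level 2 tasks fuse a convolution (or transposed convolution) with the pointwise steps that follow it, so the intermediate tensor never makes an extra round trip through global memory. The sketch below illustrates that fusion for task 1 (convolution, then ReLU, then an extra bias add); the NCHW/OIHW layouts, stride-1 bias-free convolution with no padding, and one-thread-per-output mapping are simplifying assumptions, not the shipped kernel.

    // Fused kernel sketch: out = relu(conv2d(x, w)) + bias.
    // Assumptions: NCHW input, OIHW weights, square K x K filter, stride 1, no padding,
    // per-output-channel bias added after the ReLU, one thread per output element.
    #include <cuda_runtime.h>

    __global__ void conv2d_relu_bias(
        const float* __restrict__ x,     // (N, Cin, H, W)
        const float* __restrict__ w,     // (Cout, Cin, K, K)
        const float* __restrict__ bias,  // (Cout,)
        float* __restrict__ out,         // (N, Cout, Hout, Wout)
        int N, int Cin, int H, int W, int Cout, int K)
    {
        int Hout = H - K + 1;
        int Wout = W - K + 1;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int total = N * Cout * Hout * Wout;
        if (idx >= total) return;

        // Decompose the flat index into (n, co, ho, wo).
        int wo = idx % Wout;
        int ho = (idx / Wout) % Hout;
        int co = (idx / (Wout * Hout)) % Cout;
        int n  = idx / (Wout * Hout * Cout);

        float acc = 0.0f;
        for (int ci = 0; ci < Cin; ++ci)
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw) {
                    float xv = x[((n * Cin + ci) * H + ho + kh) * W + (wo + kw)];
                    float wv = w[((co * Cin + ci) * K + kh) * K + kw];
                    acc += xv * wv;
                }

        // Fused pointwise tail: ReLU first, then the extra bias term.
        acc = fmaxf(acc, 0.0f);
        out[idx] = acc + bias[co];
    }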

Level 3

39 tasks with speedups
  • 1. Multi-Layer Perceptron Computation

    Calculates the output of a multi-layer perceptron by sequentially applying linear transformations and ReLU activations to an input tensor, transforming it from a specified input size through hidden layers to a designated output size (see the per-layer kernel sketch after this list).

  • 2. Shallow Wide MLP Computation

    Calculates the output of a shallow feed-forward network by sequentially applying dense linear transformations interleaved with ReLU activations. The operation transforms an input tensor through multiple layers, each performing a matrix multiplication and bias addition, and finally maps the last hidden layer to the output.

  • 3. Deep Narrow MLP Computation

    Calculates a forward pass through a multi-layer perceptron that successively applies linear transformations and ReLU activations. The operation transforms a high-dimensional input into a lower-dimensional output via a chain of deep, narrow hidden layers.

  • And 36 more problems...
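
The MLP tasks above all reduce to the same building block: a dense layer (matrix multiply plus bias), optionally followed by ReLU, applied once per layer. A minimal per-layer kernel and the host-side loop that chains the layers are sketched below; the nn.Linear-style (out_features, in_features) weight layout and the one-thread-per-output-feature mapping are illustrative assumptions.

    // One fused linear + optional ReLU layer:
    //   out[b][o] = act(sum_i in[b][i] * W[o][i] + bias[o])
    #include <cuda_runtime.h>

    __global__ void linear_relu(
        const float* __restrict__ in,    // (batch, in_features)
        const float* __restrict__ W,     // (out_features, in_features), row-major
        const float* __restrict__ bias,  // (out_features,)
        float* __restrict__ out,         // (batch, out_features)
        int batch, int in_features, int out_features, bool apply_relu)
    {
        int o = blockIdx.x * blockDim.x + threadIdx.x;  // output feature
        int b = blockIdx.y;                             // batch row
        if (o >= out_features || b >= batch) return;

        float acc = bias[o];
        for (int i = 0; i < in_features; ++i)
            acc += in[b * in_features + i] * W[o * in_features + i];

        out[b * out_features + o] = apply_relu ? fmaxf(acc, 0.0f) : acc;
    }

    // Host-side sketch: launch the kernel once per layer, with ReLU everywhere
    // except the final layer, ping-ponging between two activation buffers.
    //   for (int l = 0; l < num_layers; ++l) {
    //       dim3 block(256);
    //       dim3 grid((dims[l + 1] + 255) / 256, batch);
    //       linear_relu<<<grid, block>>>(buf_in, W[l], b[l], buf_out,
    //                                    batch, dims[l], dims[l + 1], l < num_layers - 1);
    //       swap(buf_in, buf_out);
    //   }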

Level 4

3 tasks with speedups
  • 6. Causal Transformer Logit Computation

    Calculates logits for a batch of token sequences by performing a forward pass through a pretrained transformer configured for causal language modeling. The computation uses a random input sequence of 1023 tokens at batch size one and outputs the resulting prediction scores.

  • 9. BigBird-Roberta Logits Computation

    Calculates output logits by performing a forward pass using a pre-trained BigBird-Roberta configuration on a batch of 32 randomly generated token sequences, each of length 256.

  • 11. Electra Small Discriminator Logits Computation

    Calculates output logits by processing a sequence of token IDs through a transformer-based language model. The operation initializes the model using a pre-trained configuration and performs a forward pass on a randomly generated input sequence to produce prediction scores.