Evolutionary KernelBench Solutions (o3-mini)

Optimized CUDA kernels for a variety of tasks from KernelBench.

Level 1

70 tasks with speedups
  • 1. Square Matrix Multiplication

    Calculates the product of two square matrices of equal dimensions (N x N), producing a new N x N matrix as the result (see the kernel sketch after this list).

  • 2. Standard Matrix Multiplication

    Calculates the product of an input matrix A of shape (M, K) and a matrix B of shape (K, N), returning an output matrix of shape (M, N).

  • 4. Matrix-Vector Multiplication

    Performs matrix-vector multiplication of an (M, K) matrix with a (K, 1) vector, producing an (M, 1) output vector.

  • And 67 more problems...
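
These Level 1 tasks are dense linear-algebra primitives such as the matrix products above. As a point of reference, the sketch below shows the general shape of a tiled square-matrix-multiplication kernel; the kernel name, tile width, and launch configuration are illustrative assumptions, not the tuned solution recorded for task 1.

    // Tiled square matrix multiplication: C = A * B for row-major N x N matrices.
    // Illustrative baseline pattern only, not the benchmark's optimized kernel.
    #include <cuda_runtime.h>

    #define TILE 16  // illustrative tile width

    __global__ void matmul_square(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // Stage one TILE x TILE block of A and B in shared memory per step over K.
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }

        if (row < N && col < N)
            C[row * N + col] = acc;
    }

    // Host-side launch (sketch):
    //   dim3 block(TILE, TILE);
    //   dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    //   matmul_square<<<grid, block>>>(dA, dB, dC, N);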

Level 2

96 tasks with speedups
  • 1. 2D Convolution with ReLU and Bias Addition

    Calculates a 2D convolution over input data, applies a ReLU activation to the result, and then adds a bias term via element-wise addition (see the fused-kernel sketch after this list).

  • 2. Transposed Convolution with Bias, Clamping, and Scaling

    Performs a transposed convolution on 2D input data, then adds a bias. The result is clamped within a fixed range, scaled by a specified factor, clamped again, and finally normalized by dividing by the same scaling factor.

  • 3. 3D Transposed Convolution with Sum, LayerNorm, AvgPool, and GELU

    Performs a 3D transposed convolution to upsample the input using customizable kernel, stride, padding, and output padding; adds a learnable scalar weight to the convolutional output; applies layer normalization to standardize the activations; reduces spatial dimensions via 3D average pooling; and finally transforms the data using the Gaussian Error Linear Unit (GELU) activation.

  • And 93 more problems...
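
Many Level 2 tasks fuse a convolution (or transposed convolution) with the pointwise steps that follow it, so the intermediate tensor never makes an extra round trip through global memory. The sketch below illustrates that fusion for task 1 (convolution, then ReLU, then an extra bias add); the NCHW/OIHW layouts, stride-1 bias-free convolution with no padding, and one-thread-per-output mapping are simplifying assumptions, not the shipped kernel.

    // Fused kernel sketch: out = relu(conv2d(x, w)) + bias.
    // Assumptions: NCHW input, OIHW weights, square K x K filter, stride 1, no padding,
    // per-output-channel bias added after the ReLU, one thread per output element.
    #include <cuda_runtime.h>

    __global__ void conv2d_relu_bias(
        const float* __restrict__ x,     // (N, Cin, H, W)
        const float* __restrict__ w,     // (Cout, Cin, K, K)
        const float* __restrict__ bias,  // (Cout,)
        float* __restrict__ out,         // (N, Cout, Hout, Wout)
        int N, int Cin, int H, int W, int Cout, int K)
    {
        int Hout = H - K + 1;
        int Wout = W - K + 1;
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int total = N * Cout * Hout * Wout;
        if (idx >= total) return;

        // Decompose the flat index into (n, co, ho, wo).
        int wo = idx % Wout;
        int ho = (idx / Wout) % Hout;
        int co = (idx / (Wout * Hout)) % Cout;
        int n  = idx / (Wout * Hout * Cout);

        float acc = 0.0f;
        for (int ci = 0; ci < Cin; ++ci)
            for (int kh = 0; kh < K; ++kh)
                for (int kw = 0; kw < K; ++kw) {
                    float xv = x[((n * Cin + ci) * H + ho + kh) * W + (wo + kw)];
                    float wv = w[((co * Cin + ci) * K + kh) * K + kw];
                    acc += xv * wv;
                }

        // Fused pointwise tail: ReLU first, then the extra bias term.
        acc = fmaxf(acc, 0.0f);
        out[idx] = acc + bias[co];
    }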

Level 3

39 tasks with speedups
  • 1. Multi-Layer Perceptron Computation

    Calculates the output of a multi-layer perceptron by sequentially applying linear transformations and ReLU activations to an input tensor, transforming it from a specified input size through hidden layers to a designated output size (see the per-layer kernel sketch after this list).

  • 2. Shallow Wide MLP Computation

    Calculates the output of a shallow feed-forward network by sequentially applying dense linear transformations interleaved with ReLU activations. The operation transforms an input tensor through multiple layers, each performing a matrix multiplication and bias addition, and finally maps the last hidden layer to the output.

  • 3. Deep Narrow MLP Computation

    Calculates a forward pass through a multi-layer perceptron that successively applies linear transformations and ReLU activations. The operation transforms a high-dimensional input into a lower-dimensional output via a chain of deep, narrow hidden layers.

  • And 36 more problems...
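
The MLP tasks above all reduce to the same building block: a dense layer (matrix multiply plus bias), optionally followed by ReLU, applied once per layer. A minimal per-layer kernel and the host-side loop that chains the layers are sketched below; the nn.Linear-style (out_features, in_features) weight layout and the one-thread-per-output-feature mapping are illustrative assumptions.

    // One fused linear + optional ReLU layer:
    //   out[b][o] = act(sum_i in[b][i] * W[o][i] + bias[o])
    #include <cuda_runtime.h>

    __global__ void linear_relu(
        const float* __restrict__ in,    // (batch, in_features)
        const float* __restrict__ W,     // (out_features, in_features), row-major
        const float* __restrict__ bias,  // (out_features,)
        float* __restrict__ out,         // (batch, out_features)
        int batch, int in_features, int out_features, bool apply_relu)
    {
        int o = blockIdx.x * blockDim.x + threadIdx.x;  // output feature
        int b = blockIdx.y;                             // batch row
        if (o >= out_features || b >= batch) return;

        float acc = bias[o];
        for (int i = 0; i < in_features; ++i)
            acc += in[b * in_features + i] * W[o * in_features + i];

        out[b * out_features + o] = apply_relu ? fmaxf(acc, 0.0f) : acc;
    }

    // Host-side sketch: launch the kernel once per layer, with ReLU everywhere
    // except the final layer, ping-ponging between two activation buffers.
    //   for (int l = 0; l < num_layers; ++l) {
    //       dim3 block(256);
    //       dim3 grid((dims[l + 1] + 255) / 256, batch);
    //       linear_relu<<<grid, block>>>(buf_in, W[l], b[l], buf_out,
    //                                    batch, dims[l], dims[l + 1], l < num_layers - 1);
    //       swap(buf_in, buf_out);
    //   }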

Level 4

3 tasks with speedups
  • 6. Causal Transformer Logit Computation

    Calculates logits for a batch of token sequences by performing a forward pass through a pretrained transformer configured for causal language modeling. The computation uses a random input sequence of 1023 tokens at batch size one and outputs the resulting prediction scores.

  • 9. BigBird-Roberta Logits Computation

    Calculates output logits by performing a forward pass using a pre-trained BigBird-Roberta configuration on a batch of 32 randomly generated token sequences, each of length 256.

  • 11. Electra Small Discriminator Logits Computation

    Calculates output logits by processing a sequence of token IDs through a transformer-based language model. The operation initializes the model using a pre-trained configuration and performs a forward pass on a randomly generated input sequence to produce prediction scores.