GPU MODE -- Past Competition History & Winner Analysis

Research Report

Last updated: 2026-03-19
Sources: Web searches across X/Twitter, NVIDIA blogs, SemiAnalysis, Accel, personal blogs, HuggingFace, LinkedIn, AMD developer pages, academic papers (KernelBot/OpenReview)

Community Background

GPU MODE (formerly CUDA MODE) is an open-source GPU programming community founded by Mark Saroufim (PyTorch maintainer at Meta) and Andreas Kopf. As of early 2026:

~26K Discord members
~26K YouTube subscribers
92 lecture recordings
400K+ KernelBot submissions across all competitions
3 events with major sponsors in 2025 alone (NVIDIA, Jane Street, Accel)
10 active working groups

The community's philosophy centers on making GPU programming accessible. Competitions serve as both learning tools and performance benchmarks. Mark Saroufim noted with surprise that "one of the top submitters in the NVFP4 competition has never hand-written GPU code before."

Competition History (Chronological)

1. CUDA MODE IRL #1 -- "The First IRL" (September 2024, San Francisco)

Field	Detail
Date	Last weekend of September 2024
Location	San Francisco (hosted by Accel)
Collaborators	Accel, PyTorch, NVIDIA
Duration	14 hours (10 AM - midnight)
Participants	~200 developers from across the globe
Teams	40+ submitted projects
Hardware	NVIDIA GPUs (generation not specified; RTX 4080 prizes)
Prize Pool	~$40K in compute credits + 3x RTX 4080 GPUs signed by Jensen Huang
Sponsors	Anyscale, Fal.ai, Lambda Labs, Modal Labs, Nebius.ai, Oracle, Prime Intellect, Together.ai
Format	In-person hackathon, open-ended project submissions

Winners (top 3 projects):

Flexible attention masks in CUTLASS -- extending CUTLASS with configurable attention mask support
NCCL rewrite using Triton -- reimplementing collective communication primitives in Triton
PyTorch binaries without libtorch -- stripping the C++ runtime dependency from PyTorch

Keynote Speakers:

Tri Dao (Together.ai): Flash Attention 3, warp group MMA, Tensor Memory Accelerator
Andrej Karpathy (Eureka Labs): llm.c and reference architectures
Supriya Rao (PyTorch): Quantization and sparsity via torchao
Lily Liu (vLLM): High-performance LLM inference and speculative decoding
Tim Dettmers: Open source vs closed source GPU programming
Wen-mei Hwu (NVIDIA): Selecting challenging long-term research problems

Key observation: This was the first in-person event. The format was open-ended (build anything GPU-related) rather than a kernel speed competition, and judging was project-based, not leaderboard-based. The winning projects were infrastructure/framework contributions, not the fastest kernels.

2. GPU MODE Kernel Leaderboard Launch (February 2025)

The permanent online leaderboard launched in early 2025, hosting company-sponsored kernel speed competitions. This shifted GPU MODE from project-based hackathons to pure performance competitions.

Platform: KernelBot (academic paper published at ICML 2025 workshop)

Open-source competition platform
Submissions via Discord bot or Popcorn CLI
Automated correctness verification + benchmarking
All submissions open-sourced under permissive license at competition end
Results benchmarked across multiple GPU architectures

Initial problem set: PMPP (Programming Massively Parallel Processors) textbook problems

PrefixSum, VectorAdd, Histogram, Sort, Grayscale
Benchmarked on: NVIDIA B200, H100, A100, L4
Winner: NVIDIA CCCL team (username "Nader") using cuda.compute library
Achieved most first-place finishes across all GPU architectures
Sort implementation was 2-4x faster than the next best submission
Used JIT-compiled CUB primitives with link-time optimization

3. AMD DeepSeek Kernels Competition -- Round 1 ($100K, ~April 2025)

Field	Detail
Prize Pool	$100,000 cash
Hardware	AMD MI300X (provided free via GPU MODE Discord)
Theme	LLM inference kernels for DeepSeek-V3 on AMD
Format	Online, leaderboard-based via KernelBot
Submissions	~40K kernels submitted

Problems:

FP8 Block-wise GEMM
Single-GPU Mixture of Experts (MoE)
Multi-head Latent Attention (MLA)

Key result: The top submission for FP8 Matmul on AMD MI300 was faster than the optimized AMD in-house kernel on some matrix shapes used directly in DeepSeek-V3. This demonstrated that community-driven optimization can exceed vendor kernels.

Winner details: Specific winners not publicly announced in detail at time of research. AMD selected 4 winning submissions after reproduction and verification. Select participants invited to AMD DevDay 2025 during Open-Source AI Week.

4. AMD Distributed Kernels Competition -- Round 2 ($100K, ~2025)

Field	Detail
Prize Pool	$100,000 cash
Hardware	AMD Instinct GPUs (multi-GPU)
Theme	Multi-GPU communication kernels for LLM training
Format	Online, leaderboard-based via KernelBot

Problems (weighted scoring):

All-to-All kernel (1500 points max)
GEMM + ReduceScatter (1250 points max)
AllGather + GEMM (1000 points max)

Scoring: Top 10 performant kernels considered per problem. 4 winners selected by AMD after reproduction/verification. Leaderboard placement not a guarantee of winning.

5. TriMul / AlphaFold3 Competition (~2025)

Field	Detail
Prize Pool	Merch (no cash)
Hardware	NVIDIA H100, A100, B200; AMD MI300X
Theme	Triangular Multiplicative Update from AlphaFold3
Format	Online, leaderboard-based

Key techniques by winners:

Fusing elementwise operations (input LayerNorm, sigmoid gating, output LayerNorm+gating)
Converting inputs to FP16 and delegating to cuBLAS/rocBLAS for TensorCore/MatrixCore utilization
Identifying heavy memory I/O from frequent elementwise operations as the bottleneck

Notable result: K-Search (automated kernel generation system) achieved state-of-the-art at 1028 microseconds on H100, surpassing both prior automated and human-designed solutions.

6. Modular x GPU MODE Mojo Hackathon (May 10, 2025, AGI House)

Field	Detail
Date	May 10, 2025
Location	AGI House, Hillsborough, CA
Participants	100+ engineers and researchers
Hardware	AMD MI300X (via Crusoe Cloud)
Sponsors	Modular, AMD, Crusoe, GPU MODE
Format	In-person hackathon

Notable winners:

Marcel Roed, Herman Brunborg, Rajat Vadiraj Dwaraknath (all PhD students at Stanford): Built a training framework in Mojo/MAX, implementing kernels and backward passes to train a Transformer model from scratch, including backpropagation and AdamW optimizer.

Speakers: Chris Lattner (Modular), Dylan Patel (SemiAnalysis), representatives from AMD and Anthropic.

7. SemiAnalysis Blackwell Hackathon / GPU MODE IRL #2 (~Late 2025)

Field	Detail
Location	Accel office, San Francisco
Hardware	NVIDIA Blackwell GB200 (provided by Nebius AI)
Format	In-person hackathon, project-based judging
Sponsors	Together.ai, Lambda Labs, Google Cloud, NVIDIA, GPU MODE, Thinking Machines, OpenAI, PyTorch, CoreWeave, Nebius

Winners:

Place	Team	Project	Result
1st	Symmetric Minds	Multi-GPU expert-parallel MoE inference in vLLM with symmetric memory pool + Triton kernel refinements	~1.5x speedup over DeepEP on GB200
2nd	Flash Hogs	Flash-HOG: custom Blackwell kernel for higher-order attention gradients	~3x speed, linear memory cost vs XLA, validated to 1M tokens
3rd	Delta Net	Context-parallel Gated DeltaNet	Scaling to 128K+ context on single GB200 node with 8B hybrid GDN-Transformer
4th	KernelEvolve	LLM-driven search + RL to auto-tune Helion GPU kernels	Tuning time from hours to <30 min
Special 1st (SemiAnalysis)	Arun Demeure	Optimizing Blackwell's Split L2	100x+ reduction in cross-chip data transfer and power consumption

Speakers:

Mark Saroufim (GPU MODE): "How to Make an Impact in ML Systems"
Vijay Thakkar (NVIDIA CUTLASS): Tensor core programming
Horace He (Thinking Machines): Large-scale ML systems
Philippe Tillet (OpenAI): Triton programming framework
Tri Dao (Together.ai): Attention mechanism optimization

8. Blackwell NVFP4 Kernel Hackathon (November 2025 -- February 2026)

Field	Detail
Organizers	NVIDIA + GPU MODE
Hardware	NVIDIA B200 (Blackwell); DGX B200 compute via Sesterce
Duration	~3 months (Nov 2025 -- Feb 2026)
Format	Online, 4 sequential kernel problems, leaderboard-based
Prizes	Per problem: 1st DGX Spark + GTC 2026 pass; 2nd RTX 5090 + GTC pass; 3rd RTX 5080. Grand prize for overall fastest (weighted sum). Top 2 per problem invited to GTC awards ceremony.

Problems:

Batched GEMV (NVFP4 format)
GEMM (NVFP4 format)
Gated Dual GEMM (with SiLU activation)
Grouped GEMM

Scoring methodology: Geometric mean of benchmark times against a "speed-of-light" baseline derived from max(FFMA math throughput, DRAM memory throughput) on B200 at 1.5 GHz.
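
As a rough illustration of the scoring idea, the sketch below computes a geometric mean over per-shape benchmark times and compares it with the geometric mean of the corresponding speed-of-light times. The source does not give the exact leaderboard formula, so the function names, the ratio, and the timings are all hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def slowdown_vs_speed_of_light(measured_us, sol_us):
    """Ratio of measured to speed-of-light geometric means (1.0 = at the bound)."""
    return geometric_mean(measured_us) / geometric_mean(sol_us)


# Hypothetical timings in microseconds for three benchmark shapes.
measured = [40.0, 90.0, 22.5]
speed_of_light = [20.0, 30.0, 15.0]

score = geometric_mean(measured)
ratio = slowdown_vs_speed_of_light(measured, speed_of_light)
print(f"geomean time: {score:.2f} us, slowdown vs SoL: {ratio:.2f}x")
```

Because the mean is taken in log space, a 2x regression on one shape costs as much as a 2x gain on another recovers, which is why a single slow shape drags the whole score down.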

What winning kernels used (from participant blogs):

Direct PTX assembly (not C intrinsics) for precise instruction control
Specialized cache policies: L1::no_allocate for streaming data, L1::evict_last for reusable vectors
Wider vectorized loads (128-256 bit) with byte-level PTX unpacking
Compile-time specialization per exact K dimension
Aggressive register budgeting (32-45 registers vs naive 80)
Shared vector loading across multiple rows per block
Hardware intrinsics for FP4/FP8 decoding (__nv_cvt_fp4x2_to_halfraw2)
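
To make the "byte-level unpacking" and FP4 decoding items concrete, here is a minimal Python sketch of what a manual decode involves: each byte packs two 4-bit E2M1 values (1 sign bit, 2 exponent bits, 1 mantissa bit), which are looked up and multiplied by a block scale. The nibble order (low nibble first) and the plain-float scale are assumptions for illustration; on the GPU, the hardware intrinsic named above performs this conversion and replaces this kind of bit manipulation.

```python
# Magnitudes representable by E2M1, indexed by the low 3 bits of the nibble.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_fp4(nibble):
    """Decode one 4-bit E2M1 value; bit 3 is the sign bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4_bytes(packed, scale=1.0):
    """Unpack two FP4 values per byte and apply a block scale factor.

    Assumption: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_fp4(byte & 0xF) * scale)
        values.append(decode_fp4(byte >> 4) * scale)
    return values


# Hypothetical packed input: 0x21 -> (0.5, 1.0), 0xF7 -> (6.0, -6.0).
print(unpack_fp4_bytes(bytes([0x21, 0xF7]), scale=2.0))
```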

What did NOT work:

Double buffering with async copy (no improvement for GEMV)
Loading entire B matrix into shared memory
Processing 3-4 tiles per loop iteration (regression)

Key lesson from a participant: "The single most important thing I could have done was run Nsight Compute" to identify whether the bottleneck was memory or compute. Five optimization attempts failed because they targeted compute improvements on a memory-bandwidth-limited kernel.

Surprising finding: A pure PyTorch solution using torch._scaled_mm with cuBLAS FP4 backend achieved 22.4 microseconds -- within striking distance of hand-optimized kernels. One top submitter had never hand-written GPU code before.

9. Jane Street x GPU MODE NYC Hackathon (~February 2026)

Field	Detail
Location	New York City (Jane Street office)
Prize Pool	$50,000 grand prize (highest single prize)
Format	In-person, 1-day hackathon, leaderboard-based
Theme	Model optimization
GPU hours consumed	26,000

Winners:

$50K Grand Prize: Nadav Timor and team
1st Place (separate track): Kyle Yu and team

Speakers/Panelists:

Mark Saroufim (GPU MODE) -- keynote
Tri Dao -- keynote
Panel: Soumith Chintala, Sam Gross, Gregory Chanan (PyTorch tech leads)

Note: Jane Street talk "Making GPUs Actually Fast: A Deep Dive into Training Performance" was released alongside this event.

10. AMD x GPU MODE E2E Model Speedrun (March -- May 2026, CURRENT)

Field	Detail
Prize Pool	$1,100,000 total cash
Hardware	AMD Instinct MI355X (CDNA4)
Format	Online, 2 phases: qualifier (kernel leaderboard) + finals (end-to-end inference)
Qualifier deadline	Registration: March 30; Submissions: April 6
Finals models	Track 1: DeepSeek-R1-0528 ($350K); Track 2: Kimi K2.5 1T ($650K)

Qualifier problems:

MXFP4 GEMM (1000 pts)
MLA Decode (1250 pts)
MXFP4 MoE (1500 pts)

Covered in detail in gpu-mode-entry-guide.md and gpu-mode-advanced-strategy.md.

11. GPU MODE IRL Paris -- PyTorch Conference Europe (April 9, 2026, UPCOMING)

Field	Detail
Date	April 9, 2026 (side event to PyTorch Conference Europe April 7-8)
Location	Paris, France
Hardware	NVIDIA Blackwell Ultra (via Verda) + Hopper (via Sesterce)
Theme	LLM speedrun: pre-train an LLM from scratch on Blackwell Ultra cluster (360 PFLOP/s BF16) + ML systems optimization tasks

Winning Project Patterns

What approaches won?

IRL hackathons (project-based):

Systems-level contributions that integrate into real frameworks (vLLM, PyTorch, CUTLASS)
Novel architecture adaptations to new hardware (Blackwell Split L2, MoE on GB200)
Algorithmic innovations (context-parallel attention, LLM-driven kernel tuning)
NOT raw kernel speed -- creativity, novelty, and practical impact matter more

Online leaderboard competitions (speed-based):

Hand-written CUDA/HIP with PTX assembly for critical paths
Hardware intrinsics over manual bit manipulation
Aggressive memory access optimization (coalesced loads, cache policies, vectorization)
Profiler-driven iteration (Nsight Compute / rocprof-compute)
High submission counts (top competitors: 500-3600+ submissions)

Common techniques across winners

Profiling first, optimizing second -- identifying whether bottleneck is memory or compute before choosing optimization strategy
Hardware intrinsics -- using vendor-provided intrinsics rather than reimplementing format conversion
Register pressure management -- winning kernels use 32-45 registers vs naive 80
Aggressive fusion -- fusing elementwise ops, normalization, gating, and compute into single kernels
Systematic iteration -- hundreds to thousands of submissions, varying one parameter at a time
Leveraging vendor libraries as baselines -- cuBLAS/hipBLASLt as floor, then exceeding with architecture-specific tuning
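
The register-pressure point can be made concrete with a back-of-the-envelope occupancy estimate. The sketch below assumes 65,536 32-bit registers and 2,048 resident threads per SM, and ignores register allocation granularity, shared memory, and block-count limits; these figures are assumptions and should be checked against the target GPU's documentation.

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,
                               max_threads_per_sm=2048):
    """Fraction of the SM's thread slots usable given per-thread register use.

    Simplified: considers only the register file, not shared memory,
    allocation granularity, or per-block limits.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    threads_by_registers = regs_per_sm // regs_per_thread
    return min(threads_by_registers, max_threads_per_sm) / max_threads_per_sm


for regs in (32, 45, 80):
    occupancy = register_limited_occupancy(regs)
    print(f"{regs} registers/thread -> {occupancy:.0%} register-limited occupancy")
```

Under these assumptions, 32 registers per thread allows full occupancy while 80 caps it at roughly 40%, which matters most for memory-bound kernels that rely on many resident warps to hide latency.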

What surprised judges?

Beginners competing with experts: A top NVFP4 competitor had never written GPU code before (Mark Saroufim was "surprised")
LLM-generated kernels being competitive: K-Search (automated) beat human solutions on TriMul
Library-based solutions matching hand-tuned: cuda.compute (high-level Python API) topped the PMPP leaderboard; torch._scaled_mm scored within range of hand-written NVFP4 kernels
Framework contributions over raw speed: IRL #1 winners built infrastructure tools, not the fastest kernels

Speed improvements achieved

Competition	Baseline	Winning	Improvement
NVFP4 GEMV	~100 us (CuTe DSL)	~22 us (CUDA)	~4.5x
AMD FP8 GEMM	AMD in-house kernel	Community submission	Faster on some shapes
IRL #2 MoE inference	DeepEP baseline	Symmetric Minds	~1.5x
IRL #2 attention gradients	XLA baseline	Flash-HOG	~3x
CDNA4 GEMM optimization stack	1.15 TFLOPS (naive)	2,680 TFLOPS (optimized)	~2,330x
Blackwell Split L2	Standard reduction	Arun Demeure's kernel	100x+ data transfer reduction

Judging Criteria Differences

IRL Hackathons (project-based)

Creativity and novelty of the approach
Practical impact -- does it solve a real problem? Could it be merged into a real framework?
Technical depth -- how well do you understand the hardware?
Presentation quality -- ability to explain and demo in ~5 minutes
Judging by a panel of experts (speakers, sponsors, organizers)
Not purely speed -- a creative infrastructure contribution beats a marginally faster kernel

Online Leaderboard Competitions (speed-based)

Purely speed -- geometric mean of benchmark times across test cases
Compared against a "speed of light" baseline (theoretical hardware maximum)
Correctness verification required (automated)
Multiple GPU architectures tested (B200, H100, A100, L4 for NVIDIA; MI300X, MI355X for AMD)
Final winners may be selected by sponsor after reproduction/verification (AMD competitions)
Leaderboard placement is not always a guarantee of prize (AMD disclaimer)

"Speed of Light" Analysis in Practice

Calculate the theoretical lower bound on runtime: max(compute_time, memory_time) for the specific operation
For GEMM: 2 * M * N * K / peak_FLOPS (compute-bound) or (M*K + K*N + M*N) * bytes / peak_bandwidth (memory-bound)
Real-world achievable: typically 53-58% of theoretical peak; hipBLASLt/cuBLAS achieve ~97% for large GEMMs
The gap between your kernel and the SoL is your optimization opportunity
Winners typically get within 2x of SoL; top competitors within 1.5x
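
The GEMM formulas above can be turned into a small calculator. This is a sketch, not the leaderboard's actual code: the function name is hypothetical, and the peak FLOP/s and bandwidth figures are placeholders rather than the specification of any particular GPU.

```python
def gemm_speed_of_light_us(m, n, k, bytes_per_element, peak_flops, peak_bandwidth):
    """Return (lower-bound time in microseconds, limiting resource) for a GEMM.

    compute time = 2*M*N*K / peak_flops
    memory time  = (M*K + K*N + M*N) * bytes_per_element / peak_bandwidth
    """
    compute_s = 2 * m * n * k / peak_flops
    memory_s = (m * k + k * n + m * n) * bytes_per_element / peak_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s) * 1e6, bound


# Placeholder hardware numbers (assumptions): 1 PFLOP/s and 4 TB/s.
PEAK_FLOPS = 1e15
PEAK_BANDWIDTH = 4e12

square_us, square_bound = gemm_speed_of_light_us(
    4096, 4096, 4096, 2, PEAK_FLOPS, PEAK_BANDWIDTH)
gemv_us, gemv_bound = gemm_speed_of_light_us(
    1, 4096, 4096, 2, PEAK_FLOPS, PEAK_BANDWIDTH)

print(f"square GEMM: {square_us:.1f} us ({square_bound}-bound)")
print(f"GEMV-like:   {gemv_us:.1f} us ({gemv_bound}-bound)")
```

With these placeholder peaks, the large square GEMM comes out compute-bound and the GEMV-like shape memory-bound, which is the distinction that determines which optimizations are worth trying.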

Winner Profiles

Known Top Competitors

Name/Team	Background	Notable Wins
"Nader" (NVIDIA CCCL team)	Industry (NVIDIA)	PMPP leaderboard top across 4 GPU architectures
Arun Demeure	Independent/industry	SemiAnalysis hackathon 1st place (Blackwell Split L2)
Symmetric Minds (team)	Unknown (likely industry/research)	IRL #2 1st place (MoE inference on GB200)
Flash Hogs (team)	Unknown	IRL #2 2nd place (Flash-HOG attention gradients)
Delta Net (team)	Unknown	IRL #2 3rd place (context-parallel Gated DeltaNet)
KernelEvolve (team)	Unknown (LLM+RL approach)	IRL #2 4th place (automated kernel tuning)
Nadav Timor	Unknown	Jane Street $50K grand prize winner
Kyle Yu	Unknown	Jane Street hackathon 1st place
Marcel Roed, Herman Brunborg, Rajat V. Dwaraknath	PhD students, Stanford	Modular/Mojo hackathon winners
"Danishlynx"	Unknown	Current AMD E2E Speedrun leader (top in 2/3 kernels)

Academic vs Industry vs Indie Breakdown

Based on available data, the competitor pool is diverse:

Industry engineers (NVIDIA, AMD, Meta, etc.) have a natural advantage from daily exposure to GPU hardware. The NVIDIA CCCL team dominated the PMPP leaderboard.
PhD students (Stanford, etc.) have won Mojo and project-based hackathons with novel research contributions.
Independent/indie developers have won speed competitions. One NVFP4 top performer had no prior GPU code experience.
Teams using LLM-assisted approaches (KernelEvolve, K-Search) are emerging as a new category, sometimes beating humans.

Pattern: IRL hackathons tend to be won by people with deep systems knowledge (able to build complete systems in 14 hours). Online leaderboards are more accessible to newcomers who can iterate via brute-force submission.

Team Sizes That Win

IRL hackathons: Small teams of 2-4 people
Online leaderboards: Often individuals (the leaderboard tracks individual submissions)
Current AMD competition: Teams of up to 3 allowed; unclear if top performers are solo or teams

Evolution of Difficulty

2024: "Build Something Cool"

CUDA MODE IRL #1 was open-ended: build any GPU project
Low bar for participation; 200 attendees, 40+ teams
Winners were framework contributions, not optimized kernels
Community was ~10K members

Early 2025: "Write a Fast Kernel" (Standard Problems)

PMPP leaderboard: textbook problems (vector add, sort, prefix sum)
These are well-understood algorithms with known optimal solutions
NVIDIA's CUB library could top the leaderboard out of the box
Bar: know your GPU programming fundamentals

Mid 2025: "Write a Fast Production Kernel" ($100K Competitions)

AMD DeepSeek competition: real LLM inference kernels (FP8 GEMM, MoE, MLA)
These require understanding both the algorithm AND the specific hardware
Community submissions beat AMD's own in-house kernels on some shapes
Bar: deep understanding of specific GPU microarchitecture + algorithm expertise

Late 2025: "Write a Fast Bleeding-Edge Kernel" (Blackwell NVFP4)

New hardware (Blackwell), new format (NVFP4), new instructions
No existing optimized libraries to copy from
Required PTX assembly and undocumented hardware behavior
Bar: hardware reverse-engineering + low-level optimization mastery

2026: "Optimize an Entire Model End-to-End" ($1.1M)

AMD E2E Speedrun: not just kernels, but full inference optimization
MXFP4 format on MI355X (CDNA4): brand-new hardware with limited documentation
Multiple interacting kernels (GEMM + MoE + MLA) scored via geometric mean
Finals: end-to-end DeepSeek-R1 and Kimi K2.5 inference
Bar: systems engineering across the full stack, not just single-kernel optimization

Summary: The bar has risen from "write any GPU code" to "optimize bleeding-edge operations on unreleased hardware, end-to-end, faster than vendor libraries."

Lessons for the Current E2E Speedrun

What can we learn from past winners' approaches?

Profile before optimizing. The NVFP4 competitor who wasted 5 attempts on compute optimizations for a memory-bound kernel is a cautionary tale. Run rocprof-compute first.

Start with vendor baselines. cuBLAS/hipBLASLt/AITER provide strong floors. In the NVFP4 competition, torch._scaled_mm achieved near-competitive results. In the AMD competition, AITER's assembly kernels are the starting point for MLA Decode.

Geometric mean scoring punishes imbalance. You MUST submit to all three kernels. Being top 5 in one kernel but bottom 50% in others loses to someone top 20 in all three. Past AMD competitions used similar weighted scoring.

High submission count correlates with placement. Top performers submit 500-3600+ times. The iteration loop (profile -> hypothesize -> implement -> test -> benchmark -> submit) is the core skill.

Hand-written HIP/CUDA with assembly beats Triton for GEMMs. Triton is 30-50% slower for GEMM due to inability to express ping-pong scheduling or GLOBAL_LOAD_LDS. But Triton is viable for MLA and MoE dispatch logic.

Hardware intrinsics matter more than clever algorithms. Using __nv_cvt_fp4x2_to_halfraw2 instead of manual bit manipulation was the difference between competitive and not in NVFP4. For AMD, the MFMA 16x16x128 instruction with FP4 inputs is the equivalent.

Register pressure is a hidden killer. Winning NVFP4 kernels used 32-45 registers; naive implementations used 80. On CDNA4, register pressure affects occupancy and ping-pong scheduling.

Community beats vendor (sometimes). AMD's own in-house FP8 GEMM kernel was beaten by community submissions on some shapes. This means there IS room to exceed existing baselines.

Newcomers can compete. One NVFP4 top performer had no GPU coding experience. Modern tools (cuda.compute, Triton, torch._scaled_mm) lower the barrier significantly.

For IRL events: build something that matters. Winners built tools that could be merged into vLLM, PyTorch, or CUTLASS -- not academic exercises.

Common mistakes to avoid

Optimizing compute on a memory-bound kernel (or vice versa). Profile first.
Neglecting one kernel in geometric-mean scoring. A zero in any kernel = zero overall.
Over-engineering early. Start with a working baseline, submit it, then iterate.
Not using --mode test before --mode benchmark. Silent correctness failures waste time.
Ignoring scale factor shuffling for MXFP4. Getting this wrong produces silently wrong results.
Trying to implement everything from scratch instead of building on AITER/CK-Tile/hipBLASLt.
Not tracking submissions systematically. Track: timestamp, kernel, score, what changed.
Spending too long reading papers before submitting. The top competitors learn by iterating on submissions, not by reading.
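
For the submission-tracking point, a minimal log is enough. The sketch below records the four fields suggested above (timestamp, kernel, score, what changed) as CSV and reports the best score per kernel. The field names, kernel names, and timings are hypothetical, and it writes to an in-memory buffer for demonstration; a real file would be used in practice.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "kernel", "score_us", "change"]


def append_submission(stream, kernel, score_us, change, timestamp=None):
    """Append one submission record, writing the header if the log is empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow({
        "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
        "kernel": kernel,
        "score_us": score_us,
        "change": change,
    })


def best_per_kernel(stream):
    """Return {kernel: (best score in microseconds, change that produced it)}."""
    stream.seek(0)
    best = {}
    for row in csv.DictReader(stream):
        score = float(row["score_us"])
        if row["kernel"] not in best or score < best[row["kernel"]][0]:
            best[row["kernel"]] = (score, row["change"])
    return best


# Hypothetical entries; lower score is better.
log = io.StringIO(newline="")
append_submission(log, "mxfp4_gemm", 42.0, "baseline from vendor library")
append_submission(log, "mxfp4_gemm", 31.0, "wider vectorized loads")
append_submission(log, "mla_decode", 88.5, "baseline")

best = best_per_kernel(log)
print(best)
```

Recording one change per row keeps the log consistent with the advice to vary one parameter at a time.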

All GPU MODE Competition Events Summary

#	Event	Date	Format	Hardware	Prize Pool	Key Winners
1	CUDA MODE IRL #1	Sep 2024	In-person, project	NVIDIA GPUs	~$40K credits + RTX 4080s	Flexible attention masks, NCCL in Triton
2	PMPP Leaderboard	Feb 2025	Online, speed	B200/H100/A100/L4	Recognition	NVIDIA CCCL ("Nader")
3	AMD DeepSeek Kernels R1	~Apr 2025	Online, speed	MI300X	$100K cash	Community beat AMD in-house on FP8 GEMM
4	AMD Distributed Kernels R2	~2025	Online, speed	MI300X (multi)	$100K cash	4 winners selected by AMD
5	TriMul / AlphaFold3	~2025	Online, speed	H100/A100/B200/MI300X	Merch	K-Search (automated) beat humans
6	Modular Mojo Hackathon	May 2025	In-person, project	MI300X	Unknown	Stanford PhD team (Transformer training in Mojo)
7	SemiAnalysis / IRL #2	Late 2025	In-person, project	GB200 (Blackwell)	Credits + recognition	Symmetric Minds, Arun Demeure
8	NVFP4 Kernel Hackathon	Nov 2025-Feb 2026	Online, speed	B200	DGX Spark/RTX 5090/5080 per problem	PTX assembly specialists
9	Jane Street x GPU MODE	~Feb 2026	In-person, speed	Unknown	$50K grand prize	Nadav Timor ($50K), Kyle Yu (1st)
10	AMD E2E Speedrun	Mar-May 2026	Online, speed+E2E	MI355X	$1.1M cash	IN PROGRESS
11	Paris IRL (PyTorch Conf)	Apr 9, 2026	In-person, project	Blackwell Ultra + Hopper	TBD	UPCOMING