GPU MODE -- Past Competition History & Winner Analysis

Research Report

Last updated: 2026-03-19

Sources: Web searches across X/Twitter, NVIDIA blogs, SemiAnalysis, Accel, personal blogs, HuggingFace, LinkedIn, AMD developer pages, academic papers (KernelBot/OpenReview)


Community Background

GPU MODE (formerly CUDA MODE) is an open-source GPU programming community founded by Mark Saroufim (PyTorch maintainer at Meta) and Andreas Kopf. As of early 2026:

The community's philosophy centers on making GPU programming accessible. Competitions serve as both learning tools and performance benchmarks. Mark Saroufim noted with surprise that "one of the top submitters in the NVFP4 competition has never hand-written GPU code before."


Competition History (Chronological)

1. CUDA MODE IRL #1 -- "The First IRL" (September 2024, San Francisco)

| Field | Detail |
| --- | --- |
| Date | Last weekend of September 2024 |
| Location | San Francisco (hosted by Accel) |
| Collaborators | Accel, PyTorch, NVIDIA |
| Duration | 14 hours (10 AM - midnight) |
| Participants | ~200 developers from across the globe |
| Teams | 40+ submitted projects |
| Hardware | NVIDIA GPUs (generation not specified; RTX 4080 prizes) |
| Prize Pool | ~$40K in compute credits + 3x RTX 4080 GPUs signed by Jensen Huang |
| Sponsors | Anyscale, Fal.ai, Lambda Labs, Modal Labs, Nebius.ai, Oracle, Prime Intellect, Together.ai |
| Format | In-person hackathon, open-ended project submissions |

Winners (top 3 projects):

  1. Flexible attention masks in CUTLASS -- extending CUTLASS with configurable attention mask support
  2. NCCL rewrite using Triton -- reimplementing collective communication primitives in Triton
  3. PyTorch binaries without libtorch -- stripping the C++ runtime dependency from PyTorch

Keynote Speakers:

Key observation: This was the first in-person event. The format was open-ended (build anything GPU-related), not a kernel speed competition, and judging was project-based rather than leaderboard-based. The winning projects were infrastructure and framework contributions, not raw kernel-speed entries.


2. GPU MODE Kernel Leaderboard Launch (February 2025)

The permanent online leaderboard launched in early 2025, hosting company-sponsored kernel speed competitions. This shifted GPU MODE from project-based hackathons to pure performance competitions.

Platform: KernelBot (academic paper published at ICML 2025 workshop)

Initial problem set: PMPP (Programming Massively Parallel Processors) textbook problems


3. AMD DeepSeek Kernels Competition -- Round 1 ($100K, ~April 2025)

| Field | Detail |
| --- | --- |
| Prize Pool | $100,000 cash |
| Hardware | AMD MI300X (provided free via GPU MODE Discord) |
| Theme | LLM inference kernels for DeepSeek-V3 on AMD |
| Format | Online, leaderboard-based via KernelBot |
| Submissions | ~40K kernels submitted |

Problems:

Key result: The top submission for FP8 Matmul on AMD MI300X was faster than AMD's optimized in-house kernel on some matrix shapes used directly in DeepSeek-V3. This demonstrated that community-driven optimization can exceed vendor kernels.

Winner details: Specific winners not publicly announced in detail at time of research. AMD selected 4 winning submissions after reproduction and verification. Select participants invited to AMD DevDay 2025 during Open-Source AI Week.


4. AMD Distributed Kernels Competition -- Round 2 ($100K, ~2025)

| Field | Detail |
| --- | --- |
| Prize Pool | $100,000 cash |
| Hardware | AMD Instinct GPUs (multi-GPU) |
| Theme | Multi-GPU communication kernels for LLM training |
| Format | Online, leaderboard-based via KernelBot |

Problems (weighted scoring):

Scoring: Top 10 performant kernels considered per problem. 4 winners selected by AMD after reproduction/verification. Leaderboard placement not a guarantee of winning.


5. TriMul / AlphaFold3 Competition (~2025)

| Field | Detail |
| --- | --- |
| Prize Pool | Merch (no cash) |
| Hardware | NVIDIA H100, A100, B200; AMD MI300X |
| Theme | Triangular Multiplicative Update from AlphaFold3 |
| Format | Online, leaderboard-based |

Key techniques by winners:

Notable result: K-Search (automated kernel generation system) achieved state-of-the-art at 1028 microseconds on H100, surpassing both prior automated and human-designed solutions.


6. Modular x GPU MODE Mojo Hackathon (May 10, 2025, AGI House)

| Field | Detail |
| --- | --- |
| Date | May 10, 2025 |
| Location | AGI House, Hillsborough, CA |
| Participants | 100+ engineers and researchers |
| Hardware | AMD MI300X (via Crusoe Cloud) |
| Sponsors | Modular, AMD, Crusoe, GPU MODE |
| Format | In-person hackathon |

Notable winners:

Speakers: Chris Lattner (Modular), Dylan Patel (SemiAnalysis), representatives from AMD and Anthropic.


7. SemiAnalysis Blackwell Hackathon / GPU MODE IRL #2 (~Late 2025)

| Field | Detail |
| --- | --- |
| Location | Accel office, San Francisco |
| Hardware | NVIDIA Blackwell GB200 (provided by Nebius AI) |
| Format | In-person hackathon, project-based judging |
| Sponsors | Together.ai, Lambda Labs, Google Cloud, NVIDIA, GPU MODE, Thinking Machines, OpenAI, PyTorch, CoreWeave, Nebius |

Winners:

| Place | Team | Project | Result |
| --- | --- | --- | --- |
| 1st | Symmetric Minds | Multi-GPU expert-parallel MoE inference in vLLM with symmetric memory pool + Triton kernel refinements | ~1.5x speedup over DeepEP on GB200 |
| 2nd | Flash Hogs | Flash-HOG: custom Blackwell kernel for higher-order attention gradients | ~3x speed, linear memory cost vs XLA, validated to 1M tokens |
| 3rd | Delta Net | Context-parallel Gated DeltaNet | Scaling to 128K+ context on single GB200 node with 8B hybrid GDN-Transformer |
| 4th | KernelEvolve | LLM-driven search + RL to auto-tune Helion GPU kernels | Tuning time from hours to <30 min |
| Special 1st (SemiAnalysis) | Arun Demeure | Optimizing Blackwell's Split L2 | 100x+ reduction in cross-chip data transfer and power consumption |

Speakers:


8. Blackwell NVFP4 Kernel Hackathon (November 2025 -- February 2026)

| Field | Detail |
| --- | --- |
| Organizers | NVIDIA + GPU MODE |
| Hardware | NVIDIA B200 (Blackwell); DGX B200 compute via Sesterce |
| Duration | ~3 months (Nov 2025 -- Feb 2026) |
| Format | Online, 4 sequential kernel problems, leaderboard-based |
| Prizes | Per problem: 1st DGX Spark + GTC 2026 pass; 2nd RTX 5090 + GTC pass; 3rd RTX 5080. Grand prize for overall fastest (weighted sum). Top 2 per problem invited to GTC awards ceremony. |

Problems:

  1. Batched GEMV (NVFP4 format)
  2. GEMM (NVFP4 format)
  3. Gated Dual GEMM (with SiLU activation)
  4. Grouped GEMM

Scoring methodology: Geometric mean of benchmark times against a "speed-of-light" baseline derived from max(FFMA math throughput, DRAM memory throughput) on B200 at 1.5 GHz.
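
As a rough illustration of this kind of scoring, the sketch below computes a geometric mean of per-shape ratios between a measured time and a speed-of-light time. The function name, the ratio convention (measured / SoL), and the sample timings are assumptions made for illustration; the sources do not give the competition's exact formula.

```python
import math

def geomean_slowdown(measured_us, sol_us):
    """Geometric mean of measured/SoL time ratios (1.0 means at speed of light)."""
    if len(measured_us) != len(sol_us) or not measured_us:
        raise ValueError("need equally sized, non-empty timing lists")
    # Average in log space so the product of many ratios cannot overflow.
    log_sum = sum(math.log(m / s) for m, s in zip(measured_us, sol_us))
    return math.exp(log_sum / len(measured_us))

# Hypothetical timings in microseconds for three benchmark shapes.
sol = [10.0, 20.0, 40.0]
balanced = geomean_slowdown([20.0, 40.0, 80.0], sol)   # 2x off on every shape
lopsided = geomean_slowdown([11.0, 22.0, 320.0], sol)  # near SoL twice, 8x off once

print(f"balanced: {balanced:.3f}  lopsided: {lopsided:.3f}")
```

Under these made-up numbers the lopsided entry scores worse than the balanced one even though it is closer to the baseline on two of three shapes, which is the usual reason a geometric mean rewards consistency across shapes.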

What winning kernels used (from participant blogs):

What did NOT work:

Key lesson from a participant: "The single most important thing I could have done was run Nsight Compute" to identify whether the bottleneck was memory or compute. Five optimization attempts failed because they targeted compute improvements on a memory-bandwidth-limited kernel.

Surprising finding: A pure PyTorch solution using torch._scaled_mm with cuBLAS FP4 backend achieved 22.4 microseconds -- within striking distance of hand-optimized kernels. One top submitter had never hand-written GPU code before.


9. Jane Street x GPU MODE NYC Hackathon (~February 2026)

| Field | Detail |
| --- | --- |
| Location | New York City (Jane Street office) |
| Prize Pool | $50,000 grand prize (highest single prize) |
| Format | In-person, 1-day hackathon, leaderboard-based |
| Theme | Model optimization |
| GPU hours consumed | 26,000 |

Winners:

Speakers/Panelists:

Note: Jane Street talk "Making GPUs Actually Fast: A Deep Dive into Training Performance" was released alongside this event.


10. AMD x GPU MODE E2E Model Speedrun (March -- May 2026, CURRENT)

| Field | Detail |
| --- | --- |
| Prize Pool | $1,100,000 total cash |
| Hardware | AMD Instinct MI355X (CDNA4) |
| Format | Online, 2 phases: qualifier (kernel leaderboard) + finals (end-to-end inference) |
| Qualifier deadline | Registration: March 30; Submissions: April 6 |
| Finals models | Track 1: DeepSeek-R1-0528 ($350K); Track 2: Kimi K2.5 1T ($650K) |

Qualifier problems:

Covered in detail in gpu-mode-entry-guide.md and gpu-mode-advanced-strategy.md.


11. GPU MODE IRL Paris -- PyTorch Conference Europe (April 9, 2026, UPCOMING)

| Field | Detail |
| --- | --- |
| Date | April 9, 2026 (side event to PyTorch Conference Europe April 7-8) |
| Location | Paris, France |
| Hardware | NVIDIA Blackwell Ultra (via Verda) + Hopper (via Sesterce) |
| Theme | LLM speedrun: pre-train an LLM from scratch on Blackwell Ultra cluster (360 PFLOP/s BF16) + ML systems optimization tasks |

Winning Project Patterns

What approaches won?

IRL hackathons (project-based):

Online leaderboard competitions (speed-based):

Common techniques across winners

  1. Profiling first, optimizing second -- identifying whether bottleneck is memory or compute before choosing optimization strategy
  2. Hardware intrinsics -- using vendor-provided intrinsics rather than reimplementing format conversion
  3. Register pressure management -- winning kernels use 32-45 registers vs naive 80
  4. Aggressive fusion -- fusing elementwise ops, normalization, gating, and compute into single kernels
  5. Systematic iteration -- hundreds to thousands of submissions, varying one parameter at a time
  6. Leveraging vendor libraries as baselines -- cuBLAS/hipBLASLt as floor, then exceeding with architecture-specific tuning
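
To see why the register counts in item 3 matter, the following sketch estimates how much of an SM's thread capacity can stay resident for a given registers-per-thread figure. It assumes a 65,536-entry register file and a 2,048-thread limit per SM, which are the figures commonly published for recent NVIDIA data-center GPUs. It ignores allocation granularity, shared memory, and block-size limits, so the result is an upper bound rather than a profiler reading. The function name is hypothetical.

```python
def register_limited_occupancy(regs_per_thread, regs_per_sm=65536, thread_limit=2048):
    """Upper bound on occupancy when the register file is the only constraint.

    regs_per_sm and thread_limit are assumed defaults; check the target
    architecture's documentation before relying on them.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    resident_threads = min(thread_limit, regs_per_sm // regs_per_thread)
    return resident_threads / thread_limit

# Register counts quoted in the list above.
for regs in (32, 45, 80):
    print(f"{regs} regs/thread -> {register_limited_occupancy(regs):.0%} occupancy bound")
```

Higher occupancy is not always faster, but for memory-bound kernels more resident threads generally means more loads in flight to hide latency.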

What surprised judges?

Speed improvements achieved

| Competition | Baseline | Winning | Improvement |
| --- | --- | --- | --- |
| NVFP4 GEMV | ~100 us (CuTe DSL) | ~22 us (CUDA) | ~4.5x |
| AMD FP8 GEMM | AMD in-house kernel | Community submission | Faster on some shapes |
| IRL #2 MoE inference | DeepEP baseline | Symmetric Minds | ~1.5x |
| IRL #2 attention gradients | XLA baseline | Flash-HOG | ~3x |
| CDNA4 GEMM optimization stack | 1.15 TFLOPS (naive) | 2,680 TFLOPS (optimized) | ~2,330x |
| Blackwell Split L2 | Standard reduction | Arun Demeure's kernel | 100x+ data transfer reduction |

Judging Criteria Differences

IRL Hackathons (project-based)

Online Leaderboard Competitions (speed-based)

"Speed of Light" Analysis in Practice

  1. Calculate the theoretical minimum time: max(compute-bound time, memory-bound time) for the specific operation
  2. For GEMM: 2 * M * N * K / peak_FLOPS (compute-bound time) or (M*K + K*N + M*N) * bytes_per_element / peak_bandwidth (memory-bound time)
  3. Real-world achievable: typically 53-58% of theoretical peak; hipBLASLt/cuBLAS achieve ~97% for large GEMMs
  4. The gap between your kernel and the SoL is your optimization opportunity
  5. Winners typically get within 2x of SoL; top competitors within 1.5x
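
The steps above can be sketched as a small roofline-style calculator for a GEMM. The peak numbers in the usage lines are placeholders, not measured or vendor-quoted figures for any specific GPU, and the byte count assumes each matrix is read or written exactly once. Function names are hypothetical.

```python
def gemm_speed_of_light(m, n, k, bytes_per_elem, peak_flops, peak_bandwidth):
    """Return (sol_seconds, bound) for C[m, n] = A[m, k] @ B[k, n]."""
    compute_s = 2 * m * n * k / peak_flops
    memory_s = (m * k + k * n + m * n) * bytes_per_elem / peak_bandwidth
    # The slower of the two limits is the best time the hardware allows.
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound

def gap_to_sol(measured_s, sol_s):
    """How many times slower than speed of light a measured kernel is."""
    return measured_s / sol_s

# Placeholder hardware: 1e15 FLOP/s and 4e12 bytes/s (hypothetical values).
PEAK_FLOPS, PEAK_BW = 1e15, 4e12

sol_s, bound = gemm_speed_of_light(4096, 4096, 4096, 2, PEAK_FLOPS, PEAK_BW)
gemv_s, gemv_bound = gemm_speed_of_light(1, 4096, 4096, 2, PEAK_FLOPS, PEAK_BW)

print(f"square GEMM: {sol_s * 1e6:.1f} us, {bound}-bound")
print(f"GEMV-like:   {gemv_s * 1e6:.1f} us, {gemv_bound}-bound")
```

With these placeholder peaks the large square GEMM comes out compute-bound and the M=1 case memory-bound, so the same code path would need different optimizations depending on shape.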

Winner Profiles

Known Top Competitors

| Name/Team | Background | Notable Wins |
| --- | --- | --- |
| "Nader" (NVIDIA CCCL team) | Industry (NVIDIA) | PMPP leaderboard top across 4 GPU architectures |
| Arun Demeure | Independent/industry | SemiAnalysis hackathon 1st place (Blackwell Split L2) |
| Symmetric Minds (team) | Unknown (likely industry/research) | IRL #2 1st place (MoE inference on GB200) |
| Flash Hogs (team) | Unknown | IRL #2 2nd place (Flash-HOG attention gradients) |
| Delta Net (team) | Unknown | IRL #2 3rd place (context-parallel Gated DeltaNet) |
| KernelEvolve (team) | Unknown (LLM+RL approach) | IRL #2 4th place (automated kernel tuning) |
| Nadav Timor | Unknown | Jane Street $50K grand prize winner |
| Kyle Yu | Unknown | Jane Street hackathon 1st place |
| Marcel Roed, Herman Brunborg, Rajat V. Dwaraknath | PhD students, Stanford | Modular/Mojo hackathon winners |
| "Danishlynx" | Unknown | Current AMD E2E Speedrun leader (top in 2/3 kernels) |

Academic vs Industry vs Indie Breakdown

Based on available data, the competitor pool is diverse:

Pattern: IRL hackathons tend to be won by people with deep systems knowledge (able to build complete systems in 14 hours). Online leaderboards are more accessible to newcomers who can iterate via brute-force submission.

Team Sizes That Win


Evolution of Difficulty

2024: "Build Something Cool"

Early 2025: "Write a Fast Kernel" (Standard Problems)

Mid 2025: "Write a Fast Production Kernel" ($100K Competitions)

Late 2025: "Write a Fast Bleeding-Edge Kernel" (Blackwell NVFP4)

2026: "Optimize an Entire Model End-to-End" ($1.1M)

Summary: The bar has risen from "write any GPU code" to "optimize bleeding-edge operations on unreleased hardware, end-to-end, faster than vendor libraries."


Lessons for the Current E2E Speedrun

What can we learn from past winners' approaches?

  1. Profile before optimizing. The NVFP4 competitor who wasted 5 attempts on compute optimizations for a memory-bound kernel is a cautionary tale. Run rocprof-compute first.
  2. Start with vendor baselines. cuBLAS/hipBLASLt/AITER provide strong floors. In the NVFP4 competition, torch._scaled_mm achieved near-competitive results. In the AMD competition, AITER's assembly kernels are the starting point for MLA Decode.
  3. Geometric mean scoring punishes imbalance. You MUST submit to all three kernels. Being top 5 in one kernel but bottom 50% in others loses to someone top 20 in all three. Past AMD competitions used similar weighted scoring.
  4. High submission count correlates with placement. Top performers submit 500-3600+ times. The iteration loop (profile -> hypothesize -> implement -> test -> benchmark -> submit) is the core skill.
  5. Hand-written HIP/CUDA with assembly beats Triton for GEMMs. Triton is 30-50% slower for GEMM due to inability to express ping-pong scheduling or GLOBAL_LOAD_LDS. But Triton is viable for MLA and MoE dispatch logic.
  6. Hardware intrinsics matter more than clever algorithms. Using __nv_cvt_fp4x2_to_halfraw2 instead of manual bit manipulation was the difference between competitive and not in NVFP4. For AMD, the MFMA 16x16x128 instruction with FP4 inputs is the equivalent.
  7. Register pressure is a hidden killer. Winning NVFP4 kernels used 32-45 registers; naive implementations used 80. On CDNA4, register pressure affects occupancy and ping-pong scheduling.
  8. Community beats vendor (sometimes). AMD's own in-house FP8 GEMM kernel was beaten by community submissions on some shapes. This means there IS room to exceed existing baselines.
  9. Newcomers can compete. One NVFP4 top performer had no GPU coding experience. Modern tools (cuda.compute, Triton, torch._scaled_mm) lower the barrier significantly.
  10. For IRL events: build something that matters. Winners built tools that could be merged into vLLM, PyTorch, or CUTLASS -- not academic exercises.

Common mistakes to avoid

  1. Optimizing compute on a memory-bound kernel (or vice versa). Profile first.
  2. Neglecting one kernel in geometric-mean scoring. A zero in any kernel = zero overall.
  3. Over-engineering early. Start with a working baseline, submit it, then iterate.
  4. Not using --mode test before --mode benchmark. Silent correctness failures waste time.
  5. Ignoring scale factor shuffling for MXFP4. Getting this wrong produces silently wrong results.
  6. Trying to implement everything from scratch instead of building on AITER/CK-Tile/hipBLASLt.
  7. Not tracking submissions systematically. Track: timestamp, kernel, score, what changed.
  8. Spending too long reading papers before submitting. The top competitors learn by iterating on submissions, not by reading.
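
For item 7, a minimal sketch of a submission log follows. The file layout, column names, and helper functions are hypothetical; they simply record the four fields listed above in a CSV file using the Python standard library, and are not part of any competition tooling.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column names covering: timestamp, kernel, score, what changed.
FIELDS = ["timestamp", "kernel", "score_us", "change"]

def log_submission(path, kernel, score_us, change):
    """Append one submission record, writing the header on first use."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "kernel": kernel,
            "score_us": score_us,
            "change": change,
        })

def best_per_kernel(path):
    """Return the lowest recorded time for each kernel."""
    best = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            score = float(row["score_us"])
            if row["kernel"] not in best or score < best[row["kernel"]]:
                best[row["kernel"]] = score
    return best
```

Calling log_submission after every leaderboard run, with a one-line note in change, keeps the "vary one parameter at a time" history auditable when a later change regresses.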

All GPU MODE Competition Events Summary

| # | Event | Date | Format | Hardware | Prize Pool | Key Winners |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | CUDA MODE IRL #1 | Sep 2024 | In-person, project | NVIDIA GPUs | ~$40K credits + RTX 4080s | Flexible attention masks, NCCL in Triton |
| 2 | PMPP Leaderboard | Feb 2025 | Online, speed | B200/H100/A100/L4 | Recognition | NVIDIA CCCL ("Nader") |
| 3 | AMD DeepSeek Kernels R1 | ~Apr 2025 | Online, speed | MI300X | $100K cash | Community beat AMD in-house on FP8 GEMM |
| 4 | AMD Distributed Kernels R2 | ~2025 | Online, speed | MI300X (multi) | $100K cash | 4 winners selected by AMD |
| 5 | TriMul / AlphaFold3 | ~2025 | Online, speed | H100/A100/B200/MI300X | Merch | K-Search (automated) beat humans |
| 6 | Modular Mojo Hackathon | May 2025 | In-person, project | MI300X | Unknown | Stanford PhD team (Transformer training in Mojo) |
| 7 | SemiAnalysis / IRL #2 | Late 2025 | In-person, project | GB200 (Blackwell) | Credits + recognition | Symmetric Minds, Arun Demeure |
| 8 | NVFP4 Kernel Hackathon | Nov 2025-Feb 2026 | Online, speed | B200 | DGX Spark/RTX 5090/5080 per problem | PTX assembly specialists |
| 9 | Jane Street x GPU MODE | ~Feb 2026 | In-person, speed | Unknown | $50K grand prize | Nadav Timor ($50K), Kyle Yu (1st) |
| 10 | AMD E2E Speedrun | Mar-May 2026 | Online, speed+E2E | MI355X | $1.1M cash | IN PROGRESS |
| 11 | Paris IRL (PyTorch Conf) | Apr 9, 2026 | In-person, project | Blackwell Ultra + Hopper | TBD | UPCOMING |