GPU MODE -- Past Competition History & Winner Analysis
Research Report
Last updated: 2026-03-19
Sources: Web searches across X/Twitter, NVIDIA blogs, SemiAnalysis, Accel, personal blogs, HuggingFace, LinkedIn, AMD developer pages, academic papers (KernelBot/OpenReview)
Community Background
GPU MODE (formerly CUDA MODE) is an open-source GPU programming community founded by Mark Saroufim (PyTorch maintainer at Meta) and Andreas Kopf. As of early 2026:
- ~26K Discord members
- ~26K YouTube subscribers
- 92 lecture recordings
- 400K+ KernelBot submissions across all competitions
- 3 events with major sponsors in 2025 alone (NVIDIA, Jane Street, Accel)
- 10 active working groups
The community's philosophy centers on making GPU programming accessible. Competitions serve as both learning tools and performance benchmarks. Mark Saroufim noted with surprise that "one of the top submitters in the NVFP4 competition has never hand-written GPU code before."
Competition History (Chronological)
1. CUDA MODE IRL #1 -- "The First IRL" (September 2024, San Francisco)
| Field | Detail |
| Date | Last weekend of September 2024 |
| Location | San Francisco (hosted by Accel) |
| Collaborators | Accel, PyTorch, NVIDIA |
| Duration | 14 hours (10 AM - midnight) |
| Participants | ~200 developers from across the globe |
| Teams | 40+ submitted projects |
| Hardware | NVIDIA GPUs (generation not specified; RTX 4080 prizes) |
| Prize Pool | ~$40K in compute credits + 3x RTX 4080 GPUs signed by Jensen Huang |
| Sponsors | Anyscale, Fal.ai, Lambda Labs, Modal Labs, Nebius.ai, Oracle, Prime Intellect, Together.ai |
| Format | In-person hackathon, open-ended project submissions |
Winners (top 3 projects):
- Flexible attention masks in CUTLASS -- extending CUTLASS with configurable attention mask support
- NCCL rewrite using Triton -- reimplementing collective communication primitives in Triton
- PyTorch binaries without libtorch -- stripping the C++ runtime dependency from PyTorch
Keynote Speakers:
- Tri Dao (Together.ai): Flash Attention 3, warp group MMA, Tensor Memory Accelerator
- Andrej Karpathy (Eureka Labs): llm.c and reference architectures
- Supriya Rao (PyTorch): Quantization and sparsity via torchao
- Lily Liu (vLLM): High-performance LLM inference and speculative decoding
- Tim Dettmers: Open source vs closed source GPU programming
- Wen-mei Hwu (NVIDIA): Selecting challenging long-term research problems
Key observation: This was the first in-person event. The format was open-ended (build anything GPU-related), not a kernel speed competition, and judging was project-based rather than leaderboard-based. The winning projects were infrastructure and framework contributions, not raw kernel-speed optimizations.
2. GPU MODE Kernel Leaderboard Launch (February 2025)
The permanent online leaderboard launched in early 2025, hosting company-sponsored kernel speed competitions. This shifted GPU MODE from project-based hackathons to pure performance competitions.
Platform: KernelBot (academic paper published at ICML 2025 workshop)
- Open-source competition platform
- Submissions via Discord bot or Popcorn CLI
- Automated correctness verification + benchmarking
- All submissions open-sourced under permissive license at competition end
- Results benchmarked across multiple GPU architectures
Initial problem set: PMPP (Programming Massively Parallel Processors) textbook problems
- PrefixSum, VectorAdd, Histogram, Sort, Grayscale
- Benchmarked on: NVIDIA B200, H100, A100, L4
- Winner: NVIDIA CCCL team (username "Nader") using the cuda.compute library
- Achieved most first-place finishes across all GPU architectures
- Sort implementation was 2-4x faster than the next best submission
- Used JIT-compiled CUB primitives with link-time optimization
3. AMD DeepSeek Kernels Competition -- Round 1 ($100K, ~April 2025)
| Field | Detail |
| Prize Pool | $100,000 cash |
| Hardware | AMD MI300X (provided free via GPU MODE Discord) |
| Theme | LLM inference kernels for DeepSeek-V3 on AMD |
| Format | Online, leaderboard-based via KernelBot |
| Submissions | ~40K kernels submitted |
Problems:
- FP8 Block-wise GEMM
- Single-GPU Mixture of Experts (MoE)
- Multi-Latent Attention (MLA)
Key result: The top submission for FP8 Matmul on AMD MI300 was faster than the optimized AMD in-house kernel on some matrix shapes used directly in DeepSeek-V3. This demonstrated that community-driven optimization can exceed vendor kernels.
Winner details: Specific winners not publicly announced in detail at time of research. AMD selected 4 winning submissions after reproduction and verification. Select participants invited to AMD DevDay 2025 during Open-Source AI Week.
4. AMD Distributed Kernels Competition -- Round 2 ($100K, ~2025)
| Field | Detail |
| Prize Pool | $100,000 cash |
| Hardware | AMD Instinct GPUs (multi-GPU) |
| Theme | Multi-GPU communication kernels for LLM training |
| Format | Online, leaderboard-based via KernelBot |
Problems (weighted scoring):
- All-to-All kernel (1500 points max)
- GEMM + ReduceScatter (1250 points max)
- AllGather + GEMM (1000 points max)
Scoring: Top 10 performant kernels considered per problem. 4 winners selected by AMD after reproduction/verification. Leaderboard placement not a guarantee of winning.
5. TriMul / AlphaFold3 Competition (~2025)
| Field | Detail |
| Prize Pool | Merch (no cash) |
| Hardware | NVIDIA H100, A100, B200; AMD MI300X |
| Theme | Triangular Multiplicative Update from AlphaFold3 |
| Format | Online, leaderboard-based |
Key techniques by winners:
- Fusing elementwise operations (input LayerNorm, sigmoid gating, output LayerNorm+gating)
- Converting inputs to FP16 and delegating to cuBLAS/rocBLAS for TensorCore/MatrixCore utilization
- Identifying heavy memory I/O from frequent elementwise operations as the bottleneck
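To see why fusing elementwise operations helps when memory I/O is the bottleneck, the following is a rough sketch that counts global-memory traffic for a chain of elementwise operations over one tensor. The traffic model (every unfused op reads and writes the full tensor), the function name, and the tensor shape are illustrative assumptions, not details taken from the winning submissions.

```python
def elementwise_traffic_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Rough global-memory traffic for a chain of elementwise ops on one tensor.

    Assumed model: unfused, each op reads the tensor and writes it back
    (2 passes per op); fused, one read at the start and one write at the end.
    """
    passes = 2 if fused else 2 * num_ops
    return passes * num_elements * bytes_per_element


# Hypothetical FP16 tensor and a chain of 3 elementwise ops.
elements = 256 * 256 * 128
unfused = elementwise_traffic_bytes(elements, 2, num_ops=3, fused=False)
fused = elementwise_traffic_bytes(elements, 2, num_ops=3, fused=True)
print(f"unfused: {unfused / 1e6:.1f} MB, fused: {fused / 1e6:.1f} MB, "
      f"ratio: {unfused / fused:.1f}x")
```

Under this model the traffic saving equals the number of fused ops. Real kernels deviate from it (LayerNorm needs a reduction, gating reads a second tensor), so treat the ratio as an upper-bound intuition rather than a prediction.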
Notable result: K-Search (automated kernel generation system) achieved state-of-the-art at 1028 microseconds on H100, surpassing both prior automated and human-designed solutions.
6. Modular x GPU MODE Mojo Hackathon (May 10, 2025, AGI House)
| Field | Detail |
| Date | May 10, 2025 |
| Location | AGI House, Hillsborough, CA |
| Participants | 100+ engineers and researchers |
| Hardware | AMD MI300X (via Crusoe Cloud) |
| Sponsors | Modular, AMD, Crusoe, GPU MODE |
| Format | In-person hackathon |
Notable winners:
- Marcel Roed, Herman Brunborg, Rajat Vadiraj Dwaraknath (all PhD students at Stanford): Built a training framework in Mojo/MAX, implementing kernels and backward passes to train a Transformer model from scratch, including backpropagation and AdamW optimizer.
Speakers: Chris Lattner (Modular), Dylan Patel (SemiAnalysis), representatives from AMD and Anthropic.
7. SemiAnalysis Blackwell Hackathon / GPU MODE IRL #2 (~Late 2025)
| Field | Detail |
| Location | Accel office, San Francisco |
| Hardware | NVIDIA Blackwell GB200 (provided by Nebius AI) |
| Format | In-person hackathon, project-based judging |
| Sponsors | Together.ai, Lambda Labs, Google Cloud, NVIDIA, GPU MODE, Thinking Machines, OpenAI, PyTorch, CoreWeave, Nebius |
Winners:
| Place | Team | Project | Result |
| 1st | Symmetric Minds | Multi-GPU expert-parallel MoE inference in vLLM with symmetric memory pool + Triton kernel refinements | ~1.5x speedup over DeepEP on GB200 |
| 2nd | Flash Hogs | Flash-HOG: custom Blackwell kernel for higher-order attention gradients | ~3x speed, linear memory cost vs XLA, validated to 1M tokens |
| 3rd | Delta Net | Context-parallel Gated DeltaNet | Scaling to 128K+ context on single GB200 node with 8B hybrid GDN-Transformer |
| 4th | KernelEvolve | LLM-driven search + RL to auto-tune Helion GPU kernels | Tuning time from hours to <30 min |
| Special 1st (SemiAnalysis) | Arun Demeure | Optimizing Blackwell's Split L2 | 100x+ reduction in cross-chip data transfer and power consumption |
Speakers:
- Mark Saroufim (GPU MODE): "How to Make an Impact in ML Systems"
- Vijay Thakkar (NVIDIA CUTLASS): Tensor core programming
- Horace He (Thinking Machines): Large-scale ML systems
- Philippe Tillet (OpenAI): Triton programming framework
- Tri Dao (Together.ai): Attention mechanism optimization
8. Blackwell NVFP4 Kernel Hackathon (November 2025 -- February 2026)
| Field | Detail |
| Organizers | NVIDIA + GPU MODE |
| Hardware | NVIDIA B200 (Blackwell); DGX B200 compute via Sesterce |
| Duration | ~3 months (Nov 2025 -- Feb 2026) |
| Format | Online, 4 sequential kernel problems, leaderboard-based |
| Prizes | Per problem: 1st DGX Spark + GTC 2026 pass; 2nd RTX 5090 + GTC pass; 3rd RTX 5080. Grand prize for overall fastest (weighted sum). Top 2 per problem invited to GTC awards ceremony. |
Problems:
- Batched GEMV (NVFP4 format)
- GEMM (NVFP4 format)
- Gated Dual GEMM (with SiLU activation)
- Grouped GEMM
Scoring methodology: Geometric mean of benchmark times against a "speed-of-light" baseline derived from max(FFMA math throughput, DRAM memory throughput) on B200 at 1.5 GHz.
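A minimal sketch of this kind of scoring, assuming the score is the geometric mean of per-shape ratios of measured time to speed-of-light time. The exact formula used by the organizers is not given in the sources, and the function names and timings below are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via logs, which avoids overflow on long lists."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sol_score(measured_us, sol_us):
    """Geometric mean of per-shape slowdown ratios (measured / speed-of-light).

    A value of 1.0 would mean every shape runs at the theoretical limit;
    larger values are worse.
    """
    if len(measured_us) != len(sol_us):
        raise ValueError("need one speed-of-light time per measured time")
    return geometric_mean([m / s for m, s in zip(measured_us, sol_us)])


# Hypothetical timings in microseconds for three benchmark shapes.
measured = [20.0, 40.0, 90.0]
speed_of_light = [10.0, 20.0, 30.0]
print(round(sol_score(measured, speed_of_light), 3))
```

Because the mean is taken over ratios, a 2x slowdown on a small shape counts as much as a 2x slowdown on a large one, which is why a single badly tuned shape is costly.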
What winning kernels used (from participant blogs):
- Direct PTX assembly (not C intrinsics) for precise instruction control
- Specialized cache policies: L1::no_allocate for streaming data, L1::evict_last for reusable vectors
- Wider vectorized loads (128-256 bit) with byte-level PTX unpacking
- Compile-time specialization per exact K dimension
- Aggressive register budgeting (32-45 registers vs naive 80)
- Shared vector loading across multiple rows per block
- Hardware intrinsics for FP4/FP8 decoding (__nv_cvt_fp4x2_to_halfraw2)
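The register budgeting item in the list above matters because per-thread register use caps how many warps can be resident on a streaming multiprocessor. The sketch below estimates that cap from register count alone. The hardware parameters (65,536 registers per SM, 64 warps per SM, allocation rounded up to multiples of 8 registers per thread) are assumptions for illustration and should be checked against the target GPU. Shared memory and block-size limits are ignored.

```python
def max_resident_warps(regs_per_thread, regs_per_sm=65536, max_warps_per_sm=64,
                       warp_size=32, alloc_granularity=8):
    """Upper bound on resident warps per SM imposed by register usage alone.

    All default hardware parameters are assumptions for illustration.
    """
    # Round per-thread registers up to the assumed allocation granularity.
    regs = -(-regs_per_thread // alloc_granularity) * alloc_granularity
    warps_by_registers = regs_per_sm // (regs * warp_size)
    return min(max_warps_per_sm, warps_by_registers)


for regs in (32, 45, 80):
    print(f"{regs} registers/thread -> at most {max_resident_warps(regs)} warps/SM")
```

With these assumed parameters, 32 registers per thread allows the full 64 warps, 45 allows 42, and 80 allows 25. More resident warps give the scheduler more memory requests to overlap, which helps a bandwidth-limited kernel.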
What did NOT work:
- Double buffering with async copy (no improvement for GEMV)
- Loading entire B matrix into shared memory
- Processing 3-4 tiles per loop iteration (regression)
Key lesson from a participant: "The single most important thing I could have done was run Nsight Compute" to identify whether the bottleneck was memory or compute. Five optimization attempts failed because they targeted compute improvements on a memory-bandwidth-limited kernel.
Surprising finding: A pure PyTorch solution using torch._scaled_mm with cuBLAS FP4 backend achieved 22.4 microseconds -- within striking distance of hand-optimized kernels. One top submitter had never hand-written GPU code before.
9. Jane Street x GPU MODE NYC Hackathon (~February 2026)
| Field | Detail |
| Location | New York City (Jane Street office) |
| Prize Pool | $50,000 grand prize (highest single prize) |
| Format | In-person, 1-day hackathon, leaderboard-based |
| Theme | Model optimization |
| GPU hours consumed | 26,000 |
Winners:
- $50K Grand Prize: Nadav Timor and team
- 1st Place (separate track): Kyle Yu and team
Speakers/Panelists:
- Mark Saroufim (GPU MODE) -- keynote
- Tri Dao -- keynote
- Panel: Soumith Chintala, Sam Gross, Gregory Chanan (PyTorch tech leads)
Note: Jane Street talk "Making GPUs Actually Fast: A Deep Dive into Training Performance" was released alongside this event.
10. AMD x GPU MODE E2E Model Speedrun (March -- May 2026, CURRENT)
| Field | Detail |
| Prize Pool | $1,100,000 total cash |
| Hardware | AMD Instinct MI355X (CDNA4) |
| Format | Online, 2 phases: qualifier (kernel leaderboard) + finals (end-to-end inference) |
| Qualifier deadline | Registration: March 30; Submissions: April 6 |
| Finals models | Track 1: DeepSeek-R1-0528 ($350K); Track 2: Kimi K2.5 1T ($650K) |
Qualifier problems:
- MXFP4 GEMM (1000 pts)
- MLA Decode (1250 pts)
- MXFP4 MoE (1500 pts)
Covered in detail in gpu-mode-entry-guide.md and gpu-mode-advanced-strategy.md.
11. GPU MODE IRL Paris -- PyTorch Conference Europe (April 9, 2026, UPCOMING)
| Field | Detail |
| Date | April 9, 2026 (side event to PyTorch Conference Europe April 7-8) |
| Location | Paris, France |
| Hardware | NVIDIA Blackwell Ultra (via Verda) + Hopper (via Sesterce) |
| Theme | LLM speedrun: pre-train an LLM from scratch on Blackwell Ultra cluster (360 PFLOP/s BF16) + ML systems optimization tasks |
Winning Project Patterns
What approaches won?
IRL hackathons (project-based):
- Systems-level contributions that integrate into real frameworks (vLLM, PyTorch, CUTLASS)
- Novel architecture adaptations to new hardware (Blackwell Split L2, MoE on GB200)
- Algorithmic innovations (context-parallel attention, LLM-driven kernel tuning)
- NOT raw kernel speed -- creativity, novelty, and practical impact matter more
Online leaderboard competitions (speed-based):
- Hand-written CUDA/HIP with PTX assembly for critical paths
- Hardware intrinsics over manual bit manipulation
- Aggressive memory access optimization (coalesced loads, cache policies, vectorization)
- Profiler-driven iteration (Nsight Compute / rocprof-compute)
- High submission counts (top competitors: 500-3600+ submissions)
Common techniques across winners
- Profiling first, optimizing second -- identifying whether bottleneck is memory or compute before choosing optimization strategy
- Hardware intrinsics -- using vendor-provided intrinsics rather than reimplementing format conversion
- Register pressure management -- winning kernels use 32-45 registers vs naive 80
- Aggressive fusion -- fusing elementwise ops, normalization, gating, and compute into single kernels
- Systematic iteration -- hundreds to thousands of submissions, varying one parameter at a time
- Leveraging vendor libraries as baselines -- cuBLAS/hipBLASLt as floor, then exceeding with architecture-specific tuning
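To make "reimplementing format conversion" concrete, here is a sketch of the manual decoding that a conversion intrinsic replaces: unpacking two FP4 (E2M1) values from one byte. The magnitude table follows the E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit). The nibble order and the single scalar scale are assumptions for illustration, and the function names are hypothetical.

```python
# E2M1 magnitudes indexed by the 3 low bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4x2(byte, scale=1.0):
    """Decode two packed FP4 values from one byte and apply a block scale.

    Assumption: the low nibble holds the first element.
    """
    low = decode_e2m1(byte & 0xF)
    high = decode_e2m1((byte >> 4) & 0xF)
    return low * scale, high * scale


print(unpack_fp4x2(0x72))        # low nibble 0x2, high nibble 0x7
print(unpack_fp4x2(0xA1, 2.0))   # high nibble 0xA has the sign bit set
```

On a GPU, doing this with shifts, masks, and table lookups for every byte costs several instructions per element, whereas the intrinsic maps the conversion onto dedicated hardware.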
What surprised judges?
- Beginners competing with experts: A top NVFP4 competitor had never written GPU code before (Mark Saroufim was "surprised")
- LLM-generated kernels being competitive: K-Search (automated) beat human solutions on TriMul
- Library-based solutions matching hand-tuned: cuda.compute (high-level Python API) topped the PMPP leaderboard; torch._scaled_mm scored within range of hand-written NVFP4 kernels
- Framework contributions over raw speed: IRL #1 winners built infrastructure tools, not the fastest kernels
Speed improvements achieved
| Competition | Baseline | Winning | Improvement |
| NVFP4 GEMV | ~100 us (CuTe DSL) | ~22 us (CUDA) | ~4.5x |
| AMD FP8 GEMM | AMD in-house kernel | Community submission | Faster on some shapes |
| IRL #2 MoE inference | DeepEP baseline | Symmetric Minds | ~1.5x |
| IRL #2 attention gradients | XLA baseline | Flash-HOG | ~3x |
| CDNA4 GEMM optimization stack | 1.15 TFLOPS (naive) | 2,680 TFLOPS (optimized) | ~2,330x |
| Blackwell Split L2 | Standard reduction | Arun Demeure's kernel | 100x+ data transfer reduction |
Judging Criteria Differences
IRL Hackathons (project-based)
- Creativity and novelty of the approach
- Practical impact -- does it solve a real problem? Could it be merged into a real framework?
- Technical depth -- how well do you understand the hardware?
- Presentation quality -- ability to explain and demo in ~5 minutes
- Judging by a panel of experts (speakers, sponsors, organizers)
- Not purely speed -- a creative infrastructure contribution beats a marginally faster kernel
Online Leaderboard Competitions (speed-based)
- Purely speed -- geometric mean of benchmark times across test cases
- Compared against a "speed of light" baseline (theoretical hardware maximum)
- Correctness verification required (automated)
- Multiple GPU architectures tested (B200, H100, A100, L4 for NVIDIA; MI300X, MI355X for AMD)
- Final winners may be selected by sponsor after reproduction/verification (AMD competitions)
- Leaderboard placement is not always a guarantee of prize (AMD disclaimer)
"Speed of Light" Analysis in Practice
- Calculate the theoretical minimum runtime: max(compute_time, memory_time) for the specific operation
- For GEMM: compute_time = 2 * M * N * K / peak_FLOPS (compute-bound) and memory_time = (M*K + K*N + M*N) * bytes_per_element / peak_bandwidth (memory-bound)
- Real-world achievable: typically 53-58% of theoretical peak; hipBLASLt/cuBLAS achieve ~97% for large GEMMs
- The gap between your kernel and the SoL is your optimization opportunity
- Winners typically get within 2x of SoL; top competitors within 1.5x
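The two GEMM expressions above can be turned into a small calculator. This is a sketch under stated assumptions: the peak FLOP/s and bandwidth values are round illustrative numbers, not vendor specifications, and the function name is hypothetical.

```python
def gemm_speed_of_light(m, n, k, bytes_per_element, peak_flops, peak_bandwidth):
    """Return (lower-bound seconds, limiting resource) for an M x N x K GEMM."""
    compute_s = 2 * m * n * k / peak_flops
    memory_s = (m * k + k * n + m * n) * bytes_per_element / peak_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


# Illustrative peaks only, not vendor specifications.
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s
PEAK_BANDWIDTH = 4.0e12  # 4 TB/s

for shape in ((8192, 8192, 8192), (1, 8192, 8192)):
    seconds, bound = gemm_speed_of_light(*shape, 2, PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"M,N,K={shape}: SoL {seconds * 1e6:.1f} us ({bound}-bound)")
```

With these numbers the large square GEMM is compute-bound, while the M=1 case (effectively a GEMV) is memory-bound. Dividing a measured kernel time by the returned lower bound gives the "within 2x of SoL" style ratio.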
Winner Profiles
Known Top Competitors
| Name/Team | Background | Notable Wins |
| "Nader" (NVIDIA CCCL team) | Industry (NVIDIA) | PMPP leaderboard top across 4 GPU architectures |
| Arun Demeure | Independent/industry | SemiAnalysis hackathon 1st place (Blackwell Split L2) |
| Symmetric Minds (team) | Unknown (likely industry/research) | IRL #2 1st place (MoE inference on GB200) |
| Flash Hogs (team) | Unknown | IRL #2 2nd place (Flash-HOG attention gradients) |
| Delta Net (team) | Unknown | IRL #2 3rd place (context-parallel Gated DeltaNet) |
| KernelEvolve (team) | Unknown (LLM+RL approach) | IRL #2 4th place (automated kernel tuning) |
| Nadav Timor | Unknown | Jane Street $50K grand prize winner |
| Kyle Yu | Unknown | Jane Street hackathon 1st place |
| Marcel Roed, Herman Brunborg, Rajat V. Dwaraknath | PhD students, Stanford | Modular/Mojo hackathon winners |
| "Danishlynx" | Unknown | Current AMD E2E Speedrun leader (top in 2/3 kernels) |
Academic vs Industry vs Indie Breakdown
Based on available data, the competitor pool is diverse:
- Industry engineers (NVIDIA, AMD, Meta, etc.) have a natural advantage from daily exposure to GPU hardware. The NVIDIA CCCL team dominated the PMPP leaderboard.
- PhD students (Stanford, etc.) have won Mojo and project-based hackathons with novel research contributions.
- Independent/indie developers have won speed competitions. One NVFP4 top performer had no prior GPU code experience.
- Teams using LLM-assisted approaches (KernelEvolve, K-Search) are emerging as a new category, sometimes beating humans.
Pattern: IRL hackathons tend to be won by people with deep systems knowledge (able to build complete systems in 14 hours). Online leaderboards are more accessible to newcomers who can iterate via brute-force submission.
Team Sizes That Win
- IRL hackathons: Small teams of 2-4 people
- Online leaderboards: Often individuals (the leaderboard tracks individual submissions)
- Current AMD competition: Teams of up to 3 allowed; unclear if top performers are solo or teams
Evolution of Difficulty
2024: "Build Something Cool"
- CUDA MODE IRL #1 was open-ended: build any GPU project
- Low bar for participation; 200 attendees, 40+ teams
- Winners were framework contributions, not optimized kernels
- Community was ~10K members
Early 2025: "Write a Fast Kernel" (Standard Problems)
- PMPP leaderboard: textbook problems (vector add, sort, prefix sum)
- These are well-understood algorithms with known optimal solutions
- NVIDIA's CUB library could top the leaderboard out of the box
- Bar: know your GPU programming fundamentals
Mid 2025: "Write a Fast Production Kernel" ($100K Competitions)
- AMD DeepSeek competition: real LLM inference kernels (FP8 GEMM, MoE, MLA)
- These require understanding both the algorithm AND the specific hardware
- Community submissions beat AMD's own in-house kernels on some shapes
- Bar: deep understanding of specific GPU microarchitecture + algorithm expertise
Late 2025: "Write a Fast Bleeding-Edge Kernel" (Blackwell NVFP4)
- New hardware (Blackwell), new format (NVFP4), new instructions
- No existing optimized libraries to copy from
- Required PTX assembly and undocumented hardware behavior
- Bar: hardware reverse-engineering + low-level optimization mastery
2026: "Optimize an Entire Model End-to-End" ($1.1M)
- AMD E2E Speedrun: not just kernels, but full inference optimization
- MXFP4 format on MI355X (CDNA4): brand-new hardware with limited documentation
- Multiple interacting kernels (GEMM + MoE + MLA) scored via geometric mean
- Finals: end-to-end DeepSeek-R1 and Kimi K2.5 inference
- Bar: systems engineering across the full stack, not just single-kernel optimization
Summary: The bar has risen from "write any GPU code" to "optimize bleeding-edge operations on unreleased hardware, end-to-end, faster than vendor libraries."
Lessons for the Current E2E Speedrun
What can we learn from past winners' approaches?
- Profile before optimizing. The NVFP4 competitor who wasted 5 attempts on compute optimizations for a memory-bound kernel is a cautionary tale. Run rocprof-compute first.
- Start with vendor baselines. cuBLAS/hipBLASLt/AITER provide strong floors. In the NVFP4 competition, torch._scaled_mm achieved near-competitive results. In the AMD competition, AITER's assembly kernels are the starting point for MLA Decode.
- Geometric mean scoring punishes imbalance. You MUST submit to all three kernels. Being top 5 in one kernel but bottom 50% in others loses to someone top 20 in all three. Past AMD competitions used similar weighted scoring.
- High submission count correlates with placement. Top performers submit 500-3600+ times. The iteration loop (profile -> hypothesize -> implement -> test -> benchmark -> submit) is the core skill.
- Hand-written HIP/CUDA with assembly beats Triton for GEMMs. Triton is 30-50% slower for GEMM because it cannot express ping-pong scheduling or GLOBAL_LOAD_LDS. But Triton is viable for MLA and MoE dispatch logic.
- Hardware intrinsics matter more than clever algorithms. Using __nv_cvt_fp4x2_to_halfraw2 instead of manual bit manipulation was the difference between competitive and not in NVFP4. For AMD, the MFMA 16x16x128 instruction with FP4 inputs is the equivalent.
- Register pressure is a hidden killer. Winning NVFP4 kernels used 32-45 registers; naive implementations used 80. On CDNA4, register pressure affects occupancy and ping-pong scheduling.
- Community beats vendor (sometimes). AMD's own in-house FP8 GEMM kernel was beaten by community submissions on some shapes. This means there IS room to exceed existing baselines.
- Newcomers can compete. One NVFP4 top performer had no GPU coding experience. Modern tools (cuda.compute, Triton, torch._scaled_mm) lower the barrier significantly.
- For IRL events: build something that matters. Winners built tools that could be merged into vLLM, PyTorch, or CUTLASS -- not academic exercises.
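A worked example of the imbalance point in the list above, using hypothetical per-kernel slowdown ratios relative to the best submission (1.0 is best, lower is better). The specialist leads one kernel but is 3x off on the other two; the generalist is 1.3x off on all three.

```latex
\begin{align*}
G_{\text{specialist}} &= (1.0 \times 3.0 \times 3.0)^{1/3} = 9^{1/3} \approx 2.08 \\
G_{\text{generalist}} &= (1.3 \times 1.3 \times 1.3)^{1/3} = 1.30
\end{align*}
```

The generalist wins comfortably without leading any single kernel. The numbers are illustrative only and do not come from a real leaderboard.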
Common mistakes to avoid
- Optimizing compute on a memory-bound kernel (or vice versa). Profile first.
- Neglecting one kernel in geometric-mean scoring. A zero in any kernel = zero overall.
- Over-engineering early. Start with a working baseline, submit it, then iterate.
- Not using --mode test before --mode benchmark. Silent correctness failures waste time.
- Ignoring scale factor shuffling for MXFP4. Getting this wrong produces silently wrong results.
- Trying to implement everything from scratch instead of building on AITER/CK-Tile/hipBLASLt.
- Not tracking submissions systematically. Track: timestamp, kernel, score, what changed.
- Spending too long reading papers before submitting. The top competitors learn by iterating on submissions, not by reading.
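For the submission-tracking point in the list above, a minimal sketch of a CSV log with the four suggested fields (timestamp, kernel, score, what changed). The class, function, and kernel names are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass
class Submission:
    timestamp: str
    kernel: str
    score_us: float
    change: str


def log_submission(writer, kernel, score_us, change):
    """Append one row describing a submission and return the record."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    row = Submission(now, kernel, score_us, change)
    writer.writerow(asdict(row))
    return row


buffer = io.StringIO()  # stand-in for open("submissions.csv", "a", newline="")
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Submission)])
writer.writeheader()
log_submission(writer, "mxfp4_gemm", 31.5, "baseline")
log_submission(writer, "mxfp4_gemm", 28.9, "wider vector loads")
print(buffer.getvalue())
```

Recording one change per row keeps the log usable for the "vary one parameter at a time" approach: any regression can be traced to a single edit.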
All GPU MODE Competition Events Summary
| # | Event | Date | Format | Hardware | Prize Pool | Key Winners |
| 1 | CUDA MODE IRL #1 | Sep 2024 | In-person, project | NVIDIA GPUs | ~$40K credits + RTX 4080s | Flexible attention masks, NCCL in Triton |
| 2 | PMPP Leaderboard | Feb 2025 | Online, speed | B200/H100/A100/L4 | Recognition | NVIDIA CCCL ("Nader") |
| 3 | AMD DeepSeek Kernels R1 | ~Apr 2025 | Online, speed | MI300X | $100K cash | Community beat AMD in-house on FP8 GEMM |
| 4 | AMD Distributed Kernels R2 | ~2025 | Online, speed | MI300X (multi) | $100K cash | 4 winners selected by AMD |
| 5 | TriMul / AlphaFold3 | ~2025 | Online, speed | H100/A100/B200/MI300X | Merch | K-Search (automated) beat humans |
| 6 | Modular Mojo Hackathon | May 2025 | In-person, project | MI300X | Unknown | Stanford PhD team (Transformer training in Mojo) |
| 7 | SemiAnalysis / IRL #2 | Late 2025 | In-person, project | GB200 (Blackwell) | Credits + recognition | Symmetric Minds, Arun Demeure |
| 8 | NVFP4 Kernel Hackathon | Nov 2025-Feb 2026 | Online, speed | B200 | DGX Spark/RTX 5090/5080 per problem | PTX assembly specialists |
| 9 | Jane Street x GPU MODE | ~Feb 2026 | In-person, speed | Unknown | $50K grand prize | Nadav Timor ($50K), Kyle Yu (1st) |
| 10 | AMD E2E Speedrun | Mar-May 2026 | Online, speed+E2E | MI355X | $1.1M cash | IN PROGRESS |
| 11 | Paris IRL (PyTorch Conf) | Apr 9, 2026 | In-person, project | Blackwell Ultra + Hopper | TBD | UPCOMING |