Does your prompt vibe matter? · uwu vs Claude Code

✦

Everyday correctness

Basically unaffected

Tone barely moves accuracy on capable models (short/benchmark tasks). Emote freely.

✦

Refusals

Non-issue for coding

Cutesy tone won't make it refuse honest work. (Warmth does aid jailbreaks — different story.)

✦

Long agent runs ← you

Genuine evidence gap

Nobody's measured tone vs persistence rigorously. Best moves are below.

🪫

Long-running tasks — your real question

You've said twice this is what you care about, so it goes first. The honest version isn't tidy, but it's actionable.

⋆ the honest answer ⋆

Do long, hard agent runs go better if you're sweet to Claude — or worse if you're not?

No one has measured it rigorously. C5

Every controlled tone study uses short, single-turn tasks; the long-horizon agent benchmarks that exist don't vary tone. We found no work crossing the two, so anyone who tells you "being nice makes it finish hard jobs" (or "be harsh or it slacks") is guessing.

✎ So what actually helps on a long task

✓

Add an explicit persistence instruction. "Keep going until it's fully done and tested; don't stop to ask unless you're truly blocked." This is the lever with the most real signal (still anecdotal) — and it's free. C4

✓

Keep the spec & technical names crisp. Emote around the request; don't dissolve the requirements or identifiers into cuteness. On a long task a fuzzy spec compounds. C2

✓

Warmth fine; skip over-the-top flattery. The one Claude-specific clue we have hints heavy flattery correlates with less planning. "thanks, this is great" 👍 — "you're literally the best AI ever" maybe not. C4

The clues, honestly caveated: one indie study (500 calls, Sonnet 4.5 / Opus 4.7) found tone left correctness intact but flattery correlated with a lower visible planning-to-code ratio (0.42 vs 0.60) — a hypothesis, not proof, and it didn't test long tasks. Controlled tip/threat tests (Wharton) showed no significant aggregate effect; the "$200 tip", "winter-break laziness", and "threaten it" stories are folklore (the laziness test failed replication).
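The three moves above can be sketched as a tiny prompt builder. This is a minimal illustration, not a tested recipe: `build_agent_prompt`, the example spec, `fetch_user()` and `api/client.py` are all hypothetical names made up for the sketch. The persistence sentence is the one quoted above.

```python
def build_agent_prompt(spec: str, greeting: str = "", sign_off: str = "") -> str:
    """Wrap a crisp spec in optional warmth and append a persistence instruction."""
    persistence = (
        "Keep going until it's fully done and tested; "
        "don't stop to ask unless you're truly blocked."
    )
    # The spec is passed through untouched, so identifiers and requirements stay exact.
    parts = [greeting.strip(), spec.strip(), persistence, sign_off.strip()]
    # Empty pieces (no greeting, no sign-off) are simply dropped.
    return "\n\n".join(p for p in parts if p)


# Hypothetical task: the function and file names are placeholders.
prompt = build_agent_prompt(
    spec="Add retry with exponential backoff to `fetch_user()` in `api/client.py`; "
         "cover it with unit tests.",
    greeting="hii, thank you for helping with this one >u<",
    sign_off="thanks, this is great",
)
print(prompt)
```

The cuteness lives in the greeting and sign-off; the spec and the persistence line stay plain, which is the whole point.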

🎀

Why this is confusing: "tone" is four things

The whole topic gets muddled because people fuse four interventions that have opposite evidence. Keep them apart and everything clicks.

#1 Plain politeness: "please / thank you", friendly, cutesy. Mostly harmless, and it's what you actually do.

#2 Warmth / vulnerability / appeal: "my career depends on this", grandma framing. The jailbreak-relevant one.

#3 Engagement-seeking verbosity: chatty, open-ended register. The length & cost driver.

#4 Explicit persistence instructions: "keep going until it's done". The agent-effort one.

✨ Your uwu style is mostly #1 + a dash of #3. The scary findings are about #2. The long-task wins are about #4. Findings about one don't transfer to the others.

🎯

Everyday accuracy & coding

The most-studied dimension — and the most reassuring.

On short/benchmark tasks, tone has at most a small, direction-unstable effect on accuracy, and it largely vanishes on capable models or when aggregated. C2

Two 2025 studies disagree even on direction: one found rude prompts helped GPT-4o (80.8%→84.8%); the larger multi-family study found friendly/neutral prompts slightly better, significant only in scattered Humanities tasks, never STEM/coding. Net: a few points, unreliable, gone under aggregation.

The 2023 "be emotional → +115%" result largely didn't replicate on modern models. C1

A direct replication across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro & Llama 3 found a non-significant +1% (χ²=0.11, p=.74); the famous "+115%" was a cherry-picked best-of-eleven prompt (honest average ≈ 2.6%).

What actually degrades code: corrupting content tokens, not affect. C2

Cosmetic typos on filler words cost ~1–2%; semantic corruption (misspelling a function/API name, rephrasing the requirement) drove the worst drops (up to ~21% on weak models). So "pwease wwite a fwunction" is risky only because it mangles "function," not because it's cute.
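One way to act on this is a quick pre-send check that the content tokens survived the cuteness. The sketch below is an assumption-laden illustration: `missing_identifiers` and `parse_config` are hypothetical names, and the "required" list is something you would write yourself for each task. It only checks for verbatim presence; it says nothing about how a model will respond.

```python
import re


def missing_identifiers(prompt: str, identifiers: list[str]) -> list[str]:
    """Return the identifiers that do not appear verbatim, as whole tokens, in the prompt."""
    missing = []
    for name in identifiers:
        # Require an exact match that is not embedded in a longer identifier.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        if not re.search(pattern, prompt):
            missing.append(name)
    return missing


# Hypothetical prompts: same cuteness, different treatment of the content tokens.
cute_but_safe = "pwease wwite a function called parse_config that uses json.loads >u<"
cute_and_mangled = "pwease wwite a fwunction called pawse_config that uses json.woads >u<"
required = ["function", "parse_config", "json.loads"]

print(missing_identifiers(cute_but_safe, required))     # nothing missing
print(missing_identifiers(cute_and_mangled, required))  # all three were mangled
```

The first prompt keeps every content token intact even though the filler words are uwu-ified; the second loses all three.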

🚦

Refusals — two opposite stories

Don't fuse "won't help with benign work" and "complies with things it shouldn't." Tone pushes these in opposite directions.

Benign over-refusal: tone is a weak, second-order factor. C2

"Model choice is the most impacting variable, followed by the task… the prompt" had a relatively weaker effect. Recent Claude evals show very low benign refusal under neutral tests (Opus 4 ≈ 0.07%, Opus 4.5 ≈ 0.23%).

For you: little evidence a cutesy register makes Claude Code refuse honest coding (its exact behavior isn't directly benchmarked — so "no reason to worry," not a measured guarantee).

Harmful compliance: warmth/vulnerability lowers the guard (narrow domain). C1

In jailbreak/disinformation studies, polite + vulnerability-framed prompts raised harmful compliance vs neutral/rude (disinformation: 77%→94% with polite prompts on gpt-3.5; warm/vulnerable styles "+up to 57pp" on small models).

Caveat: the lever is compliance/vulnerability cues, not plain politeness; largest on older/smaller models, blurred by frontier ceiling effects. Irrelevant to honest coding — just interesting.

⏱️

Time, compute & effort

Modest, bidirectional, and mostly about verbosity — not your manners.

Chatty / engagement-seeking register → longer replies → more tokens, latency, cost. C2

One 2026 study: engagement-seeking prompts produced up to 90% longer replies than hyper-efficient ones — a reported maximum, not typical; longer ≠ better (density can fall).

The real compute lever is asking for concise output, not your tone. C1

Concise/draft-style prompting cuts generated tokens ~48–80% while matching accuracy. Output tokens are priced higher than input (often several-fold, varies by model/date), so output length dominates cost.

Pleasantries in the prompt are ~free for ordinary individual usage. C3

A "please"+"thank you" is a few input tokens → negligible per-request latency/cost. Sam Altman's "tens of millions of dollars" line (Apr 2025) was an offhand quip about aggregate input cost, not a measured study.
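To see why output length dominates and pleasantries round to zero, here is a back-of-envelope sketch. Every number in it is an assumption for illustration: the prices (with output priced 5x input), the 500/2,000 token request, and the ~10 extra tokens for a "please" and "thank you". Only the 48–80% reduction range comes from the text above. Real prices vary by model and date.

```python
# Hypothetical prices in USD per million tokens (placeholders, not any vendor's real rates).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD under the assumed prices above."""
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# Assumed baseline request: 500 input tokens, 2,000 output tokens.
baseline = request_cost(500, 2_000)
# Assumed ~10 extra input tokens for pleasantries.
with_pleasantries = request_cost(510, 2_000)
# Concise prompting: 48% and 80% fewer output tokens (range quoted in the text).
concise_low = request_cost(500, round(2_000 * (1 - 0.48)))
concise_high = request_cost(500, round(2_000 * (1 - 0.80)))

print(f"pleasantries add:   ${with_pleasantries - baseline:.5f}")
print(f"concise saves from: ${baseline - concise_low:.5f}")
print(f"concise saves up to: ${baseline - concise_high:.5f}")
```

Under these assumed numbers the pleasantries add about $0.00003 per request, while concise output saves roughly $0.014 to $0.024, a gap of several hundred-fold.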

Tone vs hidden reasoning-token budget = unstudied. C5

No study yet checks whether register changes thinking-token count (o-series / extended thinking / Gemini thinking). Structure (e.g. batching) is known to change it; tone is untested.

🤖

By model family

Robustness tracks capability/recency. Tested versions lag today's flagships, so these figures are a conservative upper bound on tone sensitivity: newer models are likely more robust, not less.

Family (tested versions)	Tone / style sensitivity	Status
Gemini (2.0 Flash tested)	"No statistically significant effects across any tone comparison"; most robust tested.	most robust
OpenAI GPT (4o, 4o-mini)	Tiny & conflicting: rude +4pp in one study; friendly slightly better & significant only in some Humanities tasks in another. Never STEM/coding.	small effect
Llama (3, 4 Scout)	Small significant Humanities effects (~2–3pp); EmotionPrompt no significant gain.	small effect
Claude (3 Opus; 4.x anecdote)	EmotionPrompt: no significant gain. No tone-vs-coding study exists. One blog test: polite/neutral/berating gave "nearly identical" code on Opus 4.6 (n=3).	coding: unstudied
Qwen / DeepSeek (code-task study)	Emotional paraphrases shift output variability/calibration, not directional correctness (|ΔPass| 0.20–0.24).	variability only
Grok	No primary tone/emotion study found.	no data

Meta-trend C2: stronger/reasoning-tuned models are more tone-robust (CoT "reduces sensitivity to prompt design") — but not invariant (the "sensitivity-consistency paradox"). Flagships referenced: GPT-5.5, Claude Opus 4.8 / Sonnet 4.6, Gemini 3.5, Llama 4, Grok 4.x, DeepSeek V4, Qwen3 (exact Grok/DeepSeek/Qwen sub-versions are aggregator-sourced, C3/C4).

💕

The whole thing, as a checklist

✓

Emote freely. Emoticons, warmth, >u< — no measurable accuracy or refusal cost. If it keeps you engaged, that helps your own prompt clarity.

✓

Spell technical tokens straight. File / function / API / library names and the core requirement nouns. The lever that moves correctness.

✓

On long/hard tasks, add a persistence instruction. "Keep going until it's fully done and tested." Anecdotal, but the lever with the most signal — and free.

✓

Keep the real ask in one clean clause. Wrap it in cuteness; don't dissolve it into cuteness.

▴

Skip over-the-top flattery on agent runs. The one signal we have hints that it correlates with less planning by Claude. Warmth fine; "you're the best ever" maybe skip.

▴

Skip emoji walls inside specs (token cost + rare-token weirdness). A few decorative ones are fine.

▴

Don't "be rude / threaten it for better results." That advice rests on one small single-model study that the larger study didn't reproduce.

▴

Want it cheap/fast? Ask for concise output — a far bigger compute lever than tone.

🔬

How this was made

Built on a project research guide at the 80/20 explainer tier: quote-first, verify-content-not-links, search-the-case-against, per-claim confidence tiers, independent cross-model red-team.

3-round maker / breaker loop (cross-model)

Drafted by Claude Opus 4.8 (maker) from parallel web-research agents, then adversarially red-teamed by OpenAI codex / gpt-5.5 (breaker) — a different model family, for bias diversity. Round 1: 20 issues (dominant one = conflating the four levers, and hardening short-task evidence into product claims). Round 2: all 7 HIGH items fixed, no over-hedging. Round 3 ("assume you were too lenient"): scoped accuracy claims to short/benchmark tasks, downgraded the indie Claude study to C4, relabeled the uwu-bench result as search-negative (C3). Load-bearing numbers (χ²=0.11/p=.74, 80.8→84.8%, per-family deltas) were re-fetched and verified verbatim against primary PDFs.

Honest limitations (logged, not buried)

Long-horizon persistence is a C5 gap — treat any confident claim (mine included) with suspicion.
Tone vs reasoning-tokens is unstudied (C5).
No Claude-specific tone-vs-coding study exists; the Claude evidence is one blog (C4) + a general EmotionPrompt replication.
Tested models slightly lag today's flagships; "newer = more robust" is an inference.
Several large refusal/emotion effects come from 2023–24 or small models — weighted down accordingly.
No human reviewed this; a cross-model breaker is not peer review.

📚

Sources

Key numbers breaker-verified verbatim against primary PDFs. Dated, grouped by tier.

Primary · C1

Does Tone Change the Answer? GPT/Gemini/LLaMA 2025-12 · Mind Your Tone (Penn State) 2025-10 · EmotionPrompt replication 2024-09 · EmotionPrompt original 2023-07 · Linguistic Styles as Jailbreak Vectors 2025-11 · Persuasive jailbreaker / "grandma" 2024-01 · Emotional prompting amplifies disinformation 2025-04 · No for Some, Yes for Others (refusal drivers) 2025-09 · Wharton: tips/threats no aggregate effect 2025-08 · NLPerturbator (code perturbation) 2024-06 · Chain of Draft 2025-02

Secondary / context · C2

Cooking Up Politeness (verbosity/energy, CHIIR '26) 2026-01 · Prompt Stability in Code LLMs 2025-09 · Does Prompt Formatting Impact LLMs? 2024-11 · Promptception (sensitivity-consistency paradox) 2025-09 · Cross-lingual politeness 2024-02 · Anthropic Claude Opus 4 / 4.5 system cards 2025

Anecdotal · C4

Polite-to-Claude tone test (Sonnet 4.5 / Opus 4.7) 2026-04 · "I called my Claude agent an idiot" (Opus 4.6) 2026-02 · Cursor "agent laziness" thread 2025-07 · "$200 tip" tweet 2023-12 · Lynch winter-break test 2023-12

Search-negative

No LLM benchmark named "uwu bench" found (absence is search-dependent) — the name collides with an unrelated NL→shell CLI and a SIMD "uwuifier" library.