Tone barely moves accuracy on capable models (short/benchmark tasks). Emote freely.
Cutesy tone won't make it refuse honest work. (Warmth does aid jailbreaks — different story.)
Nobody's measured tone vs persistence rigorously. Best moves are below.
Long-running tasks — your real question
You've said twice this is what you care about, so it goes first. The honest version isn't tidy, but it's actionable.
No one has measured it rigorously (C5).
Every controlled tone study uses short, single-turn tasks; the long-horizon agent benchmarks that exist don't vary tone. We found no work crossing the two, so anyone who tells you "being nice makes it finish hard jobs" (or "be harsh or it slacks") is guessing.
✎ So what actually helps on a long task
Why this is confusing: "tone" is four things
The whole topic gets muddled because people fuse four interventions that have opposite evidence. Keep them apart and everything clicks.
Everyday accuracy & coding
The most-studied dimension — and the most reassuring.
On short/benchmark tasks, tone has at most a small, direction-unstable effect on accuracy, and it largely vanishes on capable models or when results are aggregated (C2).
Two 2025 studies disagree even on direction: one found rude prompts helped GPT-4o (80.8%→84.8%); the larger multi-family study found friendly/neutral prompts slightly better, significant only in scattered Humanities tasks and never in STEM/coding. Net: a few points, unreliable, gone under aggregation.
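The numbers in this section mix percentage points (80.8%→84.8%) with relative gains (the "+115%" style of headline), and the two are easy to confuse. A minimal sketch of the difference, using only the GPT-4o figures quoted above; the function name `tone_delta` is made up for illustration:

```python
def tone_delta(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    pp = after_pct - before_pct           # absolute gap, in percentage points
    relative = 100.0 * pp / before_pct    # the same gap as a share of the baseline
    return pp, relative

pp, rel = tone_delta(80.8, 84.8)
print(f"{pp:+.1f} pp, {rel:+.2f}% relative")
```

The same shift reads as about 4 points or about 5% depending on the framing, which is worth checking before comparing effect sizes across studies.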
The 2023 "be emotional → +115%" result largely didn't replicate on modern models (C1).
A direct replication across GPT-4o, Claude 3 Opus, Gemini 1.5 Pro & Llama 3 found a non-significant +1% (χ²=0.11, p=.74); the famous "+115%" was a cherry-picked best-of-eleven prompt (honest average ≈ 2.6%).
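As a sanity check, the reported statistic and p-value are consistent with each other if the test had one degree of freedom. That is an assumption here (the source quotes χ² and p but not the degrees of freedom). A minimal sketch using only the standard library:

```python
import math

def chi2_pvalue_df1(stat: float) -> float:
    """p-value for a chi-square statistic, ASSUMING 1 degree of freedom."""
    # With df = 1 the statistic is the square of a standard normal Z, so
    # P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

p_value = chi2_pvalue_df1(0.11)
print(f"p = {p_value:.2f}")  # p = 0.74
```

Under that assumption, χ²=0.11 gives p ≈ .74, matching the quoted figure: a difference this small is well within what chance alone would produce.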
What actually degrades code: corrupting content tokens, not affect (C2).
Cosmetic typos on filler words cost ~1–2%; semantic corruption (misspelling a function/API name, rephrasing the requirement) drove the worst drops (up to ~21% on weak models). So "pwease wwite a fwunction" is risky only because it mangles "function," not because it's cute.
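If you want to emote without touching content tokens, one option is to check a stylized prompt against the plain one before sending it. This is a rough sketch, not a method from any of the cited studies; `mangled_terms`, `parse_config`, and `config.yaml` are hypothetical names chosen for the example:

```python
import re

def mangled_terms(original: str, stylized: str, protected: list[str]) -> list[str]:
    """Return protected terms present in `original` but not verbatim in `stylized`."""
    def has_term(text: str, term: str) -> bool:
        # Whole-token, case-sensitive match, so `parse` would not match `parser`.
        return re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text) is not None
    return [t for t in protected if has_term(original, t) and not has_term(stylized, t)]

original = "Please write a function parse_config that reads config.yaml"
cute_safe = "pwease write a function parse_config that reads config.yaml uwu"
cute_risky = "pwease wwite a fwunction pawse_config that weads config.yaml"
protected = ["function", "parse_config", "config.yaml"]  # hypothetical content tokens

print(mangled_terms(original, cute_safe, protected))   # []
print(mangled_terms(original, cute_risky, protected))  # ['function', 'parse_config']
```

The first variant only decorates filler words; the second corrupts the words the model actually needs, which is the kind of perturbation the code study flagged.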
Refusals — two opposite stories
Don't fuse "won't help with benign work" and "complies with things it shouldn't." Tone pushes these in opposite directions.
Benign over-refusal: tone is a weak, second-order factor (C2).
"Model choice is the most impacting variable, followed by the task… the prompt" had a relatively weaker effect. Recent Claude evals show very low benign refusal under neutral tests (Opus 4 ≈ 0.07%, Opus 4.5 ≈ 0.23%).
For you: little evidence a cutesy register makes Claude Code refuse honest coding (its exact behavior isn't directly benchmarked — so "no reason to worry," not a measured guarantee).
Harmful compliance: warmth/vulnerability lowers the guard (narrow domain) (C1).
In jailbreak/disinformation studies, polite + vulnerability-framed prompts raised harmful compliance vs neutral/rude (disinfo 77%→94% polite on gpt-3.5; warm/vulnerable styles "+up to 57pp" on small models).
Caveat: the lever is compliance/vulnerability cues, not plain politeness; largest on older/smaller models, blurred by frontier ceiling effects. Irrelevant to honest coding — just interesting.
Time, compute & effort
Modest, bidirectional, and mostly about verbosity — not your manners.
Chatty / engagement-seeking register → longer replies → more tokens, latency, cost (C2).
One 2026 study: engagement-seeking prompts produced up to 90% longer replies than hyper-efficient ones — a reported maximum, not typical; longer ≠ better (density can fall).
The real compute lever is asking for concise output, not your tone (C1).
Concise/draft-style prompting cuts generated tokens ~48–80% while matching accuracy. Output tokens are priced higher than input (often several-fold, varies by model/date), so output length dominates cost.
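To see why output length dominates, here is a back-of-envelope cost model. The prices and token counts are hypothetical placeholders (not taken from any vendor or from the studies above); only the 48% and 80% reductions come from the text:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices per one million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000

# Hypothetical prices, NOT from the source: $3 / 1M input tokens, $15 / 1M output tokens.
IN_PRICE, OUT_PRICE = 3.0, 15.0
PROMPT_TOKENS = 500      # hypothetical prompt size
VERBOSE_OUTPUT = 2000    # hypothetical verbose reply length

verbose = request_cost(PROMPT_TOKENS, VERBOSE_OUTPUT, IN_PRICE, OUT_PRICE)
cut_48 = request_cost(PROMPT_TOKENS, 1040, IN_PRICE, OUT_PRICE)  # 48% fewer output tokens
cut_80 = request_cost(PROMPT_TOKENS, 400, IN_PRICE, OUT_PRICE)   # 80% fewer output tokens
output_share = VERBOSE_OUTPUT * OUT_PRICE / (verbose * 1_000_000)

print(f"verbose: ${verbose:.4f}, output share {output_share:.0%}")
print(f"48% cut: ${cut_48:.4f}; 80% cut: ${cut_80:.4f}")
```

With these made-up numbers, output tokens account for roughly 95% of the request cost, so trimming the reply moves the bill far more than anything done to the prompt's wording.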
Pleasantries in the prompt are ~free for ordinary individual usage (C3).
A "please"+"thank you" is a few input tokens → negligible per-request latency/cost. Sam Altman's "tens of millions of dollars" line (Apr 2025) was an offhand quip about aggregate input cost, not a measured study.
Tone vs hidden reasoning-token budget = unstudied (C5).
No study yet checks whether register changes thinking-token count (o-series / extended thinking / Gemini thinking). Structure (e.g. batching) is known to; tone is untested.
By model family
Robustness tracks capability/recency. Tested versions lag today's flagships, so the sensitivities below are a conservative upper bound: newer models are likely more robust, not less.
| Family | Tone / style sensitivity | Status |
|---|---|---|
| Gemini (2.0 Flash tested) | "No statistically significant effects across any tone comparison"; most robust tested. | most robust |
| OpenAI GPT (4o, 4o-mini) | Tiny & conflicting: rude +4pp in one study; friendly slightly better & significant only in some Humanities tasks in another. Never STEM/coding. | small effect |
| Llama (3, 4 Scout) | Small significant Humanities effects (~2–3pp); EmotionPrompt no significant gain. | small effect |
| Claude (3 Opus; 4.x anecdote) | EmotionPrompt: no significant gain. No tone-vs-coding study exists. One blog test: polite/neutral/berating gave "nearly identical" code on Opus 4.6 (n=3). | coding: unstudied |
| Qwen / DeepSeek (code-task study) | Emotional paraphrases shift output variability/calibration, not directional correctness (\|ΔPass\| 0.20–0.24). | variability only |
| Grok | No primary tone/emotion study found. | no data |
Meta-trend C2: stronger/reasoning-tuned models are more tone-robust (CoT "reduces sensitivity to prompt design") — but not invariant (the "sensitivity-consistency paradox"). Flagships referenced: GPT-5.5, Claude Opus 4.8 / Sonnet 4.6, Gemini 3.5, Llama 4, Grok 4.x, DeepSeek V4, Qwen3 (exact Grok/DeepSeek/Qwen sub-versions are aggregator-sourced, C3/C4).
The whole thing, as a checklist
>u<: no measurable accuracy or refusal cost. If it keeps you engaged, that helps your own prompt clarity.
How this was made
Built on a project research guide at the 80/20 explainer tier: quote-first, verify-content-not-links, search-the-case-against, per-claim confidence tiers, independent cross-model red-team.
3-round maker / breaker loop (cross-model)
Drafted by Claude Opus 4.8 (maker) from parallel web-research agents, then adversarially red-teamed by OpenAI codex / gpt-5.5 (breaker) — a different model family, for bias diversity. Round 1: 20 issues (dominant one = conflating the four levers, and hardening short-task evidence into product claims). Round 2: all 7 HIGH items fixed, no over-hedging. Round 3 ("assume you were too lenient"): scoped accuracy claims to short/benchmark tasks, downgraded the indie Claude study to C4, relabeled the uwu-bench result as search-negative (C3). Load-bearing numbers (χ²=0.11/p=.74, 80.8→84.8%, per-family deltas) were re-fetched and verified verbatim against primary PDFs.
Honest limitations (logged, not buried)
- Long-horizon persistence is a C5 gap — treat any confident claim (mine included) with suspicion.
- Tone vs reasoning-tokens is unstudied (C5).
- No Claude-specific tone-vs-coding study exists; the Claude evidence is one blog (C4) + a general EmotionPrompt replication.
- Tested models slightly lag today's flagships; "newer = more robust" is an inference.
- Several large refusal/emotion effects come from 2023–24 or small models — weighted down accordingly.
- No human reviewed this; a cross-model breaker is not peer review.
Sources
Key numbers breaker-verified verbatim against primary PDFs. Dated, grouped by tier.
Does Tone Change the Answer? GPT/Gemini/LLaMA 2025-12 · Mind Your Tone (Penn State) 2025-10 · EmotionPrompt replication 2024-09 · EmotionPrompt original 2023-07 · Linguistic Styles as Jailbreak Vectors 2025-11 · Persuasive jailbreaker / "grandma" 2024-01 · Emotional prompting amplifies disinformation 2025-04 · No for Some, Yes for Others (refusal drivers) 2025-09 · Wharton: tips/threats no aggregate effect 2025-08 · NLPerturbator (code perturbation) 2024-06 · Chain of Draft 2025-02
Cooking Up Politeness (verbosity/energy, CHIIR '26) 2026-01 · Prompt Stability in Code LLMs 2025-09 · Does Prompt Formatting Impact LLMs? 2024-11 · Promptception (sensitivity-consistency paradox) 2025-09 · Cross-lingual politeness 2024-02 · Anthropic Claude Opus 4 / 4.5 system cards 2025
Polite-to-Claude tone test (Sonnet 4.5 / Opus 4.7) 2026-04 · "I called my Claude agent an idiot" (Opus 4.6) 2026-02 · Cursor "agent laziness" thread 2025-07 · "$200 tip" tweet 2023-12 · Lynch winter-break test 2023-12
No LLM benchmark named "uwu bench" found (absence is search-dependent) — the name collides with an unrelated NL→shell CLI and a SIMD "uwuifier" library.