Advent of Slop

December 2025

Advent of Slop is a benchmark study designed to test how the current generation of Large Language Models (LLMs) performs nearly autonomously on tasks described in natural language. By providing models with raw, multi-part problem descriptions from Advent of Code 2025, the project evaluates their ability to reason through challenging problems across different programming languages and produce executable code without human intervention. The focus is on measuring correctness, variance across multiple passes, and cost-efficiency of models ranging from "nano" to "Opus" tiers.

Introduction

The idea for Advent of Slop was born from my yearly dilemma: wanting to participate in Advent of Code but being too busy to actually do it. Given personal time constraints and the inescapable hype around LLMs in software engineering, it was a no-brainer to take on this project to spice up the holidays.

The premise is simple: ask models to "act like a developer." This means managing the full process autonomously: reading a wall of text, figuring out that there may be two parts to the problem, handling file I/O, and printing numeric answers.

Methodology

1. The Dataset

The evaluation uses a structured dataset containing the full problem descriptions and verified solutions for the 2025 calendar. Each day consists of:

  • The Problem: The full text (Part 1 and Part 2) as seen on the AoC website.
  • The Input: Unique puzzle inputs stored as .txt files.
  • The Ground Truth: Expected numeric or string answers to validate the model's output.
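
For concreteness, loading one day's material is a handful of file reads. The sketch below assumes a hypothetical layout of data/dayNN/ with problem.txt, input.txt, and answers.txt; the repository's actual structure may differ.

    from pathlib import Path

    def load_day(day: int, root: Path = Path("data")) -> dict:
        """Load one day's problem text, puzzle input path, and expected answers.

        The data/dayNN/{problem,input,answers}.txt layout is a hypothetical
        example used for illustration, not necessarily the project's structure.
        """
        day_dir = root / f"day{day:02d}"
        return {
            "problem": (day_dir / "problem.txt").read_text(),        # Part 1 + Part 2 text
            "input_path": day_dir / "input.txt",                      # passed to the generated program
            "answers": (day_dir / "answers.txt").read_text().split(), # ground truth, one answer per part
        }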

2. The Contenders

I selected a lineup of the latest models available as of late 2025, covering three major providers via LiteLLM:

Provider     Models Evaluated
OpenAI       gpt-5.2, gpt-5-mini, gpt-5-nano
Anthropic    claude-opus-4.5, claude-sonnet-4.5, claude-haiku-4.5
Google       gemini-3-pro-preview, gemini-3-flash-preview

3. The Languages

The selection includes popular programming languages balancing interpreted vs. compiled, strict vs. flexible syntax, and imperative vs. functional paradigms. A key constraint was avoiding excessive verbosity to ensure responses fit within context windows — particularly challenging for smaller models like GPT-5 nano. This led to choosing JavaScript over TypeScript and dropping candidates like Java and C#.

  1. Python
  2. Rust
  3. Go
  4. JavaScript
  5. C++
  6. C
  7. R

4. The Execution Pipeline

The project uses a custom orchestrator built in Python (src/evaluator.py) and a multi-language execution engine (src/executor.py).

  1. Prompting: Models receive a consistent system prompt and the full puzzle text. They must output code that reads input from a file path passed as the first command-line argument (sys.argv[1] in Python, or the equivalent in other languages).
  2. Generation: The model generates code in one of the supported languages.
  3. Execution: The Executor handles compilation (for Rust, C, C++) or direct interpretation (Python, JS, R) and runs the code against the puzzle input with a 30-second timeout.
  4. Validation: The script captures stdout and compares it to the ground truth.
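
Put together, the pipeline is a generate-execute-compare loop. The following is a simplified sketch of that flow rather than the actual src/evaluator.py; the system prompt wording, the assumption that the model returns a bare Python script, and the result handling are all illustrative.

    import subprocess
    import sys

    from litellm import completion

    SYSTEM_PROMPT = (
        "You are a developer. Solve both parts of the puzzle below. "
        "Read the puzzle input from the file path given as the first "
        "command-line argument and print one answer per line."
    )  # illustrative wording, not the project's actual system prompt

    def evaluate(model: str, puzzle_text: str, input_path: str, expected: list[str]) -> bool:
        # 1. Prompting + 2. Generation: one-shot request with the full puzzle text.
        response = completion(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": puzzle_text},
            ],
        )
        code = response.choices[0].message.content  # assume a bare Python script comes back

        # 3. Execution: run the generated script against the puzzle input (30 s timeout).
        with open("solution.py", "w") as f:
            f.write(code)
        result = subprocess.run(
            [sys.executable, "solution.py", input_path],
            capture_output=True, text=True, timeout=30,
        )

        # 4. Validation: compare stdout lines to the ground-truth answers.
        return result.stdout.split() == expected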

Results

Overall Model Accuracy Ranking

Mean accuracy averaged across all programming languages, problems, and passes.

Gemini 3 Pro leads the rankings at 70.4%, though no model achieves what might be considered "reliable" performance. The biggest surprises are GPT-5 mini and Gemini 3 Flash, which come remarkably close to their larger counterparts, perhaps suggesting that for this type of task, the mid-tier models offer compelling cost-performance tradeoffs.
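
For reference, the ranking is a plain aggregation over every recorded outcome. Below is a sketch of the computation, assuming a hypothetical results.csv with model, language, day, pass, and correct columns; the project's actual storage format may differ.

    import pandas as pd

    # Hypothetical flat results file: one row per (model, language, day, pass).
    results = pd.read_csv("results.csv")  # 'correct' is 1 for a matching answer, else 0

    # Mean accuracy per model, averaged over all languages, problems, and passes.
    ranking = (
        results.groupby("model")["correct"]
               .mean()
               .mul(100)
               .sort_values(ascending=False)
               .round(1)
    )
    print(ranking)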

Performance by programming language

This breakdown shows which models are generalists and which specialize in particular languages.

Python: Model Accuracy Ranking

Mean accuracy in Python averaged across all problems and passes.

Several patterns emerge from the per-language breakdown:

  • Gemini 3 Pro dominates interpreted languages, hitting 90.3% on JavaScript and 87.5% on Python. This suggests strong training data coverage for these ubiquitous languages.
  • GPT-5 mini punches above its weight in Python (72.2%) and JavaScript (73.6%), actually outperforming GPT-5.2 in these languages.
  • Claude models show surprising weakness in Python, with Opus at just 44.4% compared to its stronger performance in Go (63.4%) and Rust (62.5%).
  • R is universally challenging, with even the best performer (Gemini 3 Pro) only reaching 58.3%. The smaller models essentially fail completely.
  • Nano-tier models collapse on compiled languages: GPT-5 nano scores 0% on both Rust and C, suggesting these models lack the capacity to handle strict type systems and compilation requirements.

Cross-language comparison

Model Performance Across Languages

Accuracy percentage for each model-language combination. Brighter shades indicate higher performance.

The heatmap reveals clear patterns in model-language affinities:

  • JavaScript and Python form a "safe zone" with generally higher scores across the board (brighter colors concentrated in these columns).
  • C is the great equalizer: even top models struggle, with scores clustering in the 47-62% range for every model above the nano tier.
  • GPT-5 mini has a notable blind spot in Go (29.2%), performing worse than even Claude Haiku in this language. This anomaly warrants further investigation.
  • Claude models show remarkable consistency across languages, with less variance between their best and worst languages compared to the Gemini models.
  • Gemini 3 Flash outperforms Gemini 3 Pro in C (62.5% vs 47.2%), one of the few cases where a smaller model significantly beats its larger sibling.

Variance and failure analysis

Do the averages hide brittleness? This chart shows the standard deviation across all runs for each model.

Model Accuracy Variance Across All Conditions

Mean accuracy across all languages, problems, and passes. Error bars show ±1 standard deviation across all evaluation outcomes.

The variance analysis reveals important insights about model reliability:

  • Claude Opus shows the lowest variance (±9.0%) among competitive models, making it the most predictable performer despite not having the highest mean accuracy.
  • GPT-5 mini is the most volatile mid-tier model at ±20.3%, meaning its performance swings wildly between excellent and poor depending on the problem and language.
  • Gemini 3 Pro's high mean comes with high variance (±17.8%), suggesting its top scores are offset by occasional significant failures.
  • GPT-5 nano's low variance is misleading—it simply fails consistently, with a tiny standard deviation around a near-zero mean.

For production use cases where reliability matters, Claude Opus's combination of decent accuracy (53.6%) and low variance makes it an interesting choice over higher-scoring but less predictable alternatives — this obviously comes at a significant cost.
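
The error bars behind these numbers amount to a per-model spread over per-condition accuracies. One plausible way to compute them, reusing the hypothetical results table from earlier (the exact grouping used for the chart is an assumption):

    import pandas as pd

    results = pd.read_csv("results.csv")  # same hypothetical schema as above

    # Accuracy per (model, language, day), then mean and ±1 standard deviation per model.
    per_condition = (
        results.groupby(["model", "language", "day"])["correct"]
               .mean()
               .mul(100)
    )
    summary = per_condition.groupby("model").agg(mean="mean", std="std").round(1)
    print(summary.sort_values("mean", ascending=False))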

Discussion

The multi-language challenge

One of the most interesting aspects of this project is the multi-language requirement. Compiled languages like Rust and C++ added real complexity: models must not only solve the logic but also produce code that passes the compiler, with no missing semicolons, no type mismatches, and no forgotten imports. This significantly increased the effort required to make the pipeline stable enough to be worth evaluating.
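
To make the extra moving parts concrete, here is a minimal compile-and-run sketch for a single compiled language. The helper name and file handling are hypothetical; the real src/executor.py also covers C, C++, and the interpreted languages.

    import subprocess

    def run_rust(source_path: str, input_path: str, timeout: int = 30) -> str:
        """Compile a generated Rust file and run the binary against the puzzle input.

        Illustrative helper only; not the project's actual executor code.
        """
        # Any compiler error (type mismatch, missing semicolon, bad import) fails here.
        subprocess.run(["rustc", "-O", source_path, "-o", "solution"], check=True)

        # Run the compiled binary with the puzzle input path as its first argument.
        result = subprocess.run(
            ["./solution", input_path],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout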

Security considerations

Executing LLM-generated code locally is inherently risky. For this experiment, I accepted the risk given the controlled scope: execution was sandboxed with timeouts and no network access. Needless to say, any production implementation would require proper isolation.

Temperature sensitivity

Temperature significantly affected performance, particularly for Claude models: Opus, for example, performed notably better with temperature left at the default (unspecified) than with it explicitly set to 0. In general, the better option seems to be leaving temperature unspecified and tweaking reasoning parameters instead.
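
Operationally, "leaving temperature unspecified" just means omitting the argument from the request. A small hedged sketch via LiteLLM (the wrapper and its kwargs handling are assumptions, not the project's code):

    from litellm import completion

    def ask(model: str, messages: list[dict], temperature: float | None = None):
        """Send a request, pinning temperature only when explicitly asked to."""
        kwargs = {}
        if temperature is not None:  # otherwise fall back to the provider default
            kwargs["temperature"] = temperature
        return completion(model=model, messages=messages, **kwargs)

    # Opus performed better with the provider default:
    #   ask("claude-opus-4.5", msgs)                  # default temperature
    #   ask("claude-opus-4.5", msgs, temperature=0)   # noticeably worse in this experiment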

The consistency struggle

Achieving consistent responses required far more effort than anticipated. This created a tension: the challenge was testing autonomous LLM problem-solving, so I had to carefully calibrate how much "help" to provide via prompting while keeping the evaluation fair. For this reason, I specifically didn't want to use structured outputs.

The system prompt went through several iterations to achieve consistent output formatting across all languages. Despite this, Rust, JavaScript, and C++ still required language-specific instructions. Models with chain-of-thought responses struggled most with consistency: Gemini 3 Pro would sometimes output verbose explanation chains instead of the requested format, at times answering the problem only partially.
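
A fair amount of that iteration came down to reliably pulling one code block out of a chatty response. Below is a generic extraction sketch (a plain fenced-block regex, not necessarily what the project's parser does):

    import re

    def extract_code(response_text: str) -> str:
        """Return the first fenced code block from a model response.

        Generic illustration; a real parser has to cope with language tags,
        multiple blocks, refusals, and trailing explanations.
        """
        match = re.search(r"```[A-Za-z+#]*\n(.*?)```", response_text, re.DOTALL)
        if match:
            return match.group(1)
        # Fall back to treating the whole response as code (risky for verbose models).
        return response_text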

Reasoning effort

Does allocating more "thinking tokens" improve performance? The reasoning_effort parameter controls how many tokens models can use for internal reasoning.

reasoning_effort    budget_tokens
none                0
low                 1024
medium              2048
high                4096

Note: Claude Sonnet 4.5 and Claude Opus 4.5 do not support reasoning_effort set to none.
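
Per request, the effort level has to be translated into provider parameters. A hedged sketch of that wiring, using LiteLLM's Anthropic-style extended-thinking option for Claude and reasoning_effort elsewhere (treat the exact parameter mapping as an assumption, not the project's implementation):

    from litellm import completion

    # Token budgets used in this experiment for each reasoning_effort level.
    BUDGETS = {"none": 0, "low": 1024, "medium": 2048, "high": 4096}

    def ask_with_effort(model: str, messages: list[dict], effort: str):
        """Request a completion with a bounded reasoning budget."""
        if model.startswith("claude") and effort != "none":
            # Anthropic models take an explicit thinking-token budget.
            return completion(
                model=model,
                messages=messages,
                thinking={"type": "enabled", "budget_tokens": BUDGETS[effort]},
            )
        # Other providers accept reasoning_effort directly.
        return completion(model=model, messages=messages, reasoning_effort=effort)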

Impact of reasoning effort on accuracy

Mean accuracy averaged across all problems and selected languages (Python, Rust, C++).

The results challenge the assumption that more thinking always helps:

  • GPT-5.2 performs best at "low" reasoning (81.9%) and actually degrades significantly at "high" (56.9%). Over-thinking appears counterproductive.
  • Claude Opus benefits from more reasoning, improving from 61.1% at low to 70.8% at high; it is the only model showing clear gains from an increased reasoning budget.
  • Gemini 3 Pro peaks at "none" (84.7%) and steadily declines with more reasoning tokens. This may relate to its tendency toward verbose chain-of-thought outputs interfering with clean code generation.
  • Claude Sonnet shows inconsistent behavior, peaking at "medium" (41.7%) before dropping at "high" (38.9%).

The takeaway: reasoning effort is not a simple "more is better" parameter. Optimal settings vary by model and likely by task type. For straightforward coding challenges, minimal reasoning often suffices. Higher reasoning effort, especially for larger, reasoning-oriented models, mostly leads to lengthy chain-of-thought outputs that interfere with clean code generation, eventually resulting in partially answered problems, response parsing errors, or code that cannot be executed or compiled. Smaller models simply answer as requested. This may suggest that larger, reasoning-oriented models are specifically fine-tuned to respond with chain-of-thought.

Needless to say, implementing structured outputs might solve this issue and improve overall performance; however, that was not the goal of this experiment and remains to be explored.

Other observations

Several additional patterns emerged during the evaluation:

  • External library dependencies: Models frequently tried to import external libraries despite explicit instructions to use only the standard library. This required prompt engineering to address.
  • Output format inconsistency: Gemini 3 Pro would sometimes output unwanted JSON structures with "part" and "solution" keys instead of the requested plain format, even with temperature at 0.
  • Run-to-run variance: Overall accuracy varied by approximately ±5% between runs. Gemini models stabilized significantly after prompt tweaks, while Claude models were more sensitive to temperature settings.

Conclusion

Advent of Slop reveals that current LLMs can autonomously solve simple programming challenges, but not reliably. The best performer, Gemini 3 Pro, achieves only 70.4% accuracy, and that comes with significant variance.

Key findings:

  1. Model size isn't everything. Mid-tier models (GPT-5 mini, Gemini 3 Flash) often approach or match their larger counterparts in one-shot tasks.
  2. Language matters. Python and JavaScript see much higher success rates than C, R, or Rust.
  3. Consistency beats peak performance. Claude Opus's low variance may be more valuable than Gemini 3 Pro's higher but erratic scores, depending on use case.
  4. Reasoning effort requires tuning. More thinking tokens don't automatically improve results; sometimes the opposite is true.
  5. Prompt engineering remains essential. Significant effort went into achieving consistent, parseable outputs across models and languages.

The experiment also surfaced the fundamental tension in evaluating coding ability of LLMs: how much help is fair? Adjusting prompts for better consistency feels a bit like cheating, but leaving models to fail on formatting seems to miss the point. There's no clean answer.

Finally, always be skeptical and take this with a grain of salt: it was a fun, simple experiment, and no firm conclusions should be drawn from it. For the curious, the full code of the project is available at divin1/advent-of-slop.