AI Science 2026-05-27 3 min read

BioMysteryBench Is the Kind of Benchmark That Makes AI Doing Real Science Feel Less Like a Fantasy and More Like a Problem Set

Anthropic says BioMysteryBench contains 99 expert-written bioinformatics questions, with 76 judged human-solvable. It reports that Claude Mythos Preview reached a 30% solve rate on human-difficult tasks, which makes the “AI can’t handle messy science” line look shakier.

The headline is sharp because the result is sharp: the moment frontier models start solving meaningful fractions of messy bioinformatics tasks built from real data, the old “AI can only ace clean toy benchmarks” argument starts losing some of its comfort.

Anthropic’s April 29, 2026 post on BioMysteryBench deserves attention because it does something the AI world badly needs: it moves beyond polished exam-style tasks and into domain work that looks more like real scientific problem-solving.

Anthropic reports three headline figures for BioMysteryBench:

99 expert-written bioinformatics questions
76 tasks judged to be human-solvable
a 30% solve rate on human-difficult problems for Claude Mythos Preview

Those numbers are not proof that AI has “solved science.” They are more useful than that. They show that frontier models are becoming hard to dismiss in at least one serious scientific workflow.

Why this benchmark is more interesting than most

A lot of AI benchmarks are clean, tightly framed, and easier to score than real professional work. That is useful, but it creates a dangerous habit: people start confusing benchmark comfort with real-world readiness.

BioMysteryBench pushes against that by grounding each problem in messy bioinformatics data and in objective properties of that data. The model therefore has to do more than regurgitate familiar facts. It has to navigate the kind of ambiguity scientists actually live with.

That is a much tougher test.

The 99 and 76 numbers matter because they show restraint

One reason this benchmark is credible is that the authors did not pretend every question was automatically fair. Anthropic says experts built 99 questions, but only 76 ended up in the set considered human-solvable.

That matters because benchmark discipline is what keeps a result from turning into marketing wallpaper.

It tells readers:

the dataset was curated with care
human solvability was tested
the benchmark is trying to measure capability, not manufacture hype

Ironically, that makes the hype around the result more justified.

The 30% number is where the anxiety starts

On human-difficult tasks, Claude Mythos Preview reportedly reached a 30% solve rate.

Some people will say, “Only 30%?” That is exactly the wrong instinct.

In scientific work, the right question is not whether AI is perfect. It is whether AI is crossing the line from useless curiosity to serious collaborator.

At 30% on hard tasks built from real, messy bioinformatics data, the answer is increasingly yes.

That does not replace scientists. It changes the baseline expectation for what frontier models can contribute.
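
To put the reported figures side by side, here is a minimal Python sketch of the arithmetic. It assumes, and the article does not state this, that the human-difficult set is simply the 99 questions minus the 76 judged human-solvable. The variable names are illustrative and nothing is read from Anthropic's data.

```python
# Figures as quoted in this article; not loaded from any benchmark file.
TOTAL_QUESTIONS = 99
HUMAN_SOLVABLE = 76
REPORTED_HARD_SOLVE_RATE = 0.30

# ASSUMPTION: human-difficult tasks are the remainder of the 99 questions.
human_difficult = TOTAL_QUESTIONS - HUMAN_SOLVABLE

# Share of the full set that was judged human-solvable.
share_human_solvable = HUMAN_SOLVABLE / TOTAL_QUESTIONS

# Rough implied number of hard tasks solved. A reported rate may be
# averaged over repeated runs, so this need not be a whole number.
implied_hard_solves = REPORTED_HARD_SOLVE_RATE * human_difficult

print(f"Human-difficult tasks (assumed): {human_difficult}")
print(f"Human-solvable share: {share_human_solvable:.1%}")
print(f"Implied hard-task solves: {implied_hard_solves:.1f}")
```

Under that assumption the hard set has 23 questions, so a 30% solve rate corresponds to roughly seven tasks. That is a small absolute count, which is a reason to read the percentage as a signal of direction and not as a precise measurement.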

Why this kind of result has strong click power

Readers are naturally drawn to stories where AI crosses into prestigious, difficult human territory. Science is one of the strongest versions of that theme. But to earn trust, the story needs more than swagger.

BioMysteryBench works because it provides:

a real dataset size
a filtered human-solvable subset
a model performance number that is impressive without sounding fake

That is enough to support a bolder narrative without lying to the audience.

The blunt takeaway

BioMysteryBench is the kind of benchmark that makes “AI in science” feel much more concrete. The reported figures (99 questions, 76 human-solvable tasks, and a 30% solve rate by Claude Mythos Preview on human-difficult problems) do not mean machines are replacing scientists. They do mean the floor for scientific AI capability is rising in a way that should make a lot of skeptics update their script. The fantasy phase is ending. The messy-evidence phase has arrived.

Sources

Anthropic: Evaluating Claude for bioinformatics with BioMysteryBench
