BioMysteryBench Is the Kind of Benchmark That Makes AI Doing Real Science Feel Less Like a Fantasy and More Like a Problem Set
Anthropic says BioMysteryBench contains 99 expert-written bioinformatics questions, with 76 judged human-solvable. It reports that Claude Mythos Preview reached a 30% solve rate on human-difficult tasks, which makes the “AI can’t handle messy science” line look shakier.
The headline is sharp because the result is sharp: the moment frontier models start solving meaningful fractions of messy bioinformatics tasks built from real data, the old “AI can only ace clean toy benchmarks” argument starts losing some of its comfort.
Anthropic’s April 29, 2026, post on BioMysteryBench deserves attention because it does something the AI world badly needs: it moves beyond polished exam-style tasks and into domain work that looks more like real scientific problem-solving.
Anthropic says BioMysteryBench contains:
- 99 expert-written bioinformatics questions
- 76 tasks judged to be human-solvable
- a 30% solve rate on human-difficult problems for Claude Mythos Preview
Those numbers are not proof that AI has “solved science.” They are more useful than that. They show that frontier models are becoming hard to dismiss in at least one serious scientific workflow.
Why this benchmark is more interesting than most
A lot of AI benchmarks are clean, tightly framed, and easier to score than real professional work. That is useful, but it creates a dangerous habit: people start confusing benchmark comfort with real-world readiness.
BioMysteryBench pushes against that by grounding its problems in messy bioinformatics data and in objective properties of that data. That means the model has to do more than regurgitate familiar facts. It has to navigate the kind of ambiguity scientists actually live with.
That is a much tougher test.
The 99 and 76 numbers matter because they show restraint
One reason this benchmark is credible is that the authors did not pretend every question was automatically fair. Anthropic says experts wrote 99 questions, but only 76 were judged human-solvable.
That matters because benchmark discipline is what keeps a result from turning into marketing wallpaper.
It tells readers:
- the dataset was curated with care
- human solvability was tested
- the benchmark is trying to measure capability, not manufacture hype
Ironically, that makes the hype around the result more justified.
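For readers who want the proportions behind those two counts, here is a minimal sketch of the arithmetic. Only the figures 99 and 76 come from Anthropic's reported numbers; the helper name `subset_share` and the variable names are illustrative. The size of the human-difficult subset is not given here, so the 30% solve rate is deliberately not converted into a task count.

```python
# Counts as reported in the article; everything else is derived arithmetic.
TOTAL_QUESTIONS = 99
HUMAN_SOLVABLE = 76


def subset_share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Questions that did not land in the human-solvable set.
not_in_solvable_set = TOTAL_QUESTIONS - HUMAN_SOLVABLE

solvable_share = subset_share(HUMAN_SOLVABLE, TOTAL_QUESTIONS)
remaining_share = subset_share(not_in_solvable_set, TOTAL_QUESTIONS)

print(f"{HUMAN_SOLVABLE} of {TOTAL_QUESTIONS} judged human-solvable: {solvable_share}%")
print(f"{not_in_solvable_set} outside that set: {remaining_share}%")
```

Put plainly, roughly a quarter of the expert-written questions did not make it into the human-solvable set, which is the restraint this section is pointing at.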
The 30% number is where the anxiety starts
On human-difficult tasks, Claude Mythos Preview reportedly reached a 30% solve rate.
Some people will say, “Only 30%?” That is exactly the wrong instinct.
In scientific work, the right question is not whether AI is perfect. It is whether AI is crossing the line from useless curiosity to serious collaborator.
At 30% on hard tasks built from real bioinformatics data, the answer is increasingly yes.
That does not remove scientists. It changes the baseline expectation for what frontier models can contribute.
Why this kind of result has strong click power
Readers are naturally drawn to stories where AI crosses into prestigious, difficult human territory. Science is one of the strongest versions of that theme. But to earn trust, the story needs more than swagger.
BioMysteryBench works because it provides:
- a real dataset size
- a filtered human-solvable subset
- a model performance number that is impressive without sounding fake
That is enough to support a bolder narrative without lying to the audience.
The blunt takeaway
BioMysteryBench is the kind of benchmark that makes “AI in science” feel much more concrete. Ninety-nine questions, 76 human-solvable tasks, and a 30% solve rate by Claude Mythos Preview on human-difficult problems do not mean machines are replacing scientists. They do mean the floor for scientific AI capability is rising in a way that should make a lot of skeptics update their script. The fantasy phase is ending. The messy-evidence phase has arrived.