The Real Story Behind OpenEvidence’s “100% on USMLE”

Supreet Deshpande

Exams test knowledge on paper, not performance in practice. Healthcare demands reproducibility, safety, and responsibility.

This past weekend, our team at SynthioLabs decided to do a little experiment. News had broken that OpenEvidence had achieved something extraordinary: their AI system had reportedly scored one hundred percent on the United States Medical Licensing Exam (USMLE). For anyone working at the intersection of medicine and artificial intelligence, that is a headline designed to grab attention. It certainly grabbed ours.

Naturally, curiosity got the better of us. We asked ourselves a simple question: could we replicate this? Could our own systems hit the same benchmark? So we set up the test. What followed was more than fifty runs, more than a thousand dollars’ worth of compute, and more than a few cups of coffee fueling the process. Eventually, after a lot of trial and error, we managed to get one run that reached the fabled score of one hundred percent.
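The shape of that test was nothing exotic: run the full question set many times and look at the spread of scores, not just the best one. The sketch below is not our actual harness; it simulates a model with an assumed per-question accuracy so the reproducibility point is easy to see, and every number in it (97 percent per question, 100 questions, 50 runs) is an illustrative assumption, not a measurement of any real system.

```python
import random

# A minimal sketch, not our production harness. The "model" is simulated as
# answering each question correctly with an assumed probability; all numbers
# below are illustrative assumptions only.
P_CORRECT = 0.97        # assumed per-question accuracy
NUM_QUESTIONS = 100     # assumed size of the question set
NUM_RUNS = 50           # roughly the number of full runs we made

def simulated_answer_is_correct() -> bool:
    """Stand-in for asking the model a single USMLE-style question."""
    return random.random() < P_CORRECT

def run_once() -> float:
    """Score one full pass over the question set."""
    correct = sum(simulated_answer_is_correct() for _ in range(NUM_QUESTIONS))
    return correct / NUM_QUESTIONS

scores = [run_once() for _ in range(NUM_RUNS)]

print("best run:    ", max(scores))
print("worst run:   ", min(scores))
print("mean score:  ", round(sum(scores) / len(scores), 3))
print("perfect runs:", sum(s == 1.0 for s in scores), "of", NUM_RUNS)
```

Run it a few times and the best run drifts well above the mean, which is exactly why reporting only the best run tells you very little about how the system behaves on an average day.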

At first glance, that might sound like victory. But for us, the feeling was less triumph than skepticism. We knew immediately that the result was not reproducible. One of our engineers even remarked, “I bet they can’t reproduce the 100 percent themselves.” That comment stuck with me.

To put the theory to the test, we went straight to OpenEvidence and asked it one of the very same USMLE questions. This time the outcome was very different. The system got it wrong. And in that moment the shine of the perfect score dimmed. Because if a system can only sometimes hit perfection, what does the claim of perfection really mean?

This is not just a question of academic interest. In healthcare, words carry enormous weight. If a physician sees “100 percent on USMLE,” it can create the illusion of reliability and mastery. The risk is that clinicians begin to trust the system blindly, assuming that an AI which can ace an exam is ready to guide diagnosis or treatment. That is a dangerous leap. Lives are not safeguarded by headlines.

The truth is that exams like the USMLE are designed for people, not for machines. They test knowledge in a certain structured way, but they are not designed to evaluate how a system will perform in the messy, ambiguous, and high-stakes environment of real-world medicine. A one-time perfect score, achieved after dozens of tries, is more parlor trick than clinical breakthrough.
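A little arithmetic shows why. If a system answers each question independently with probability p, a single flawless pass over n questions happens with probability p to the power n, and across k attempts the chance of seeing at least one flawless pass is 1 minus (1 minus p^n) to the power k. The snippet below plugs in assumed, illustrative numbers; they are not figures from our runs or from OpenEvidence.

```python
# Back-of-the-envelope check with assumed, illustrative numbers.
# If a model answers each question independently with probability p, the chance
# of a perfect run over n questions is p**n, and the chance of at least one
# perfect run across k independent attempts is 1 - (1 - p**n) ** k.
p, n, k = 0.97, 100, 50

perfect_one_run = p ** n                              # ~0.047 for these numbers
at_least_one_perfect = 1 - (1 - perfect_one_run) ** k

print(f"P(perfect single run)           = {perfect_one_run:.3f}")
print(f"P(>=1 perfect run in {k} tries) = {at_least_one_perfect:.3f}")
```

Under those assumptions, a model that misses roughly three questions in every hundred still has better than a nine-in-ten chance of producing at least one perfect run across fifty attempts. That is the gap between a best-case headline and expected performance.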

At SynthioLabs, we have taken a different view from the very beginning. For us, the measure of success is not whether our system can ace an exam. It is whether clinicians, patients, and medical affairs teams can trust our system to provide reliable support, every single time they turn to it. That trust is built on three pillars: safety, credibility, and usability. Every answer we deliver is grounded in validated, regulatory-approved sources. Every interaction is designed to support medical information needs without crossing into diagnosis. And above all, every part of our architecture is built with compliance in mind, because responsibility matters more than buzz.

None of this is meant to diminish the impressive work being done by the team at OpenEvidence. We respect their ambition and their achievements. But in a field as sensitive as medicine, it is important to separate marketing headlines from clinical reality. If a claim cannot be reproduced, it should not be the basis on which trust is built.

The more urgent questions for our field are these: can an AI system consistently reproduce its performance, or is it a matter of chance? Can it explain its reasoning in a way that a clinician finds transparent and trustworthy? Can it handle the unstructured, messy data that defines real practice rather than test preparation? And above all, can it operate safely within the guardrails of medical regulation?

That is the exam that matters. And it is far harder than the USMLE.

When we finally got our “perfect score,” it was a reminder of how tempting shortcuts can be in this industry. But medicine does not reward shortcuts. It demands rigor, humility, and responsibility. At SynthioLabs, that is the standard we hold ourselves to, and it is the standard by which all of us building AI for healthcare should be judged.

Because at the end of the day, one hundred percent on the USMLE is a neat trick. But safe, credible, usable AI in the hands of healthcare professionals—that is the real breakthrough.