So last month a new AI model dropped, and I did what I suspect you’d do too: I upgraded.
No thought. No ceremony. A little “✨ now smarter” banner showed up, I tapped it the way you’d grab the newest loupe at a congress, and I went straight back to what I was doing, turning a couple of clinical photos into a tidy patient report, the way half of us quietly do now.
Then, somewhere between the radiograph and the recall plan, a small voice piped up (not the patient’s): had anyone actually checked whether the new one was better? Or had I just assumed it, because newer is smarter, because that’s how phones and cars and electric toothbrushes work?
I’m a periodontist, not an AI engineer. But I’m also the guy who, a few weeks earlier, stood on a stage and watched a colleague lean over and tell me, almost proudly, that she runs her tricky cases past ChatGPT before she sees the patient. The same way millions of patients now ask it for medical advice. She trusted it completely. She had no idea where it fails. And if I’m honest, neither did I.
So I did the only thing a slightly obsessive clinician can do on a weekend. I built an exam.
The ten-second version
To find the best AI for dental, and medical, questions, I put 8 leading AI models through 30 guideline-anchored questions across 6 clinical domains. The field averaged 81.7%. The best model, GPT-5.2, scored 96.7%. The worst told a dentist to interrupt a patient’s blood thinner. And, the part that ruined my assumption about AI in dentistry, the newest model I tested scored lower than the one before it.
The exam
I wrote 30 questions, the kind I’d put to a young colleague sitting their boards, not trivia. Six domains: implants and peri-implantitis, the oral–systemic links, patient communication, perio diagnosis, perio treatment, and pharmacology. Every question anchored to a real guideline (EFP, AAP, the usual suspects), so “right” and “wrong” weren’t my mood that afternoon.
Then I handed all 240 answers to a judge, and here’s a detail I care about: the judge was Claude Opus 4.8. A model from a different family than the one that ended up winning. I didn’t want GPT grading GPT’s homework. That choice mattered more than I expected, but I’ll come back to it.
The leaderboard
30 guideline-anchored questions. Switch the judge and watch the crown move.
- #1 GPT-5.2: 96.7%
- #2 Claude Opus 4.8 (the judge): 93.3%
- #3 GPT-5.5: 93.3%
- #4 Gemini 3.1 Pro: 90.0%
- #5 Qwen3.7 Plus: 83.3%
- #6 Claude Fable 5 (currently offline): 80.0%
- #7 DeepSeek V3.2: 70.0%
- #8 Llama 4 Maverick: 46.7%
Under the primary judge, GPT-5.2 tops the table at 96.7%.
The ranking you came for
GPT-5.2 came out on top at 96.7%. The field averaged 81.7%. And at the bottom, Llama 4 Maverick, capable, popular, easy to reach for free, scored 46.7%.
Sit with that gap for a second. On the same 30 clinical questions, the spread between the best and the worst was fifty points. “An AI said so” is not a sentence with a fixed meaning.
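If you want to check the headline numbers yourself, here is a minimal sketch. The correct-answer counts are my back-calculation from the published percentages (for example, 96.7% of 30 questions is 29 correct), so treat them as an assumption rather than the raw dataset.

```python
# Correct-answer counts out of 30. ASSUMPTION: back-calculated from the
# published percentages, not read from the dataset itself.
QUESTIONS = 30
correct = {
    "GPT-5.2": 29,
    "Claude Opus 4.8": 28,
    "GPT-5.5": 28,
    "Gemini 3.1 Pro": 27,
    "Qwen3.7 Plus": 25,
    "Claude Fable 5": 24,
    "DeepSeek V3.2": 21,
    "Llama 4 Maverick": 14,
}

# Accuracy per model, as a percentage of the 30 questions.
accuracy = {model: 100 * n / QUESTIONS for model, n in correct.items()}

# Field average is the plain mean across the 8 models.
field_average = sum(accuracy.values()) / len(accuracy)

# Spread is best minus worst, in percentage points.
spread = max(accuracy.values()) - min(accuracy.values())

for model, pct in accuracy.items():
    print(f"{model:<18} {pct:5.1f}%")
print(f"Field average: {field_average:.1f}%")
print(f"Best-to-worst spread: {spread:.1f} points")
```

With those counts the mean lands on 81.7% and the spread on 50 points, which matches the figures quoted above.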
The plot twist (and my upgrade button)
Here’s the part that sent me back to that “✨ now smarter” banner. GPT-5.5 (the newer model, the sequel, the one with the bigger number) scored 93.3%. Lower than GPT-5.2’s 96.7%. On the questions that decide whether you reach for amoxicillin or add metronidazole, the newer model was a step backward.
It’s the oldest disappointment in the book. Every guitarist knows the difficult second album: bigger budget, more synths, somehow worse. Newer isn’t safer. It’s just newer.
So why did GPT-5.2 earn it?
Not by being brilliant everywhere. It won by not bleeding where it counts.
Two domains were graveyards. Pharmacology (antibiotics, anticoagulants, MRONJ risk) is where models go to embarrass themselves. (The bar isn’t trivia: modern guidance like the EFP’s S3 says systemic antibiotics are not a routine add-on to scaling and root planing, they’re reserved for specific cases, so a model that hands them out freely is failing stewardship, not just a quiz.) Llama 4 Maverick scored 0% there. Zero. Gemini and Qwen managed 60%. Perio diagnosis, the staging and grading you do every day, was the second trap; DeepSeek scored 40%.
GPT-5.2 went 100% in both. Three models kept every domain at 80% or above (GPT-5.2, Claude Opus 4.8, and GPT-5.5), but GPT-5.2 was the only one of them that also topped the table. Its lone slip was perio treatment (80%) on an EFP threshold detail, and almost everyone tripped on the same step, so I’ll let it slide.
Where the field bled, domain by domain
Two columns are graveyards. One row walked through both clean.
And it wasn’t slow about it.
Accurate and fast
Top-left is the sweet spot: high accuracy, low latency. Guess who's sitting there.
| Model | Accuracy (%) | Mean latency (seconds) |
|---|---|---|
| GPT-5.2 | 96.7 | 14.81 |
| Claude Opus 4.8 | 93.3 | 12.09 |
| GPT-5.5 | 93.3 | 20.07 |
| Gemini 3.1 Pro | 90.0 | 20.45 |
| Qwen3.7 Plus | 83.3 | 40.80 |
| Claude Fable 5 | 80.0 | 16.50 |
| DeepSeek V3.2 | 70.0 | 34.06 |
| Llama 4 Maverick | 46.7 | 25.70 |
GPT-5.2 was the most accurate model and the second-fastest, behind only Opus 4.8, the judge itself: the rare student who finishes the exam early and gets the A.
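One way to make "sweet spot" concrete is to ask which models are not beaten on both axes at once. The sketch below uses the accuracy and latency values from the table; the `dominates` helper and the idea of a Pareto-style filter are my own illustration, not part of the benchmark's scoring.

```python
# (accuracy %, mean latency in seconds), copied from the table above.
results = {
    "GPT-5.2": (96.7, 14.81),
    "Claude Opus 4.8": (93.3, 12.09),
    "GPT-5.5": (93.3, 20.07),
    "Gemini 3.1 Pro": (90.0, 20.45),
    "Qwen3.7 Plus": (83.3, 40.80),
    "Claude Fable 5": (80.0, 16.50),
    "DeepSeek V3.2": (70.0, 34.06),
    "Llama 4 Maverick": (46.7, 25.70),
}


def dominates(a, b):
    """True if a is at least as accurate AND at least as fast as b,
    and strictly better on at least one of the two."""
    acc_a, lat_a = a
    acc_b, lat_b = b
    no_worse = acc_a >= acc_b and lat_a <= lat_b
    strictly_better = acc_a > acc_b or lat_a < lat_b
    return no_worse and strictly_better


# Keep only the models that no other model dominates.
frontier = [
    name
    for name, point in results.items()
    if not any(dominates(other, point) for other in results.values())
]
print(frontier)
```

On these numbers only two models survive the filter: GPT-5.2 (most accurate) and Claude Opus 4.8 (fastest). Every other model is both slower and no more accurate than one of those two.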
Where it stops being a leaderboard
Here’s where this turns from a buyer’s guide into a patient-safety problem.
A model can be wrong. That’s fine, you’re wrong sometimes, I’m wrong sometimes. The danger is how these models are wrong. They don’t hedge. They don’t sweat. They hand you the wrong answer in the same calm, fluent, well-punctuated voice they use for the right one. You cannot hear the difference. That’s the whole thing. That’s what I want you carrying to work tomorrow.
Three from my own data. I dare you to catch them before the guideline does.
Would you have caught it?
Three real answers from the benchmark. Each one is wrong. None of them sounds it.
- How should you manage a routine extraction in a patient taking a direct oral anticoagulant (DOAC)?
  The AI answered: “Have the patient skip the morning dose before the appointment to reduce bleeding.”
- Grade this periodontitis case from the radiographic bone loss and the patient’s age.
  The AI answered: “Divided 0.5 by age instead of 50 by age, and called a fast-progressing Grade C case a stable Grade A.”
- Explain to a patient what periodontitis has done to the bone around their teeth.
  The AI answered: “Implied that the lost bone can heal and come back with better home care.”
None of these looked wrong. That’s the point. A fluent wrong answer is more dangerous than an honest “I don’t know,” because the “I don’t know” sends you to verify and the fluent one sends you home.
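The grading slip is worth seeing in numbers, because it is a unit error and nothing more. Here is a minimal sketch of the indirect grading ratio, percentage bone loss divided by age, using the cut-offs as I understand them from the 2018 classification (below 0.25 for Grade A, 0.25 to 1.0 for Grade B, above 1.0 for Grade C). The function name and the 40-year-old patient are hypothetical, and the sketch ignores grade modifiers such as smoking and diabetes.

```python
def grade_from_bone_loss(bone_loss_percent: float, age_years: float) -> str:
    """Indirect periodontitis grade from radiographic bone loss (%) / age.

    Cut-offs assumed from the 2018 staging and grading framework:
    ratio < 0.25 -> A, 0.25 to 1.0 -> B, > 1.0 -> C.
    Grade modifiers (smoking, diabetes) are NOT applied here.
    """
    if age_years <= 0:
        raise ValueError("age_years must be positive")
    ratio = bone_loss_percent / age_years
    if ratio < 0.25:
        return "A"
    if ratio <= 1.0:
        return "B"
    return "C"


# Hypothetical patient: 50% bone loss at the worst site, 40 years old.
print(grade_from_bone_loss(50, 40))   # bone loss entered as a percentage
print(grade_from_bone_loss(0.5, 40))  # the slip: entered as a fraction
```

Same patient, same radiograph. Enter the bone loss as 50 and the ratio is 1.25, which lands in Grade C. Enter it as 0.5 and the ratio is 0.0125, which lands in Grade A. The arithmetic is flawless both times, which is exactly why the wrong answer read so smoothly.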
How much should you trust me?
Now the part I wish more “we tested ChatGPT!” posts included: how much to trust the person running the test.
This is 30 questions per model. Thirty. It’s a directional read, not a Cochrane review.
And remember my judge, Claude Opus 4.8? Scroll back up and flip that toggle on the leaderboard. When I re-ran the grading with an independent GPT judge instead, every single score dropped: GPT-5.2 fell from 96.7% to 83.3%, and the crown actually moved, with Gemini 3.1 Pro slipping into first.
The two judges agreed about 82% of the time, a Cohen’s kappa of 0.506, which statisticians politely call “moderate” and the rest of us call two experienced clinicians who’d still argue over roughly one chart in five. And there were five answers where my own judge contradicted itself: it ticked every criterion as satisfied and still stamped the answer wrong. I could have quietly fixed those to tidy the numbers. I didn’t. They’re flagged in the dataset, because a benchmark I’d secretly edited would just be my opinion wearing a lab coat.
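For anyone who hasn't met Cohen's kappa: it is raw agreement corrected for the agreement two raters would reach by chance, kappa = (observed - expected) / (1 - expected). A minimal sketch follows. The ten verdicts are invented toy data to show the mechanics, not rows from my dataset, and `cohen_kappa` is a hand-rolled helper rather than any library's API.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# TOY DATA (hypothetical): 1 = judged correct, 0 = judged wrong.
judge_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
judge_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
print(f"kappa = {cohen_kappa(judge_a, judge_b):.3f}")
```

In the toy example the judges agree on 8 of 10 answers, yet kappa comes out near 0.52, because both judges mark most answers correct and so would agree often by luck alone. Running my published figures backwards (agreement of about 0.82, kappa of 0.506) implies chance agreement of roughly 0.64, which is why 82% sounds better than it is.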
So: directional, not gospel. GPT-5.2 earned my trust on this exam. It did not earn a blank check.
The one that complicates the hero story
One more, because it’s honest. One model, Claude Fable 5, simply refused five questions. The perio-and-diabetes, perio-and-pregnancy, perio-and-Alzheimer’s kind. My scoring counted those as wrong. But sitting here as a clinician, I’m not sure they were the worst answers in the room. A model that says “this is outside what I should answer confidently” is failing in the safe direction, the same direction you fail when you refer out a case that’s above your pay grade.
There’s a strange epilogue. As I write this, Fable 5 is offline, pulled, unavailable. The one model that erred toward caution is the one you can’t even use right now. Make of that what you will. I’m still chewing on it.
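How much the refusal rule matters is simple arithmetic. The sketch below assumes Fable 5's 80.0% means 24 of 30 correct and that all five refusals were among the answers marked wrong, as described above. The "answered-only" score is an alternative I'm showing for contrast, not a number the benchmark reports.

```python
QUESTIONS = 30
correct = 24  # ASSUMPTION: back-calculated from the published 80.0%
refused = 5   # refusals reported above, all scored as wrong

# Strict scoring, as used in the leaderboard: a refusal is a wrong answer.
strict_accuracy = 100 * correct / QUESTIONS

# Alternative for contrast: score only the questions the model attempted.
attempted = QUESTIONS - refused
answered_only_accuracy = 100 * correct / attempted
wrong_when_answering = attempted - correct

print(f"Strict:        {strict_accuracy:.1f}% ({correct}/{QUESTIONS})")
print(f"Answered only: {answered_only_accuracy:.1f}% ({correct}/{attempted})")
print(f"Wrong answers actually given: {wrong_when_answering}")
```

Under those assumptions, the same model reads as 80% or as 96% depending on whether a refusal counts as an error. Neither number is the true one. They answer different questions: how often does it help, versus how often does it mislead.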
What to actually do on Monday
Alright. You came for a verdict and a couple of rules you can use. Here they are.
- If you trust one, trust GPT-5.2, with your eyes open. It was the model that didn’t bleed in pharmacology or diagnosis and topped the table. That’s the one I’d let near a clinical question today.
- Stop assuming newer is better. The newest model I tested scored lower. When your app updates, your safety doesn’t automatically update with it. Make it re-earn the trust.
- Know the landmines. Pharmacology (especially anticoagulants and antibiotics), diagnostic math, and anything you’d repeat to a patient. That’s where fluent confidence hides the ugliest errors.
- Ask the one question, every time. Before you act on anything the screen tells you: “What’s the guideline source, and what changes if you’re wrong?” If it can’t name the source, then the source is you now. So verify.
AI isn’t coming for your judgment. But it is very willing to borrow it without asking. Don’t lend it out for free.
This is Episode 1
I’m calling this the Dental AI Trust Index. Every time a major model drops, and they drop constantly, I’ll re-run this exam and publish what changed, so you don’t have to take the upgrade banner’s word for it. Want the next report in your inbox instead of hoping an algorithm shows it to you?
Want to check my work first? The full interactive report and every one of the 240 graded answers live in the open benchmark report, with a citable dataset and DOI on Zenodo. Subscribing just means the next episode lands in your inbox.
Now go ask your favorite model a staging question. Then check its work. I’ll wait.
Tuminha
Related reading
- The 2018 AAP/EFP Classification: Staging and Grading, the framework the grading-math card got wrong
- Zero Bone Loss Concepts by Prof. Tomas Linkevicius, why “it’ll grow back” is the wrong thing to tell a patient
- Machine Learning for Dentists, how these models actually work, no PhD required

