Digital Dentistry

I Tested 8 AIs on Real Dental Questions. Only One Earned My Trust.

Francisco Teixeira Barbosa, Founder & Editor
Jun 14, 2026 · 9 min read
The Dental AI Trust Index: a clinician reviews an AI model leaderboard and a dental radiograph on screen

So last month a new AI model dropped, and I did what I suspect you’d do too: I upgraded.

No thought. No ceremony. A little “✨ now smarter” banner showed up, I tapped it the way you’d grab the newest loupe at a congress, and I went straight back to what I was doing, turning a couple of clinical photos into a tidy patient report, the way half of us quietly do now.

Then, somewhere between the radiograph and the recall plan, a small voice piped up (not the patient’s): had anyone actually checked whether the new one was better? Or had I just assumed it, because newer is smarter, because that’s how phones and cars and electric toothbrushes work?

I’m a periodontist, not an AI engineer. But I’m also the guy who, a few weeks earlier, stood on a stage and watched a colleague lean over and tell me, almost proudly, that she runs her tricky cases past ChatGPT before she sees the patient. The same way millions of patients now ask it for medical advice. She trusted it completely. She had no idea where it fails. And if I’m honest, neither did I.

So I did the only thing a slightly obsessive clinician can do on a weekend. I built an exam.

The ten-second version

To find the best AI for dental, and medical, questions, I put 8 leading AI models through 30 guideline-anchored questions across 6 clinical domains. The field averaged 81.7%. The best model, GPT-5.2, scored 96.7%. The worst told a dentist to interrupt a patient’s blood thinner. And, the part that ruined my assumption about AI in dentistry, the newest model I tested scored lower than the one before it.

The exam

I wrote 30 questions, the kind I’d put to a young colleague sitting their boards, not trivia. Six domains: implants and peri-implantitis, the oral–systemic links, patient communication, perio diagnosis, perio treatment, and pharmacology. Every question anchored to a real guideline (EFP, AAP, the usual suspects), so “right” and “wrong” weren’t my mood that afternoon.

Then I handed all 240 answers to a judge, and here’s a detail I care about: the judge was Claude Opus 4.8. A model from a different family than the one that ended up winning. I didn’t want GPT grading GPT’s homework. That choice mattered more than I expected, but I’ll come back to it.

The leaderboard

30 guideline-anchored questions. Switch the judge and watch the crown move.

  1. GPT-5.2: 96.7%
  2. Claude Opus 4.8 (the judge): 93.3%
  3. GPT-5.5: 93.3%
  4. Gemini 3.1 Pro: 90.0%
  5. Qwen3.7 Plus: 83.3%
  6. Claude Fable 5 (offline): 80.0%
  7. DeepSeek V3.2: 70.0%
  8. Llama 4 Maverick: 46.7%

Grading judge: Claude Opus 4.8 · Primary judge (rival family) · Field average 81.7%

Under the primary judge, GPT-5.2 tops the table at 96.7%.

The ranking you came for

GPT-5.2 came out on top at 96.7%. The field averaged 81.7%. And at the bottom, Llama 4 Maverick, capable, popular, easy to reach for free, scored 46.7%.

Sit with that gap for a second. On the same 30 clinical questions, the spread between the best and the worst was fifty points. “An AI said so” is not a sentence with a fixed meaning.
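If you want to check the arithmetic behind the field average and that fifty-point gap, here is a minimal sketch using only the eight scores from the leaderboard above. The variable names are mine, chosen for illustration.

```python
# Leaderboard scores as published above (percent correct, primary judge).
scores = {
    "GPT-5.2": 96.7,
    "Claude Opus 4.8": 93.3,
    "GPT-5.5": 93.3,
    "Gemini 3.1 Pro": 90.0,
    "Qwen3.7 Plus": 83.3,
    "Claude Fable 5": 80.0,
    "DeepSeek V3.2": 70.0,
    "Llama 4 Maverick": 46.7,
}

# Unweighted mean across the eight models.
field_average = sum(scores.values()) / len(scores)

# Best and worst model by score, and the gap between them in percentage points.
best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)
spread = scores[best] - scores[worst]

print(f"Field average: {field_average:.1f}%")
print(f"Best: {best}, worst: {worst}, spread: {spread:.1f} points")
```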

The plot twist (and my upgrade button)

Here’s the part that sent me back to that “✨ now smarter” banner. GPT-5.5 (the newer model, the sequel, the one with the bigger number) scored 93.3%. Lower than GPT-5.2’s 96.7%. On the questions that decide whether you reach for amoxicillin or add metronidazole, the newer model was a step backward.

It’s the oldest disappointment in the book. Every guitarist knows the difficult second album: bigger budget, more synths, somehow worse. Newer isn’t safer. It’s just newer.

So why did GPT-5.2 earn it?

Not by being brilliant everywhere. It won by not bleeding where it counts.

Two domains were graveyards. Pharmacology (antibiotics, anticoagulants, MRONJ risk) is where models go to embarrass themselves. (The bar isn’t trivia: modern guidance like the EFP’s S3 guideline says systemic antibiotics are not a routine add-on to scaling and root planing; they’re reserved for specific cases, so a model that hands them out freely is failing stewardship, not just a quiz.) Llama 4 Maverick scored 0% there. Zero. Gemini and Qwen managed 60%. Perio diagnosis, the staging and grading you do every day, was the second trap; DeepSeek scored 40%.

GPT-5.2 went 100% in both. Three models kept every domain at 80% or above (GPT-5.2, Claude Opus 4.8, and GPT-5.5), but GPT-5.2 was the only one of them that also topped the table. Its lone slip was perio treatment (80%) on an EFP threshold detail, and almost everyone tripped on the same step, so I’ll let it slide.

Where the field bled, domain by domain

Two columns are graveyards. One row walked through both clean.

Heatmap: accuracy (0% to 100%) for each model in each domain.
Rows (models): GPT-5.2, Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Qwen3.7 Plus, Claude Fable 5, DeepSeek V3.2, Llama 4 Maverick.
Columns (domains): Pharma, Diagnosis, Implants, Treatment, Systemic, Comms.

GPT-5.2 went 100% in both graveyards: pharmacology (Llama scored 0%) and perio diagnosis (DeepSeek 40%).

And it wasn’t slow about it.

Accurate and fast

Top-left is the sweet spot: high accuracy, low latency. Guess who's sitting there.

Model accuracy versus mean latency

Model      Accuracy (%)   Mean latency (seconds)
GPT-5.2    96.7           14.81
Opus 4.8   93.3           12.09
GPT-5.5    93.3           20.07
Gemini     90             20.45
Qwen3.7    83.3           40.80
Fable 5    80             16.50
DeepSeek   70             34.06
Llama 4    46.7           25.70

GPT-5.2 was the most accurate model and the second-fastest, behind only Opus 4.8, the judge itself.

On speed it came second only to the judge itself, while sitting alone at the top for accuracy, the rare student who finishes the exam early and gets the A.
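One way to make "the sweet spot" concrete is to ask which models are not beaten on both axes at once by some other model. That is a Pareto-front check, which is my framing rather than anything the benchmark itself computes. The sketch below uses only the accuracy and latency figures from the table above; `dominates` and `pareto_front` are illustrative names.

```python
# (accuracy %, mean latency in seconds) copied from the table above.
results = {
    "GPT-5.2": (96.7, 14.81),
    "Opus 4.8": (93.3, 12.09),
    "GPT-5.5": (93.3, 20.07),
    "Gemini": (90.0, 20.45),
    "Qwen3.7": (83.3, 40.80),
    "Fable 5": (80.0, 16.50),
    "DeepSeek": (70.0, 34.06),
    "Llama 4": (46.7, 25.70),
}


def dominates(a, b):
    """True if a is at least as accurate and as fast as b, and strictly better on one."""
    acc_a, lat_a = a
    acc_b, lat_b = b
    return acc_a >= acc_b and lat_a <= lat_b and (acc_a > acc_b or lat_a < lat_b)


# Keep every model that no other model dominates.
pareto_front = [
    name
    for name, point in results.items()
    if not any(dominates(other, point) for other in results.values())
]
print(pareto_front)
```

On these numbers only two models survive the check: the most accurate one and the fastest one. Every other model is both slower and less accurate than GPT-5.2.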

Where it stops being a leaderboard

Here’s where this turns from a buyer’s guide into a patient-safety problem.

A model can be wrong. That’s fine, you’re wrong sometimes, I’m wrong sometimes. The danger is how these models are wrong. They don’t hedge. They don’t sweat. They hand you the wrong answer in the same calm, fluent, well-punctuated voice they use for the right one. You cannot hear the difference. That’s the whole thing. That’s what I want you carrying to work tomorrow.

Three from my own data. I dare you to catch them before the guideline does.

Would you have caught it?

Three real answers from the benchmark. Each one is wrong. None of them sounds it.

Case 1 · The blood thinner (Llama 4 Maverick, Qwen3.7 Plus)

How should you manage a routine extraction in a patient taking a direct oral anticoagulant (DOAC)?

The AI answered

Have the patient skip the morning dose before the appointment to reduce bleeding.

Case 2 · The grading math (DeepSeek V3.2)

Grade this periodontitis case from the radiographic bone loss and the patient’s age.

The AI answered

Divided 0.5 by age instead of 50 by age, and called a fast-progressing Grade C case a stable Grade A.

Case 3 · It’ll grow back (DeepSeek V3.2, GPT-5.5, Llama 4 Maverick)

Explain to a patient what periodontitis has done to the bone around their teeth.

The AI answered

Implied that the lost bone can heal and come back with better home care.

None of these looked wrong. That’s the point. A fluent wrong answer is more dangerous than an honest “I don’t know,” because the “I don’t know” sends you to verify and the fluent one sends you home.
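The grading slip in Case 2 is easy to reproduce. Below is a minimal sketch of the bone-loss-to-age ratio. It assumes the commonly cited cut-offs (below 0.25 for Grade A, 0.25 to 1.0 for Grade B, above 1.0 for Grade C) and a made-up patient age of 40, since the benchmark case's actual age is not given here. The function name is mine, and the thresholds should be confirmed against the guideline before anyone leans on them.

```python
def grade_from_bone_loss(bone_loss_percent: float, age_years: float) -> str:
    """Grade from the ratio of % radiographic bone loss to age (indirect evidence).

    Thresholds are an assumption here (below 0.25 = A, 0.25 to 1.0 = B,
    above 1.0 = C); confirm them against the guideline before relying on them.
    """
    if age_years <= 0:
        raise ValueError("age_years must be positive")
    ratio = bone_loss_percent / age_years
    if ratio < 0.25:
        return "A"
    if ratio <= 1.0:
        return "B"
    return "C"


age = 40  # hypothetical patient age, not taken from the benchmark

correct = grade_from_bone_loss(50, age)   # bone loss entered as a percentage
slipped = grade_from_bone_loss(0.5, age)  # same loss entered as a fraction

print(correct, slipped)
```

The same patient lands in Grade C or Grade A depending only on whether the bone loss goes in as 50 or as 0.5. A units slip like that leaves nothing in the fluent answer to flag it.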

How much should you trust me?

Now the part I wish more “we tested ChatGPT!” posts included: how much to trust the person running the test.

This is 30 questions per model. Thirty. It’s a directional read, not a Cochrane review.
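To put a rough number on "directional", here is a sketch of a Wilson score interval around a score out of 30. It assumes each answer was graded simply right or wrong, so that 96.7% is 29 of 30 and 93.3% is 28 of 30. The interval method is my choice for illustration, not part of the benchmark.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# 96.7% and 93.3% of 30 questions correspond to 29 and 28 correct answers.
low_52, high_52 = wilson_interval(29, 30)
low_55, high_55 = wilson_interval(28, 30)

print(f"GPT-5.2: {low_52:.1%} to {high_52:.1%}")
print(f"GPT-5.5: {low_55:.1%} to {high_55:.1%}")
print("Intervals overlap:", low_52 <= high_55 and low_55 <= high_52)
```

With only 30 questions the two intervals overlap heavily, which is why a one-question gap between models is a hint rather than a verdict.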

And remember my judge, Claude Opus 4.8? Scroll back up and flip that toggle on the leaderboard. When I re-ran the grading with an independent GPT judge instead, every single score dropped: GPT-5.2 fell from 96.7% to 83.3%, and the crown actually moved, with Gemini 3.1 Pro slipping into first.

The two judges agreed about 82% of the time, with a Cohen’s kappa of 0.506, which statisticians politely call “moderate” and the rest of us call two experienced clinicians who’d still argue over nearly one chart in five. And there were five answers where my own judge contradicted itself: it ticked every criterion as satisfied and still stamped the answer wrong. I could have quietly fixed those to tidy the numbers. I didn’t; they’re flagged in the dataset, because a benchmark I’d secretly edited would just be my opinion wearing a lab coat.
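For readers wondering how 82% agreement can shrink to a kappa near 0.5: kappa subtracts the agreement two judges would reach by chance. When both judges mark most answers correct, chance agreement is already high. Working backwards from the two published figures, it would be roughly 64%. The sketch below shows the calculation on ten made-up verdicts, not the benchmark's data, and the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance if each rater kept their own label rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts for ten answers (1 = correct, 0 = wrong), for illustration only.
judge_one = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
judge_two = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]

kappa = cohens_kappa(judge_one, judge_two)
print(f"kappa = {kappa:.3f}")
```

In this toy set the judges agree on 8 of 10 answers, yet kappa comes out well below 0.8, because two graders who both pass most answers would agree often by luck alone.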

So: directional, not gospel. GPT-5.2 earned my trust on this exam. It did not earn a blank check.

The one that complicates the hero story

One more, because it’s honest. One model, Claude Fable 5, simply refused five questions. The perio-and-diabetes, perio-and-pregnancy, perio-and-Alzheimer’s kind. My scoring counted those as wrong. But sitting here as a clinician, I’m not sure they were the worst answers in the room. A model that says “this is outside what I should answer confidently” is failing in the safe direction, the same direction you fail when you refer out a case that’s above your pay grade.

There’s a strange epilogue. As I write this, Fable 5 is offline, pulled, unavailable. The one model that erred toward caution is the one you can’t even use right now. Make of that what you will. I’m still chewing on it.

What to actually do on Monday

Alright. You came for a verdict and a couple of rules you can use. Here they are.

  1. If you trust one, trust GPT-5.2, with your eyes open. It was the model that didn’t bleed in pharmacology or diagnosis and topped the table. That’s the one I’d let near a clinical question today.
  2. Stop assuming newer is better. The newest model I tested scored lower. When your app updates, your safety doesn’t automatically update with it. Make it re-earn the trust.
  3. Know the landmines. Pharmacology (especially anticoagulants and antibiotics), diagnostic math, and anything you’d repeat to a patient. That’s where fluent confidence hides the ugliest errors.
  4. Ask the one question, every time. Before you act on anything the screen tells you: “What’s the guideline source, and what changes if you’re wrong?” If it can’t name the source, then the source is you now. So verify.

AI isn’t coming for your judgment. But it is very willing to borrow it without asking. Don’t lend it out for free.

This is Episode 1

I’m calling this the Dental AI Trust Index. Every time a major model drops, and they drop constantly, I’ll re-run this exam and publish what changed, so you don’t have to take the upgrade banner’s word for it. Want the next report in your inbox instead of hoping an algorithm shows it to you?

Want to check my work first? The full interactive report and every one of the 240 graded answers live in the open benchmark report, with a citable dataset and DOI on Zenodo. Subscribing just means the next episode lands in your inbox.

Now go ask your favorite model a staging question. Then check its work. I’ll wait.

Tuminha

Frequently asked questions

The questions dentists actually ask about trusting AI with clinical work.

In this benchmark, GPT-5.2 scored highest, 96.7% across 30 guideline-anchored dental questions, and was one of only three models with no clinical domain below 80% (the only one of them that also topped the table). But the ranking shifted when the grading model changed, so treat it as the strongest current option, not a guarantee.

The top GPT model answered 96.7% of guideline-anchored dental questions correctly here, but the same family’s newer version scored 93.3%, and other models fell to 46.7%. Accuracy depends heavily on which model and which domain, and the most dangerous individual answers came in pharmacology and diagnosis.

No. In this test the newer GPT-5.5 (93.3%) scored lower than the older GPT-5.2 (96.7%). A bigger version number did not mean safer clinical answers.

The most dangerous individual answers came in pharmacology (a recommendation to interrupt a patient’s blood thinner for a routine extraction) and periodontal diagnosis (a grading error that misclassified a Grade C case as Grade A). Pharmacology also produced the single lowest score, with one model managing 0%.

As a verified second opinion, increasingly, but never as the final word. The danger isn’t that AI is wrong; it’s that it’s wrong fluently. Always confirm clinical recommendations against the actual guideline before you act.

Francisco Teixeira Barbosa

Founder & Editor

Implant & Digital Dentistry specialist. Periospot founder and managing editor. Executive Director at FOR.