A newsletter on the latest in AI for healthcare.

Welcome back,

A new clinical large language model study lands a blunt warning: safety does not scale neatly with accuracy. Clean, clinician-written evidence did more than bigger models or extra compute to cut risky answers.

Tempus made the ArteraAI Prostate Test clinically available for patients with metastatic hormone-sensitive prostate cancer, bringing prostate digital pathology into its clinical ecosystem.

Also inside: the MONAI repo for medical imaging AI, plus health AI and healthcare data moves from Kordata Dynamics, Century Health, and Qualtrics.

Here is what you need to know:

SUMMARY

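Top Research Paper

  • Safety and accuracy follow different scaling laws in clinical large language models.

Top AI News

  • Tempus brings prostate digital pathology into its clinical ecosystem.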

Top Model

  • MONAI is a PyTorch-based open-source toolkit for medical imaging AI, with ready-made components for building, training, validating, and sharing imaging models more easily.

Bedside Bets

Startup rounds, deals, and moves.

Pulse Check

Quick reads across health AI.

TOP PAPER

⚖️ Safety and accuracy follow different scaling laws in clinical large language models

Source: arXiv · 5 May 2026

The study challenges the common assumption that scaling clinical large language models for higher accuracy automatically improves safety. It introduces a dedicated evaluation framework to measure how safety metrics behave across practical deployment variables in radiology question-answering tasks.

Question

  • How do accuracy and clinically relevant safety metrics, including high-risk errors, evidence contradictions, and dangerous overconfidence, scale with model size, evidence quality, retrieval strategy, context length, and inference-time compute in clinical large language models?

Approach

  • Developed the SaFE-Scale evaluation framework and RadSaFE-200, a benchmark of 200 multiple-choice radiology questions with clinician-curated clean and conflict evidence plus option-level safety labels.

  • Evaluated 34 locally deployed large language models across Qwen, Llama, Gemma/MedGemma, DeepSeek, Mistral, and OpenAI-OSS families.

  • Tested six deployment conditions: closed-book, clean evidence, conflict evidence, standard retrieval-augmented generation, agentic retrieval-augmented generation, and max-context prompting.

  • Secondary tests included self-consistency and fixed three-model ensembles. Outcomes tracked accuracy, high-risk error rate, unsafe answers, contradictions, and dangerous overconfidence.

Source: Wind, Nguyen et al
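
For builders who want to track outcomes like these in their own evaluations, here is a minimal Python sketch of scoring accuracy and safety as separate rates. It is not the paper's code: the `Response` schema, the 0.8 confidence threshold, and the definition of dangerous overconfidence (a high-risk pick made with high stated confidence) are all assumptions made for illustration, and the data is invented.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    chosen: str                 # option the model picked
    correct: str                # gold answer
    high_risk: frozenset        # options labelled high-risk by clinicians
    contradicts_evidence: bool  # answer conflicts with the supplied evidence
    confidence: float           # stated confidence in [0, 1]


def safety_report(responses, confidence_threshold=0.8):
    """Compute accuracy and safety rates as separate fractions of all answers."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")
    picked_high_risk = [r.chosen in r.high_risk for r in responses]
    return {
        "accuracy": sum(r.chosen == r.correct for r in responses) / n,
        "high_risk_error": sum(picked_high_risk) / n,
        "contradiction": sum(r.contradicts_evidence for r in responses) / n,
        # Assumed definition: a high-risk pick made with high stated confidence.
        "dangerous_overconfidence": sum(
            risky and r.confidence >= confidence_threshold
            for risky, r in zip(picked_high_risk, responses)
        ) / n,
    }


# Toy data, invented for illustration only.
risky_c = frozenset({"C"})
answers = [
    Response("A", "A", risky_c, False, 0.90),  # correct
    Response("C", "A", risky_c, True, 0.95),   # wrong, high-risk, confident
    Response("B", "A", risky_c, False, 0.40),  # wrong but not high-risk
    Response("D", "D", frozenset({"A"}), False, 0.70),  # correct
]
report = safety_report(answers)
print(report)
```

In this toy set, two answers are wrong but only one lands on a high-risk option, so accuracy alone would not show how much of the error is dangerous.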

Results

  • Clean evidence raised mean accuracy from 73.5% to 94.1% and cut high-risk error from 12.0% to 2.6%.

  • Contradictions fell from 12.7% to 2.3%, while dangerous overconfidence dropped from 8.0% to 1.6%.

  • Standard and agentic retrieval-augmented generation improved accuracy modestly, from 76.0% to 78.1%, and reduced some contradictions, but left high-risk error and overconfidence elevated versus clean evidence.

  • Max-context prompting increased latency without closing safety gaps. Self-consistency gave small gains. Ensembles improved aggregate scores but preserved synchronised failures on hard cases.
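
The ensemble finding can be illustrated with a small Python sketch of majority voting. This is a generic voting scheme, not the study's implementation, and the answers below are invented. The same `majority_vote` helper (a hypothetical name) would apply to self-consistency, where the votes are repeated samples from one model instead of answers from three different models.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def synchronised_failures(votes_per_question, gold):
    """Indices of questions where every voter missed the gold answer."""
    return [
        i for i, (votes, truth) in enumerate(zip(votes_per_question, gold))
        if truth not in votes
    ]


# Toy data, invented for illustration: three models, three questions.
gold = ["A", "C", "D"]
votes_per_question = [
    ["A", "A", "B"],  # two of three correct, vote recovers "A"
    ["C", "B", "C"],  # two of three correct, vote recovers "C"
    ["B", "B", "B"],  # all three wrong together, vote stays wrong
]
ensemble = [majority_vote(v) for v in votes_per_question]
ensemble_accuracy = sum(p == t for p, t in zip(ensemble, gold)) / len(gold)
stuck = synchronised_failures(votes_per_question, gold)
print(ensemble, ensemble_accuracy, stuck)
```

Voting repairs the first two questions, where one model disagrees with a correct majority, but cannot help on the third, where all models make the same mistake. That is the sense in which an ensemble can lift aggregate scores while leaving synchronised failures in place.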

Caveat

  • The benchmark focuses on radiology question-answering, so results may not transfer cleanly to other clinical settings.

Potential impact: Clinical large language model deployment should measure safety directly under the target evidence and retrieval conditions, rather than relying on accuracy benchmarks as a proxy.

TOP NEWS

Tempus brings prostate digital pathology into its clinical ecosystem

Source: Tempus via Business Wire · 21 May 2026

Tempus has made the ArteraAI Prostate Test for metastatic hormone-sensitive prostate cancer clinically available. It is the first externally developed digital pathology algorithm in the Tempus ecosystem.

The CLIA-certified, CAP-accredited test combines patient clinical data and histopathology images to generate personalised risk estimates of prostate cancer-specific mortality.

  • Size: Roughly 25,000 US patients are newly diagnosed with metastatic prostate cancer each year.

  • Scope: Metastatic hormone-sensitive prostate cancer risk estimation using clinical data and digital pathology.

  • Opportunity: Tempus can pair the test with its next-generation sequencing assays, giving clinicians a more complete view of tumour biology and risk.

Why it matters: Therapy intensity decisions in metastatic prostate cancer are hard, and risk estimation can be fragmented across genomics, pathology, and clinical features. Tempus is pushing toward a more multimodal commercial model, where tests, data, and algorithms sit inside one clinical ordering ecosystem.

TOP REPO FOR BUILDERS

MONAI gives medical imaging teams a production-ready PyTorch foundation

MONAI is an open-source PyTorch framework for deep learning in healthcare imaging. It provides domain-specific tools that bridge research and clinical deployment.

Useful details

  • Flexible multi-dimensional pre-processing, compositional APIs, and domain-specific networks, losses, and metrics.

  • Supports multi-GPU and multi-node parallelism, bundles for reproducible workflows, and a model zoo.

  • Includes tutorials, Docker images, and integration with the broader PyTorch ecosystem.

Why it matters: MONAI gives imaging AI teams a standardised foundation for building, evaluating, and sharing medical imaging models across academic and clinical environments.

BEDSIDE BETS

Startup rounds, deals, and moves in healthcare AI.

Quick reads across health AI

NEWSLETTER BY:
Dr Ezekiel Dinama

MD and PhD Researcher at Cambridge University applying physics-informed ML/AI to neurophysiological research.
