
A newsletter on the latest in AI for healthcare.
Welcome back,
A new clinical large language model study lands a blunt warning: safety does not scale neatly with accuracy. Clean, clinician-written evidence did more than bigger models or extra compute to cut risky answers.
Tempus launched the ArteraAI Prostate Test for metastatic patients, bringing prostate digital pathology into its clinical ecosystem.
Also inside: the MONAI repo for medical imaging AI, plus health AI and healthcare data moves from Kordata Dynamics, Century Health, and Qualtrics.
Here is what you need to know,
SUMMARY
Top Research Paper
Clinical LLM safety and accuracy respond differently to scaling, with clean clinician-written evidence delivering the largest gains.
Top AI News
Tempus clinically launched the ArteraAI Prostate Test, an AI-powered prostate digital pathology test that analyses clinical data and biopsy images to estimate prostate cancer-specific mortality risk in metastatic hormone-sensitive prostate cancer.
Top Model
MONAI is a PyTorch-based open-source toolkit for medical imaging AI, with ready-made components for building, training, validating, and sharing imaging models more easily.
Bedside Bets
Startup rounds, deals, and moves.
Kordata Dynamics builds AI-powered clinical trial infrastructure and emerged from stealth with pre-seed backing. It uses BIOS Health’s neural biomarker technology for faster precision medicine studies. Deal value not disclosed.
Century Health turns clinical records into research-ready data and raised $5m to scale its AI abstraction platform. Its CHARM model reports 97% accuracy against expert review.
Qualtrics uses experience data AI to predict patient and workforce needs and bought Press Ganey Forsta for $6.75bn, adding healthcare experience data to its AI platform.
Pulse Check
Quick reads across health AI.
A multitask AI system supports basal cell carcinoma diagnosis with dual explanations.
Cedars-Sinai is deploying OpenEvidence, an AI-enabled clinical reference tool that links medical evidence to patient electronic health record context.
NEJM asks whether AI can say “I don’t know”, a key safety question for clinical decision support.
Microsoft Agent Framework helps healthcare AI builders orchestrate production agents in .NET and Python, with use cases like EHR summarisation, prior authorisation workflows, and clinical trial screening.
TOP PAPER
⚖️ Safety and accuracy follow different scaling laws in clinical large language models
Source: arXiv · 5 May 2026
The study challenges the common assumption that scaling clinical large language models for higher accuracy automatically improves safety. It introduces a dedicated evaluation framework to measure how safety metrics behave across practical deployment variables in radiology question-answering tasks.
Question
How do accuracy and clinically relevant safety metrics, including high-risk errors, evidence contradictions, and dangerous overconfidence, scale with model size, evidence quality, retrieval strategy, context length, and inference-time compute in clinical large language models?
Approach
Developed SaFE-Scale, the evaluation framework, and RadSaFE-200, a benchmark of 200 multiple-choice radiology questions with clinician-curated clean and conflict evidence plus option-level safety labels.
Evaluated 34 locally deployed large language models across Qwen, Llama, Gemma/MedGemma, DeepSeek, Mistral, and OpenAI-OSS families.
Tested six deployment conditions: closed-book, clean evidence, conflict evidence, standard retrieval-augmented generation, agentic retrieval-augmented generation, and max-context prompting.
Secondary tests included self-consistency and fixed three-model ensembles. Outcomes tracked accuracy, high-risk error rate, unsafe answers, contradictions, and dangerous overconfidence.

Source: Wind, Nguyen et al.
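For builders, the outcome measures listed under Approach can be made concrete with a small sketch. This is not the paper's scoring code: the record layout, the field names (`chosen`, `correct`, `high_risk_options`, `confidence`) and the 0.8 confidence threshold are all assumptions made for illustration, and the study's own definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One model answer to one multiple-choice question (hypothetical layout)."""
    chosen: str                   # option the model picked, e.g. "B"
    correct: str                  # reference answer
    high_risk_options: frozenset  # options labelled high-risk for this question
    confidence: float             # model-reported confidence in [0, 1]


def safety_report(responses, overconfidence_threshold=0.8):
    """Compute accuracy alongside two safety rates over a list of responses."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")
    correct = sum(r.chosen == r.correct for r in responses)
    high_risk = sum(r.chosen in r.high_risk_options for r in responses)
    # Assumed definition: a high-risk answer given with high stated confidence.
    overconfident = sum(
        r.chosen in r.high_risk_options
        and r.confidence >= overconfidence_threshold
        for r in responses
    )
    return {
        "accuracy": correct / n,
        "high_risk_error_rate": high_risk / n,
        "dangerous_overconfidence_rate": overconfident / n,
    }


# Toy data: three wrong answers, but only two of them land on a high-risk option.
responses = [
    Response("A", "A", frozenset({"C"}), 0.90),
    Response("C", "A", frozenset({"C"}), 0.95),
    Response("B", "A", frozenset({"C"}), 0.60),
    Response("C", "A", frozenset({"C"}), 0.50),
]
report = safety_report(responses)
print(report)
```

In this toy set every wrong answer costs the same in accuracy, yet only some of them count as high-risk, which is why the two kinds of metric can move independently.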
Results
Clean evidence raised mean accuracy from 73.5% to 94.1% and cut high-risk error from 12.0% to 2.6%.
Contradictions fell from 12.7% to 2.3%, while dangerous overconfidence dropped from 8.0% to 1.6%.
Standard and agentic retrieval-augmented generation improved accuracy modestly, from 76.0% to 78.1%, and reduced some contradictions, but left high-risk error and overconfidence elevated versus clean evidence.
Max-context prompting increased latency without closing safety gaps. Self-consistency gave small gains. Ensembles improved aggregate scores but preserved synchronised failures on hard cases.
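One way to see how an ensemble can lift aggregate scores while keeping synchronised failures is a plain majority vote. The sketch below is not the paper's ensemble method; the model names, questions and answers are invented toy data, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(answers, reference):
    """Fraction of questions answered correctly."""
    return sum(answers[q] == reference[q] for q in reference) / len(reference)


# Invented toy data: each model errs on a different easy question,
# but all three err on the hard one (q4).
reference = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}
model_answers = {
    "model_x": {"q1": "A", "q2": "B", "q3": "A", "q4": "A"},
    "model_y": {"q1": "A", "q2": "C", "q3": "C", "q4": "A"},
    "model_z": {"q1": "B", "q2": "B", "q3": "C", "q4": "A"},
}

ensemble = {
    q: majority_vote([answers[q] for answers in model_answers.values()])
    for q in reference
}
individual = {name: accuracy(a, reference) for name, a in model_answers.items()}
ensemble_accuracy = accuracy(ensemble, reference)

# Questions that every model gets wrong: voting cannot repair these.
shared_failures = [
    q for q in reference
    if all(answers[q] != reference[q] for answers in model_answers.values())
]

print(individual, ensemble_accuracy, shared_failures)
```

Voting repairs the errors that only one model makes, so the ensemble scores higher than any single model here, but the question all three miss stays wrong.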
Caveat
The benchmark focuses on radiology question-answering, so results may not transfer cleanly to other clinical settings.
Potential impact: Clinical large language model deployment should measure safety directly under the target evidence and retrieval conditions, rather than relying on accuracy benchmarks as a proxy.
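A minimal sketch of what gating on safety directly, rather than on accuracy alone, could look like. The accuracy and high-risk error figures are the mean values reported in the Results above; the two thresholds, the `gate` helper and the "baseline" label (the summary does not name the comparison condition) are assumptions for illustration only.

```python
# Mean figures from the Results section (percent).
# "baseline" is the unnamed comparison condition for the clean-evidence numbers.
reported = {
    "baseline": {"accuracy": 73.5, "high_risk_error": 12.0},
    "clean_evidence": {"accuracy": 94.1, "high_risk_error": 2.6},
}

# Hypothetical release thresholds; a real deployment would set its own.
MIN_ACCURACY = 70.0
MAX_HIGH_RISK_ERROR = 5.0


def gate(metrics):
    """Check the accuracy and safety thresholds separately."""
    return {
        "accuracy_ok": metrics["accuracy"] >= MIN_ACCURACY,
        "safety_ok": metrics["high_risk_error"] <= MAX_HIGH_RISK_ERROR,
    }


decisions = {name: gate(m) for name, m in reported.items()}

# Relative drop in high-risk error between the two conditions.
relative_drop = (
    reported["baseline"]["high_risk_error"]
    - reported["clean_evidence"]["high_risk_error"]
) / reported["baseline"]["high_risk_error"]

for name, decision in decisions.items():
    verdict = "deploy" if all(decision.values()) else "hold"
    print(f"{name}: {decision} -> {verdict}")
print(f"relative drop in high-risk error: {relative_drop:.1%}")
```

Under these assumed thresholds the baseline would clear an accuracy-only gate yet fail the safety check, which is the gap the study warns about.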
TOP NEWS
Tempus brings prostate digital pathology into its clinical ecosystem
Source: Tempus via Business Wire · 21 May 2026
Tempus has made the ArteraAI Prostate Test for metastatic hormone-sensitive prostate cancer clinically available. It is the first externally developed digital pathology algorithm in the Tempus ecosystem.
The CLIA-certified, CAP-accredited test combines patient clinical data and histopathology images to generate personalised risk estimates of prostate cancer-specific mortality.
Size: Roughly 25,000 US patients are newly diagnosed with metastatic prostate cancer each year.
Scope: Metastatic hormone-sensitive prostate cancer risk estimation using clinical data and digital pathology.
Opportunity: Tempus can pair the test with its next-generation sequencing assays, giving clinicians a more complete view of tumour biology and risk.
Why it matters: Therapy intensity decisions in metastatic prostate cancer are hard, and risk estimation can be fragmented across genomics, pathology, and clinical features. Tempus is pushing toward a more multimodal commercial model, where tests, data, and algorithms sit inside one clinical ordering ecosystem.
TOP REPO FOR BUILDERS
MONAI gives medical imaging teams a production-ready PyTorch foundation
MONAI is an open-source PyTorch framework for deep learning in healthcare imaging. It provides domain-specific tools that bridge research and clinical deployment.
Useful details
Flexible multi-dimensional pre-processing, compositional APIs, and domain-specific networks, losses, and metrics.
Supports multi-GPU and multi-node parallelism, bundles for reproducible workflows, and a model zoo.
Includes tutorials, Docker images, and integration with the broader PyTorch ecosystem.
Why it matters: MONAI gives imaging AI teams a standardised foundation for building, evaluating, and sharing medical imaging models across academic and clinical environments.
NEWSLETTER BY:
Dr Ezekiel Dinama
MD and PhD Researcher at Cambridge University applying physics-informed ML/AI to neurophysiological research.
