From Scribes to Safety Nets: Inside the OpenAI × Penda Health Copilot Trial
How a traffic‑light LLM trimmed real‑world clinical errors—and why it should reset the go‑to‑market playbook for generative‑AI health tech.
The operating room taught me early: a good assistant anticipates your next move. In medicine’s new AI chapter, the assistant isn’t scrubbed in holding a retractor—it’s running quietly in the background, flagging the misstep you might have made. The question is, can this digital co-pilot really elevate our care without hindering our clinical judgment?
A landmark study from OpenAI and Penda Health caught the attention of clinicians and health tech leaders alike. Unlike the avalanche of AI hype focused on flashy benchmarks and doctor-vs-machine showdowns, this was something different: a real-world experiment in which an AI “clinical co-pilot” shadowed 39,000 patient visits in vivo. The results—fewer diagnostic and treatment errors among clinicians using the AI—hint at a turning point. Below, I unpack the significance of this study in the grounded, skeptical voice of a surgeon-founder who’s seen his share of tech hype cycles.
Context: From Scribes and Billing Bots to a Clinical Co-Pilot
Over the past year, generative AI has been the darling of health tech, but mostly in adjacent domains. Health systems have gravitated to “low-risk” administrative uses—ambient documentation tools and revenue cycle assistants—where automation can ease burnout without touching life-or-death decisions. Indeed, ambient AI scribes that automatically draft clinical notes from patient visits have arguably been the first large-scale application of generative AI in healthcare. These tools promise to offload paperwork by using speech recognition and large language models (LLMs) to transcribe and summarize doctor-patient conversations into electronic health record notes. In parallel, hospitals are eyeing AI to streamline back-office tasks: prior authorizations, billing codes, patient outreach, and other revenue cycle management (RCM) chores. It’s no surprise that these administrative aides gained traction first—after all, a mistake in a clinic note or a billing claim, while annoying, is unlikely to put a patient in danger.
Contrast that with clinical decision support (CDS), where a wrong suggestion could directly impact patient care. Traditional CDS systems (like drug interaction checkers or rule-based care pathways) have been around for years, but LLM-based decision support is largely uncharted territory. Until now, most generative AI “clinical” evaluations have been constrained to bench tests—answering board exam questions, summarizing vignettes, or retrospectively predicting diagnoses from charts. Those studies generated provocative headlines (“AI scores higher than doctors on XYZ test”), but they were a far cry from deploying an AI in the messy milieu of a live clinic. The OpenAI × Penda Health study breaks this mold. It drops an LLM-driven tool into real primary care clinics, assisting frontline providers in real time. In short, it shifts focus from AI as a documentation scribe or data analyst to AI as a true clinical co-pilot at the point of care. This is a fundamentally different use-case—one that treads directly into patient safety, quality of care, and clinician decision-making.
Link to the study announcement and paper: OpenAI × Penda Health Study.
What the Study Showed: A Safety Net for Clinicians
OpenAI and Penda Health ran a large pragmatic trial of an LLM-powered decision support tool called AI Consult in Penda’s network of primary care clinics in Nairobi. Over roughly three months, 106 clinicians were split into two groups—half had access to AI Consult during patient visits, half provided usual care without it. In total, nearly 40,000 patient visits (across 15 clinics) were analyzed. Crucially, this wasn’t a chatbot diagnosing in a vacuum; it was a tool integrated into the electronic medical record (EMR) that only activated when it detected a possible error in the clinician’s documentation, diagnosis, or treatment plan. When triggered, it would flash a “traffic light” alert: green (no concern), yellow (moderate concern, optional review), or red (serious concern, needs clinician review). This design kept the human firmly in charge—AI Consult remained silent unless it thought the provider might be veering off course, at which point it offered a nudge or warning.
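To make that mechanism concrete, here is a minimal sketch in Python of how a traffic-light check like this could be wired into an EMR workflow. This is not Penda's or OpenAI's code; every name here (llm_review, on_plan_saved) is hypothetical. It only captures the shape the paper describes: the model reviews the clinician's draft at defined decision points, stays silent on green, and escalates on yellow or red.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    GREEN = "green"    # no concern: tool stays silent
    YELLOW = "yellow"  # moderate concern: optional review
    RED = "red"        # serious concern: needs clinician review


@dataclass
class Alert:
    severity: Severity
    message: str


def llm_review(visit_note: str, plan: str) -> Alert:
    """Placeholder for the model call that would check the draft plan
    against the documented history and local guidelines; stubbed here
    so the sketch runs end to end."""
    return Alert(Severity.GREEN, "No concerns identified.")


def on_plan_saved(visit_note: str, plan: str) -> None:
    """Hypothetical hook fired when a clinician commits a diagnosis or
    prescription in the EMR, i.e. the 'natural decision point' in the study."""
    alert = llm_review(visit_note, plan)
    if alert.severity is Severity.GREEN:
        return  # stay quiet to avoid alert fatigue
    prefix = "Advisory" if alert.severity is Severity.YELLOW else "STOP, please review"
    print(f"{prefix}: {alert.message}")


if __name__ == "__main__":
    on_plan_saved("5-year-old, 2 days of fever and cough", "Amoxicillin 250 mg TID")
```

The design choice worth noticing is the default-silent behavior: the tool earns attention by withholding it most of the time.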
The headline results were promising. According to independent physician reviewers who audited thousands of de-identified visit notes, clinicians with AI assistance made significantly fewer errors than those without. Diagnostic errors dropped by 16%, and treatment errors by 13% in the AI-supported group. The AI group also had fewer mistakes in history-taking (32% reduction) and in ordering appropriate investigations (10% reduction). In practical terms, implementing AI Consult system-wide at Penda could avert an estimated 22,000 diagnostic mistakes and 29,000 treatment mistakes per year. Notably, these improvements weren’t just in trivial or “easy” cases. The greatest error reductions were seen in cases where the AI flagged a serious issue (a “red alert”) – in those encounters, diagnostic errors fell by 31% and treatment errors by 18%, indicating that the tool particularly helped catch likely major misses.
Beyond the numbers, user feedback was strikingly positive. Every clinician in the AI group reported that AI Consult improved their care quality, and 75% of them described this improvement as “substantial”. Many providers saw the AI as a kind of virtual specialist on call – some called it “a consultant in the room” or “one of the best innovations to happen at Penda,” according to the survey comments. Rather than feeling undermined, clinicians embraced it as a learning aid: it broadened their medical knowledge and even sharpened their clinical skills over time. This manifested in an interesting trend: as the study progressed, the AI group clinicians triggered fewer alerts, suggesting they were internalizing the feedback. The percentage of visits that prompted any red-flag alert in the AI group fell from 45% at the start of the study to 35% by the end. In other words, doctors learned from the AI and started preemptively avoiding pitfalls – exactly the kind of behavior change one hopes for with good decision support.
So what about the patients in all this? Reassuringly, there was no evidence of harm or compromised outcomes from using the AI assistant. About 4% of patients in both groups reported not feeling better on routine 8-day follow-up calls, with no significant difference between AI-supported vs. control visits. This real-world safety data is critical: it begins to address the very understandable knee-jerk fear that an AI might “hallucinate” a dangerously wrong suggestion. In this trial, that worst-case scenario didn’t materialize. Clinicians either correctly overrode bad advice, or the system’s design (only flagging clear-cut concerns) largely prevented dangerous blunders.
It’s Not All Gravy: Limitations
It’s worth noting a couple of nuances and limitations from the results. First, the improved care did not yet translate to a statistically significant difference in patient-reported outcomes. This isn’t entirely surprising given the relatively short follow-up and the fact that many primary care missteps (missing a slightly elevated blood pressure, or over-prescribing an antibiotic) are unlikely to cause immediate symptoms. The study was better powered to detect process improvements (fewer errors) than downstream patient outcomes. Second, clinicians using the AI took slightly longer on visits on average. This makes intuitive sense—reviewing alerts or adjusting plans takes time. The study didn’t quantify the delay in detail, but anecdotally it was a tolerable trade-off for better quality. Still, efficiency is a factor to watch; no one wants an “AI Clippy” that slows down clinicians or frustrates patients. Finally, this was not a blinded trial and not randomized at the individual patient level (clinicians, not patients, were assigned to AI vs. control groups). It was a quality improvement study done in one health system, which limits how far the findings generalize. Penda’s clinicians operate in a unique high-volume, resource-variable setting. We don’t know if the same AI copilot would have equal impact in, say, a U.S. suburban outpatient practice, a community clinic in downtown Chicago, or a tertiary hospital ward. Those are questions for future research.
What Makes This Study Different: Implementation, Not Just Algorithm
Why did this particular AI deployment succeed in reducing errors when so many high-profile AI-in-health efforts (IBM Watson for Oncology, anyone?) fell flat? The answer lies in implementation science excellence—the unglamorous, nuts-and-bolts work of fitting a technology into clinical workflow and culture. The OpenAI–Penda team attributes their positive results to three factors: a capable model, a clinically-aligned implementation, and active deployment support. Let’s unpack those in plain English.
Clinically-Aligned Design (Clinicians in the Loop): AI Consult wasn’t parachuted into clinics by techies and left to sink or swim; it was co-developed with Penda’s clinicians from early on. Penda had already experimented with decision support tools for years (from simple checklists to a first-gen LLM assistant), learning what worked and what didn’t. Those lessons shaped AI Consult’s design. Crucially, the tool was built to respect the flow of a patient visit. It runs asynchronously in the background, intercepting the clinician only at natural decision points. When a provider enters a diagnosis or prescription into the EMR, AI Consult evaluates it against the patient’s data and clinical guidelines. If everything looks reasonable, the AI stays quiet (green status) so as not to create alert fatigue. If it spots a minor issue or a possible improvement, it offers a gentle nudge (yellow advisory). And if it detects something that looks outright unsafe or inconsistent with standards, it raises a red flag that essentially says “Stop 🛑 – reconsider this decision”. This traffic-light alert interface is simple by design; it minimizes cognitive load. Clinicians don’t have to wade through paragraphs of AI text in the middle of seeing a patient. Instead, they get a clear visual cue proportional to the severity of a potential error. And importantly, they can ignore the AI’s suggestion at their discretion. AI Consult doesn’t force anything; it preserves clinician autonomy, functioning like a failsafe alarm system rather than an intrusive second-guesser. This approach is worlds apart from earlier “AI in medicine” ideas where an algorithm might independently make diagnoses or treatment plans. Here, the AI is an ever-vigilant assistant, not an autonomous decision-maker. This was a critical take-home for me as a clinician-builder. I love the idea of deploying ambitious, transformative care orchestration tools. But simplicity, reliable implementation, and measurable results trump moving fast and breaking things every time.
Shadow Mode and Iteration: Before rolling out AI Consult system-wide, Penda tested it in “shadow mode” and pilot phases to ensure it was both effective and safe. In early versions (AI Consult v1), clinicians could voluntarily summon AI advice by clicking a button in the EMR when they wanted input on a case. That version saw limited use—busy providers often skipped hitting the help button, and long AI narratives were impractical to read during short visits. The team learned that passive or on-demand AI wasn’t moving the needle. So they iterated toward the current design: an active assistant that automatically monitors each visit and only interrupts when necessary. This measured iterative deployment contrasts sharply with the “big bang” go-lives that sink many hospital tech projects. By the time the randomized study began, AI Consult was already tuned to Penda’s environment (e.g. calibrated to Kenyan clinical guidelines and epidemiology) and had buy-in from leadership and front-line users.
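For readers who build these systems, here is a hedged sketch of what “shadow mode” can look like in practice, building on the traffic-light sketch above (again, the names are hypothetical; the paper describes the concept, not an implementation). The same check runs on every visit, but its output is only logged for audit and calibration until the team flips the tool to active deployment.

```python
import json
import logging

# Builds on the traffic-light sketch above (llm_review, Severity, Alert).
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_consult.shadow")

SHADOW_MODE = True  # True during shadow/pilot phases, False for active deployment


def show_alert_in_emr(visit_id: str, alert: "Alert") -> None:
    """Stand-in for the EMR integration that surfaces a yellow/red alert."""
    print(f"[{visit_id}] {alert.severity.value.upper()}: {alert.message}")


def handle_visit(visit_id: str, visit_note: str, plan: str) -> None:
    alert = llm_review(visit_note, plan)
    # Always log what the model would have flagged, for audit and calibration.
    logger.info(json.dumps({
        "visit_id": visit_id,
        "severity": alert.severity.value,
        "message": alert.message,
    }))
    # Only interrupt the clinician once the tool graduates from shadow mode.
    if not SHADOW_MODE and alert.severity is not Severity.GREEN:
        show_alert_in_emr(visit_id, alert)
```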
Active Deployment (Training and Culture Change): One of the most interesting findings was that AI alone didn’t automatically yield full benefits—people had to learn how to use it well. During an initial “induction” period, clinicians with AI got the same number of red alerts as those without (in the control group, researchers ran the AI in the background to see what it would have flagged). In that early phase, the AI-group providers often ignored the red alerts or didn’t resolve them, resulting in a high “left in red” rate (~35–40% of serious flags went unaddressed, similar to control). After this, Penda’s leadership intervened with a robust change management program—what the study calls “active deployment.” They introduced one-on-one coaching for clinicians, identified peer “AI champions”, and gave feedback on performance relative to the AI alerts. The impact was dramatic: once these supports kicked in, the proportion of unheeded critical alerts in the AI group plummeted to ~20%, while the control (no-AI) group remained at ~40%. In plainer terms, doctors started paying attention to the AI’s red flags and acting on them. This is a vital point for anyone implementing AI in healthcare: the tool’s success depended on user training and engagement, not just the tech itself. I have seen this exact same implementation resistance with a triage voice agent: inbox items piled up when a medical assistant didn’t use our triage queue, and the tool’s utility started to be questioned. Dedicated training turned things around quickly. It’s the classic last-mile problem—if clinicians don’t trust or understand the system, it won’t matter how good your algorithm is. Penda cracked that nut by investing in onboarding, continual reinforcement, and cultivating a sense that using AI Consult was now part of their standard for high-quality care.
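Because so much of the active-deployment story hinges on a single metric, the share of red flags that were never resolved, here is a small illustration of how a “left in red” rate could be computed from visit-level alert logs. The field names and toy numbers are my own assumptions, not Penda’s schema; the ~40% result simply mirrors the induction-period rate reported above.

```python
def left_in_red_rate(visits: list[dict]) -> float:
    """Share of red-flagged visits where at least one red alert was
    never resolved before the visit was closed."""
    red_visits = [v for v in visits if v["red_alerts"] > 0]
    if not red_visits:
        return 0.0
    unresolved = [v for v in red_visits if v["resolved"] < v["red_alerts"]]
    return len(unresolved) / len(red_visits)


# Toy data: during induction roughly 2 of 5 red-flag visits were left
# unresolved (~40%); after coaching and peer champions, closer to 1 in 5.
induction_visits = [
    {"red_alerts": 1, "resolved": 0},
    {"red_alerts": 2, "resolved": 2},
    {"red_alerts": 1, "resolved": 1},
    {"red_alerts": 1, "resolved": 0},
    {"red_alerts": 1, "resolved": 1},
]
print(f"left-in-red rate: {left_in_red_rate(induction_visits):.0%}")  # -> 40%
```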
Safety and Trust Mechanisms: Penda’s approach also stood out for its safety checks. They performed an extensive internal audit of AI suggestions early on, scoring each for quality and potential harm. Any AI outputs that were inappropriate or dangerous (scores of “1” on their 5-point scale) were catalogued and presumably used to refine the system or provide feedback to OpenAI’s team. This created a feedback loop to address “hallucinations” or bad advice before they could propagate widely. Additionally, by involving Kenya’s regulators, ethics boards, and the clinicians themselves from the start, they built a foundation of trust. Clinicians knew that the tool had been vetted and was there to help, not judge them. This bottom-up trust-building is as important as any UI tweak. It’s probably one reason the human users were enthusiastic, even grateful, for AI Consult rather than defensive about it.
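The paper describes physician reviewers scoring AI outputs on a 5-point quality-and-harm scale and cataloguing anything rated a 1. A minimal sketch of that kind of triage loop might look like the following; the data model and example suggestions are illustrative assumptions, not Penda’s audit tooling.

```python
from dataclasses import dataclass


@dataclass
class AuditedSuggestion:
    suggestion_id: str
    text: str
    quality_score: int  # 1 = inappropriate/dangerous ... 5 = excellent (reviewer-assigned)


def flag_for_feedback(audits: list[AuditedSuggestion]) -> list[AuditedSuggestion]:
    """Collect the lowest-scoring outputs so they can be catalogued,
    reviewed, and fed back into prompt and guardrail refinement."""
    return [a for a in audits if a.quality_score == 1]


audits = [
    AuditedSuggestion("s1", "Antibiotic suggested for a likely viral URI", 2),
    AuditedSuggestion("s2", "Adult dosing recommended for a 4-year-old", 1),
    AuditedSuggestion("s3", "Flags missing malaria test in a febrile child", 5),
]
for bad in flag_for_feedback(audits):
    print(f"Escalate {bad.suggestion_id}: {bad.text}")
```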
The OpenAI–Penda Health study wasn’t a victory for a particular algorithm; it was a proof-of-concept for how to responsibly weave AI into the clinical fabric. The traffic-light system, the shadow testing, the iterative co-design, the training and cultural alignment—these are the unsung heroics behind those headline error reduction stats. It’s implementation with a capital “I”. As a surgeon who’s seen plenty of health IT pilot projects fizzle, I found this especially heartening. It speaks to a future where AI isn’t an alien gadget dropped onto doctors, but a natural extension of the team, developed with the team.
Why It Matters: A Glimpse of the Real “AI Application Layer” in Care
This study offers a rare peek at what it looks like when generative AI actually joins the care team rather than observing from the sidelines. In doing so, it moves the conversation beyond “Can an AI score as well as a doctor on a test?” to “Can an AI help doctors perform better in the real world?”. That shift from performance in isolation to performance integration is profound for the field of health AI.
There are implications here for trust and adoption of AI in healthcare. We’ve all heard the refrain that clinicians are skeptical of black-box AI, worried about reliability and medicolegal risk. And to be sure, those concerns don’t vanish overnight. But the Penda Health experience demonstrates that when AI is deployed thoughtfully and tangibly improves clinicians’ day-to-day work, trust can be earned. By the end of the study, Penda’s clinicians reportedly didn’t want to give the tool up. It had become part of their norm, even a source of confidence for less-experienced providers. That kind of voluntary uptake is magic in healthcare innovation. Just harken back to how long it took for electronic prescribing or checklists to gain acceptance, often by mandate. Here we have anecdotal evidence of an AI system that clinicians voluntarily evangelized. For a health tech leader, this is the dream scenario: the frontline pulling the tech into practice, rather than it being pushed by administrators or regulators.
The study also underscores the potential for behavior change and continuous learning. We often talk about AI learning from data, but this is about clinicians learning from AI. The drop in red alerts over time suggests that doctors were internalizing the AI’s guidance—perhaps double-checking doses or re-thinking a differential diagnosis before the AI had to say anything. In essence, the AI co-pilot can function like a seasoned mentor looking over the shoulder of a junior clinician, nudging them until the good habits stick. That raises fascinating possibilities: could sustained AI support actually raise the baseline quality of a clinical workforce over months or years? Could it accelerate the professional development of clinicians by exposing them to more consistent feedback? Those aren’t questions the study answered, but it certainly prompts them. If the answer is yes, it means AI tools might not just catch mistakes in the moment, but also make providers better long-term—which multiplies the impact on patient care.
Finally, this real-world trial helps cut through the abstract hype around “AI in healthcare” by providing a concrete blueprint. It’s a reminder that the real innovation is in application, not just invention. We’ve seen dazzling demos of GPT-4 solving medical puzzles on paper, but until now we had little evidence that such models could reliably slot into clinical workflows. The Penda study gives a tentative affirmative: yes, general-purpose AI can be harnessed in frontline care to measurably improve quality, and it can be done without chaos or catastrophe. That nudges the Overton window open for health systems to start seriously considering these tools in care delivery. For health system executives and policymakers, it’s a signal that AI is graduating from the lab to the clinic’s frontlines. The conversation can evolve from “Can it do it?” to “How do we do it responsibly at scale?”.
Watchouts and Open Questions: From Hype to Lasting Change
For all the optimism, seasoned clinicians and health IT folks will rightly ask: what are the catch points? Here are a few caveats and questions that emerged for me, viewing this through a pragmatic lens:
Durability of Behavior Change: The “training effect” observed—clinicians improving by using the tool—is encouraging, but will it last? Humans are quick to revert to old habits, especially if a new tool is removed or if novelty wears off. It’s one thing to see error rates drop over a 3-month pilot; it’s another to maintain that improvement 3 years later. There’s a possible paradox here. If the AI co-pilot does its job exceptionally well, clinicians might eventually become dependent on it (like always having GPS for driving). Does reliance creep in such that, if the AI were suddenly unavailable, errors would spike above even baseline? Or conversely, does it truly “embed” better practice in the clinician such that the AI becomes less necessary over time? We don’t know. Penda is reportedly continuing this research in a more formal randomized controlled trial focused on patient outcomes. Watching the long-term curves from such studies—and even doing post-trial assessments of clinician knowledge/behavior—will be key to understanding the lasting impact. I’d also be interested in how this scales to specialty care or more complex decision environments. Primary care is broad and dynamic, but each decision node (like choosing a diagnosis or med) is relatively bounded. In surgery or emergency care, decisions can be more fluid and multi-step. Would a co-pilot be as effective there? These questions linger.
Regulatory and Legal Hurdles: It’s not lost on anyone that this groundbreaking deployment happened in Kenya, not the U.S. American hospitals face a complex web of regulations for clinical decision support systems. In the U.S., an AI system providing diagnostic or treatment guidance can quickly fall under the definition of a medical device, triggering FDA oversight. To date, the FDA has not approved any generative AI-based clinician-facing tools, and current rules struggle to accommodate the adaptive, open-ended nature of LLM outputs. Regulators worry, rightly, that if an AI is effectively telling a doctor how to treat a patient, it had better be thoroughly vetted for safety and efficacy. Who is liable if the AI suggests something that leads to harm—the clinician, the hospital, the software vendor? These are thorny issues. The Penda study skirts them for now by taking place in a different regulatory context and under a research banner. But if health systems in the U.S. or Europe want to replicate this, they’ll need to navigate compliance. This might involve positioning the AI as simply a reference tool (to fit under regulatory exemptions for “non-device” clinical decision support), or going all-in on obtaining regulatory clearance as Software as a Medical Device (SaMD). Policymakers will need to rethink frameworks. This might include new “general purpose AI” approval pathways, so that beneficial tools like AI Consult aren’t indefinitely stuck in pilot purgatory due to regulatory gray zones.
Hallucinations and Reliability: The specter of AI “hallucination” (confidently wrong answers) always looms. Generative models sometimes output content that is nonsensical or dangerously incorrect. Penda’s controlled implementation mitigated this risk by limiting when and how the AI could interject, and by keeping a human in the loop. In the Penda trial, users still exercised judgment—there were certainly cases where the AI might have flagged something incorrectly and the clinician wisely ignored it. But will that always be so? As these tools get better and busier clinicians come to rely on them, the risk of automation bias increases. We need safeguards: improved AI training to reduce outright errors, continual monitoring of AI recommendations in practice, and perhaps even “confidence indicators” or explanations for why the AI flagged something. Interestingly, the study team did have the AI explain its red flags in text when they occurred, which clinicians could review. We might require even more robust explanation features to catch truly off-base suggestions. Moreover, we must account for the inevitable future incident where an AI co-pilot somewhere does contribute to a patient’s harm. How the community reacts—do we halt all use, do we issue fixes and move on?—will shape the trajectory of trust in this tech. Transparency will be paramount; if AI Consult or its successors make a serious mistake, users and other stakeholders must hear about it along with a plan for prevention. In aviation (where the “co-pilot” metaphor originates), every close-call or error prompts an incident investigation to update protocols. Healthcare AI might need a similar safety reporting and learning system.
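To make “continual monitoring” less abstract, here is one hypothetical shape it could take. This is not something the study reports implementing, just a sketch: every alert carries the model’s rationale and the clinician’s response, so override rates can be tracked as an early-warning signal for both off-base suggestions and creeping automation bias.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AlertRecord:
    visit_id: str
    severity: str          # "yellow" or "red"
    rationale: str         # the model's stated reason, shown to the clinician
    clinician_action: str  # "accepted", "modified", or "overridden"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def override_rate(records: list[AlertRecord], severity: str = "red") -> float:
    """Fraction of alerts of a given severity that clinicians overrode.
    A rate near zero on red alerts may mean clinicians have stopped
    scrutinizing the tool; a very high rate may mean the tool is off base."""
    relevant = [r for r in records if r.severity == severity]
    if not relevant:
        return 0.0
    return sum(r.clinician_action == "overridden" for r in relevant) / len(relevant)
```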
Generalizability and Access: This study took place in a resource-limited setting with non-physician clinicians (Kenyan clinical officers) as primary users. One could argue that such contexts are both where AI decision support is most needed (due to clinician shortages and wide knowledge breadth required) and where it might be easier to implement (less legacy IT, more openness to innovation). If AI Consult works in Nairobi, will it work in New York? The patient mix and disease patterns differ; the medicolegal environment differs; clinician training differs. The AI was tuned to local guidelines and patterns—doing the same for U.S. guidelines is feasible, but what about the variability of practice standards and customization for each hospital’s protocols? And there’s the question of whether AI could inadvertently widen disparities and care gaps. If well-resourced health systems adopt AI copilots and see care quality jump, that’s great for them—but what about safety-net systems that can’t afford these tools or the implementation effort? Conversely, perhaps this tech could be a leveling force, by bringing up quality in under-resourced areas the most. We’ll need more studies in varied contexts (community hospitals, academic centers, different countries) to answer these questions. The good news is that after Penda Health, more health organizations are likely to step forward and experiment, ideally in partnership with researchers. Early evidence is promising, but as any scientist would say, we need replication.
As a tech-optimist surgeon, I’m excited, but also mindful that nothing in health care is a simple plug-and-play. The real work begins after the first breakthrough: making sure it wasn’t a fluke and that we manage the risks on the road to scale.
Closing Thoughts: Beyond the “AI vs. Doctor” Trope (and What’s Next)
Reading the OpenAI × Penda paper felt different from the usual AI-in-medicine fare. It wasn’t another contest of man versus machine on a contrived data set, nor a grandiose claim that “AI will replace doctors.” Instead, it presented a collaborative model—AI working with clinicians, quietly improving care in the background, like a surgeon’s steady assistant. This is a refreshing departure from the trope of pitting algorithms against physicians. As a clinician-builder, I find this augmentation narrative far more compelling. It aligns with the reality that healthcare is delivered by teams, and now perhaps one of those team members can be an AI that never sleeps and has read every guideline under the sun.
We should, of course, keep our critical eyes open. But it’s hard not to see this study as a glimpse of the future if we get things right. A future where AI in health care is neither magic nor menace, but a normal part of clinical practice. Where saying “let’s run the AI check” could be as routine as a pharmacist running a drug interaction screen. Achieving that at scale will require more validation, more fine-tuning, and likely a few setbacks to learn from. Yet, the door has been opened. As Dr. Isaac Kohane at Harvard put it after reviewing the Penda Health results, “This is what I was waiting for”—evidence from a prospective, controlled deployment that AI can tangibly help real patients and providers.
In my next article (Part II), I’ll dive deeper into what it might take to bring an AI co-pilot like this into a large health system in the U.S. I’ll try to explore the operational and policy hurdles, and whether the juice is worth the squeeze from a healthcare ROI perspective. We’ll also discuss how these findings could influence medical training and the notion of clinical expertise in the age of AI assistance. For now, I’ll leave you with this thought: the true measure of “AI in healthcare” will not be whether it can outperform doctors in a quiz, but whether it can help doctors continue improving. The OpenAI–Penda Health study gives us a real use-case where that seems to have happened. And that is a story worth following.
References
Peterson Health Technology Institute (PHTI) AI Taskforce. Adoption of Artificial Intelligence in Healthcare Delivery Systems: Early Applications and Impacts. PHTI; March 2025.
OpenAI & Penda Health (Korom R, Kiptinness S, Adan N, et al.). Pioneering an AI Clinical Copilot with Penda Health. OpenAI Blog. Published July 22, 2025.
Korom R, Kiptinness S, Adan N, et al. AI-based clinical decision support for primary care: a real-world study. arXiv [preprint] 2507.16947; posted July 31, 2025.
Barry J. OpenAI delivers largest-ever study of clinical AI. Digital Health Wire. July 24, 2025.
Park A. AI helps prevent medical errors in real-world clinics. TIME. July 2025.
Weissman A, Levine J. AI in health care and the FDA’s blind spot. Penn LDI (Leonard Davis Institute) Research Update. 2023.