Image by Irwan, from Unsplash

AI in Healthcare: New Stanford Benchmark Measures Real-World Performance

  • Written by Kiara Fabbri, Former Tech News Writer
  • Fact-Checked by Sarah Frazier, Former Content Manager

Stanford researchers tested AI agents in a virtual EHR environment, reporting how models like Claude 3.5 can assist doctors with routine healthcare tasks.

In a rush? Here are the quick facts:

  • AI agents can perform tasks like ordering tests and prescribing medications.
  • Claude 3.5 Sonnet v2 achieved the highest success rate at 70%.
  • Many AI models struggled with complex workflows and system interoperability.

Stanford researchers are setting new evaluation criteria to determine whether AI systems can perform real-world medical tasks. While AI has demonstrated potential across various medical applications, experts warn it still needs further testing.

“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author and Clinical Informatics Fellow at Stanford Health Care.

To investigate this, the team developed MedAgentBench, a virtual electronic health record (EHR) system built to assess how AI agents perform the medical tasks doctors carry out on a daily basis.

Unlike chatbots, AI agents can act autonomously, handling complex, multistep tasks such as retrieving patient data, ordering tests, and prescribing medications.

“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward,” Chen added.
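Chen's description maps onto a simple retrieve-reason-act loop. The sketch below is a hypothetical illustration in Python: the VirtualEHR class, the potassium task, and the 3.5 mmol/L threshold are assumptions made for demonstration, not MedAgentBench's actual data model or interface.

```python
# A minimal sketch of the "retrieve, reason, act" loop described above.
# The record layout, the task, and the threshold are illustrative assumptions,
# not MedAgentBench's actual data model or API.

from dataclasses import dataclass, field

@dataclass
class VirtualEHR:
    """Toy stand-in for a virtual electronic health record system."""
    labs: dict = field(default_factory=dict)    # patient_id -> {test: value}
    orders: list = field(default_factory=list)  # orders the agent places

    def get_lab(self, patient_id: str, test: str):
        return self.labs.get(patient_id, {}).get(test)

    def place_order(self, patient_id: str, order: str):
        self.orders.append({"patient": patient_id, "order": order})

def agent_step(ehr: VirtualEHR, patient_id: str) -> str:
    """One agent task: retrieve a lab value, reason about it, act by placing an order."""
    potassium = ehr.get_lab(patient_id, "potassium")           # retrieve
    if potassium is None:
        ehr.place_order(patient_id, "basic metabolic panel")   # act: data is missing
        return "ordered basic metabolic panel"
    if potassium < 3.5:                                        # reason (illustrative threshold)
        ehr.place_order(patient_id, "potassium replacement")   # act
        return "ordered potassium replacement"
    return "no action needed"

if __name__ == "__main__":
    ehr = VirtualEHR(labs={"patient-001": {"potassium": 3.1}})
    print(agent_step(ehr, "patient-001"))  # -> ordered potassium replacement
    print(ehr.orders)
```

In the benchmark itself, the reasoning step would be performed by a large language model rather than hard-coded rules, and the actions would go through the virtual EHR's interfaces; the hard-coded logic here exists only to make the loop concrete.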

To populate the virtual system, the researchers drew on 100 patient profiles comprising 785,000 records. They then tested about a dozen large language models (LLMs) on 300 clinical tasks.

Claude 3.5 Sonnet v2 was the top performer with a 70% success rate; however, many models struggled with complex workflows and system interoperability.

“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, PhD student and co-author.

The experts predict that AI agents will take over basic clinical administrative work, potentially reducing physician burnout without replacing human doctors.

“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis,” Black added.

Image by Solen Feyissa, from Unsplash

World’s Largest ChatGPT Study Shows How The Bot Shapes Daily Life

  • Written by Kiara Fabbri, Former Tech News Writer
  • Fact-Checked by Sarah Frazier, Former Content Manager

OpenAI released on Monday what it calls “the largest study to date of how people are using ChatGPT,” exploring how the chatbot is shaping daily life and work.

In a rush? Here are the quick facts:

  • ChatGPT adopted by 10% of world’s adults by July 2025.
  • Over 700 million weekly users send 2.5 billion messages daily.
  • About 49% of chats are for advice.

The study, conducted with Harvard economist David Deming and published as a National Bureau of Economic Research (NBER) working paper, analyzed 1.5 million conversations.

“We’re releasing the largest study to date of how people are using ChatGPT, offering a first-of-its-kind view into how this broadly democratized technology creates economic value through both increased productivity at work and personal benefit,” OpenAI stated.

The study reveals that ChatGPT’s user base has grown to 700 million weekly active users who send 2.5 billion messages a day. By July 2025, the chatbot had been used by around 10% of the world’s adult population.

Research indicates that worldwide adoption rates have increased, especially in low- and middle-income countries, where growth has been over four times faster than in richer nations.

One major finding is that the early gender gap in usage has closed. “As of mid-2025, ChatGPT’s early gender gaps have narrowed dramatically, with adoption resembling the general adult population,” the report said. By July 2025, more than half of weekly active users had female first names.

Most people are turning to ChatGPT for everyday tasks. According to the research paper, “the three most common ChatGPT conversation topics are Practical Guidance, Writing, and Seeking Information, collectively accounting for nearly 78% of all messages.”

Writing dominates work-related tasks, while personal queries such as reflection, advice, and play are growing quickly.

OpenAI explained, “About half of messages (49%) are ‘Asking,’ a growing and highly rated category that shows people value ChatGPT most as an advisor rather than only for task completion.”

However, an earlier study by OpenAI and MIT examining ChatGPT’s effects on users’ well-being showed some potentially worrying trends.

Results showed that heavier users of the chatbot, especially those making more personal inquiries, reported greater loneliness, isolation, and reliance on the system. The researchers did not propose solutions, but they demonstrated that sustained, intimate AI interactions can affect mental health.

While ChatGPT drives productivity and guidance, the study also warns that heavy, personal use may impact users’ mental well-being over time.