
Image by Emiliano Vittoriosi, from Unsplash
OpenAI’s New AI Models Can Now “Think” With Images
- Written by Kiara Fabbri, Former Tech News Writer
- Fact-Checked by Sarah Frazier, Former Content Manager
OpenAI has launched o3 and o4-mini, advanced AI models that combine image manipulation with text-based reasoning to solve complex problems.
In a rush? Here are the quick facts:
- These models manipulate, crop, and transform images to solve complex tasks.
- o3 and o4-mini outperform earlier models in STEM questions, visual search, and chart reading.
- The models combine text and image processing, using tools like web search and code analysis.
OpenAI has announced two new AI models, o3 and o4-mini, that can reason with images—marking a major leap in how artificial intelligence understands and processes visual information.
“These systems can manipulate, crop and transform images in service of the task you want to do,” said Mark Chen, OpenAI’s head of research, during a livestream event on Wednesday, as reported by The New York Times.
The o3 and o4-mini models can now analyze images as part of their internal reasoning process, whereas previous models could only view images as input.
The system lets users upload photos of math problems, technical diagrams, handwritten notes, posters, and blurry or rotated images. It breaks the content down into step-by-step explanations, even when a single image contains multiple questions or visual elements.
The system can also zoom in on unclear parts of an image or rotate it for a better view, combining visual understanding with text-based reasoning to deliver precise answers. It can interpret scientific charts and explain what they show, or spot coding errors in screenshots and suggest fixes.
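As a rough illustration of what this capability looks like from a developer's side, the sketch below sends an image and a question to o4-mini through the openai Python SDK's chat completions endpoint. This is not official OpenAI sample code; the image URL and prompt are placeholders, and image-input access on a given account is assumed.

```python
# Hedged sketch, not official OpenAI sample code: one plausible way to ask
# o4-mini to reason over an uploaded image using the openai Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Walk through this math problem step by step."},
                {
                    "type": "image_url",
                    # Placeholder URL standing in for a photo of a handwritten problem.
                    "image_url": {"url": "https://example.com/handwritten-problem.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```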
The models can also use other tools like web search, Python code, and image generation in real time, which allows them to solve much more complex tasks than before. OpenAI says these capabilities come built-in, without needing extra specialized models.
Tests show that o3 and o4-mini outperform previous models across all the visual tasks they were given; on the V* visual search benchmark, o3 reached 95.7% accuracy. However, the models still have flaws: OpenAI states they can overthink problems and make basic perception errors.
OpenAI introduced this update as part of its push to develop AI systems that reason more like humans. The models rely on extended chains of thought, so they take extra time on complex questions. They also integrate tools such as image generation, web search, and Python code analysis to give more precise and creative answers.
However, there are limits. The models sometimes process excessive amounts of information, make perception errors, and shift their reasoning approaches between attempts. The company is working to improve the models’ reliability and consistency.
Both o3 and o4-mini are now available to ChatGPT Plus ($20/month) and Pro ($200/month) users. OpenAI also released Codex CLI, a new open-source tool to help developers run these AI models alongside their own code.
While OpenAI faces legal challenges over content use, its visual reasoning tech shows how AI is getting closer to solving real-world problems in more human-like ways.

Image by Chris Blonk, from Unsplash
AI Model Seeks To Decode What Dolphins Are Saying
- Written by Kiara Fabbri, Former Tech News Writer
- Fact-Checked by Sarah Frazier, Former Content Manager
Researchers studying dolphin communication are now using a new AI model developed by Google to better understand the structure of dolphin vocalizations.
In a rush? Here are the quick facts:
- Google developed an AI model to analyze and generate dolphin vocalizations.
- The model is trained on decades of WDP dolphin communication data.
- DolphinGemma will be open-sourced to support global cetacean communication research.
The model, known as DolphinGemma, analyzes recorded dolphin sounds to identify patterns and generate sequences of dolphin-like vocalizations, as announced Tuesday on Google's blog.
The initiative is a collaboration between Google, the Georgia Institute of Technology, and the Wild Dolphin Project (WDP), a non-profit that has been researching a community of wild Atlantic spotted dolphins in the Bahamas since 1985.
Google reports that the tool is designed to support the study of interspecies communication by identifying patterns within the complex sound sequences dolphins use in the wild. WDP's long-term, non-invasive fieldwork has produced a substantial archive of audio and video, which is now being used to train AI systems.
This includes examples of known sound types—such as signature whistles used between mothers and calves, burst pulses during conflict, and click buzzes in courtship or predator interactions.
The aim is to better understand the structure of these vocalizations and what they may indicate about dolphin cognition and communication.
DolphinGemma builds on this dataset by applying Google's SoundStream audio processing and a 400-million-parameter model architecture to learn and predict dolphin sounds.
Rather than attempting to translate the sounds directly, the model processes sequences of natural dolphin vocalizations and generates new, dolphin-like sounds based on learned patterns.
Google reports that the approach mirrors how large language models handle human language, predicting likely continuations based on prior input.
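To make that analogy concrete, here is a small, purely illustrative PyTorch sketch of next-token prediction over discrete audio tokens. It is not DolphinGemma's actual code or architecture; the vocabulary size, model, and random data are invented stand-ins for SoundStream-style tokens.

```python
# Illustrative only: treat dolphin audio as a sequence of discrete tokens and
# train a model to predict the next token, as an LLM does with text.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024  # assumed size of the discrete audio-token codebook
EMBED_DIM = 256
SEQ_LEN = 128

class NextTokenPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.GRU(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer audio-token IDs
        x = self.embed(tokens)
        hidden, _ = self.encoder(x)
        return self.head(hidden)  # logits for the next token at each position

model = NextTokenPredictor()
# Random batch standing in for real tokenized vocalizations.
batch = torch.randint(0, VOCAB_SIZE, (4, SEQ_LEN))
logits = model(batch[:, :-1])   # predict token t+1 from tokens 0..t
targets = batch[:, 1:]
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()  # one illustrative training step (optimizer omitted)
print(f"illustrative loss: {loss.item():.3f}")
```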
The model is currently being tested in the field by WDP using Pixel smartphones. Researchers hope it will help identify recurring vocal structures and reduce the manual labor involved in parsing large volumes of acoustic data.
The tool may also assist in identifying potential building blocks of communication systems among dolphins. Alongside this effort, WDP and Georgia Tech are developing the CHAT (Cetacean Hearing Augmentation Telemetry) system—a separate interface for limited two-way interaction.
CHAT emits synthetic whistles associated with objects dolphins are known to interact with. The system is designed to detect whether dolphins mimic these whistles, and to alert researchers in real time via underwater headphones.
The latest version, incorporating DolphinGemma, allows for more efficient processing and prediction of these vocalizations in real time.
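The actual CHAT detection pipeline has not been published, but a hypothetical sketch of the basic idea, comparing incoming audio against a stored synthetic-whistle template and raising an alert on a strong match, might look like the following. All names, frequencies, and thresholds are invented for illustration.

```python
# Hypothetical illustration only, not the real CHAT implementation:
# flag a possible mimic by normalized cross-correlation of an incoming clip
# against a stored synthetic-whistle template.
import numpy as np

def detect_mimic(incoming: np.ndarray, template: np.ndarray, threshold: float = 0.7) -> bool:
    """Return True if the incoming clip correlates strongly with the template."""
    incoming = (incoming - incoming.mean()) / (incoming.std() + 1e-8)
    template = (template - template.mean()) / (template.std() + 1e-8)
    corr = np.correlate(incoming, template, mode="valid") / len(template)
    return bool(corr.max() >= threshold)

# Toy example at an 8 kHz sample rate for brevity (real whistles sit higher).
sr = 8_000
t = np.linspace(0, 0.25, int(0.25 * sr), endpoint=False)
template = np.sin(2 * np.pi * 2_000 * t)              # stored synthetic whistle
clip = np.concatenate([np.random.randn(sr) * 0.1,     # a second of background noise
                       template + np.random.randn(len(t)) * 0.1])

if detect_mimic(clip, template):
    print("Possible mimic detected - alert the researcher.")
```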
While the research does not claim to have decoded dolphin language, it represents a step forward in identifying possible structures within their vocal behavior.
Google plans to release DolphinGemma as an open model in the coming months, enabling broader use by researchers working with other cetacean species.