New study reveals AI’s breakthrough in decoding human emotions with multimodal precision
A new study from the University of Cambridge and Hunan University has tested how well advanced AI models recognize human emotions in real-life situations. For the first time, researchers evaluated 19 leading multimodal systems to see which methods work best. Their findings highlight that combining audio, video, and text delivers the most accurate results—especially when analyzing complex emotional expressions.
The team examined both open-source and closed-source models, discovering that freely available systems perform nearly as well as proprietary ones. Unlike earlier studies that focused on basic emotions, this research explored a wide, open-ended range of expressions.
Video data played a crucial role in improving accuracy, outperforming audio or text alone. Models such as Google's Gemini, which processes text, images, audio, and video, were among those assessed, while Alibaba's Qwen series and DeepSeek's reinforcement-learning models also showed strong reasoning and multimodal capabilities.

Beyond raw performance, the study examined techniques for refining AI reasoning. Methods such as chain-of-thought prompting, self-consistency checks, and step-by-step refinement were tested to see how they influence emotion recognition. These strategies aim to make models more transparent and reliable in interpreting human feelings.
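To make those prompting strategies concrete, here is a minimal sketch of chain-of-thought prompting combined with self-consistency voting for an emotion-labeling task. The `query_model` stub, the prompt wording, and the `Emotion: <label>` output format are illustrative assumptions, not the study's actual prompts or model interfaces.

```python
from collections import Counter

# Hypothetical stand-in for a multimodal model API call; the study's actual
# model interfaces (Gemini, Qwen, DeepSeek, etc.) are not reproduced here.
def query_model(prompt: str, video_path: str, temperature: float = 0.7) -> str:
    """Return the model's free-text answer for a video clip plus prompt (stub)."""
    raise NotImplementedError("Replace with a real multimodal model call.")

# Chain-of-thought prompt: ask the model to reason step by step over facial
# expression, tone of voice, and spoken words before committing to one label.
COT_PROMPT = (
    "Watch the clip and describe, step by step, the facial expressions, "
    "tone of voice, and spoken words. Then state the most likely emotion "
    "on a final line formatted as 'Emotion: <label>'."
)

def extract_label(answer: str) -> str:
    """Pull the final 'Emotion: <label>' line out of a chain-of-thought answer."""
    for line in reversed(answer.strip().splitlines()):
        if line.lower().startswith("emotion:"):
            return line.split(":", 1)[1].strip().lower()
    return "unknown"

def self_consistent_emotion(video_path: str, n_samples: int = 5) -> str:
    """Sample several reasoning chains and return the majority-vote label."""
    labels = [
        extract_label(query_model(COT_PROMPT, video_path, temperature=0.7))
        for _ in range(n_samples)
    ]
    return Counter(labels).most_common(1)[0][0]
```

In this sketch, sampling several independent reasoning chains at a nonzero temperature and taking the majority label is what makes the prediction "self-consistent": occasional faulty chains get outvoted rather than dictating the final answer.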
The research confirms that multimodal AI has made significant strides in understanding emotions. By integrating video, audio, and text, models achieve higher accuracy than single-input systems. Open-source alternatives now rival closed-source tools, broadening access to advanced emotion-recognition technology.