Abstract:
Understanding people’s emotions is a valuable task with wide-ranging applications across multiple domains. Beyond traditional sentiment analysis of a single data type such as text, the emergence and advancement of multimodal large language models (MLLMs) have broadened this analysis to include emotion recognition in user-generated content (UGC) that combines text and images. This study evaluates the effectiveness of four state-of-the-art multimodal models (VisualBERT, CLIP, Shikra, and Otter) in detecting emotions from both textual and visual data. These models are trained and validated on the EmoReact, AFEW, and SFEW datasets, which capture diverse emotional cues, and are assessed with metrics including accuracy, precision, F1 score, and the CIDEr score, which measures the alignment between model-generated and human-interpreted emotional content.
Following this comparative assessment, we apply the top-performing models to two specific contexts: genre-classified IMDb movie stills and TripAdvisor reviews of Caribbean hotels. This application examines how congruence and incongruence between text and images can enrich insights into user experiences in both contexts. Preliminary results highlight the strengths, weaknesses, and most suitable uses of each model so that it can be applied efficiently to new application contexts. This study suggests that text-image incongruence may provide enriched, multi-faceted consumer insights, which could enhance user experience analysis, particularly in hospitality and media content assessment. The work contributes to advancing multimodal emotion analysis methods and suggests ways to optimize UGC processing for real-world, data-driven applications.