Abstract:
This study evaluates how effectively three popular large language models (LLMs), GPT-3.5-Turbo, GPT-4o-Mini, and Gemini-2.0-Flash, identify Myers-Briggs Type Indicator (MBTI) personality traits from textual data. Each model was assessed on a dataset of more than 8,000 personality-labeled posts for its ability to classify the individual MBTI dimensions (Introversion-Extraversion, Sensing-Intuition, Thinking-Feeling, Judging-Perceiving) as well as the full four-letter personality type. Gemini-2.0-Flash consistently outperformed the other models across all metrics, achieving the highest exact-match accuracy (72%) and the most balanced F1-scores. In contrast, GPT-3.5-Turbo showed strong biases toward majority classes, particularly when distinguishing Intuition from Sensing, while GPT-4o-Mini improved markedly over GPT-3.5-Turbo but remained less effective overall than Gemini-2.0-Flash. These results underscore the importance of class-sensitive metrics such as recall and F1-score, showing that high accuracy alone can mask substantial model bias, especially on imbalanced datasets. Future work should focus on class balancing, fine-tuning for fairness, and extending the analysis to alternative personality frameworks and more diverse text sources.
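To make the evaluation protocol concrete, the following minimal sketch shows how exact-match accuracy and per-dimension macro-F1 can be computed with scikit-learn; it is illustrative only, and the labels, variable names, and data are hypothetical rather than taken from the paper's actual pipeline.

    # Illustrative sketch (hypothetical data): exact-match accuracy and
    # per-dimension macro-F1 for four-letter MBTI predictions.
    from sklearn.metrics import accuracy_score, f1_score

    # Hypothetical gold labels and model predictions (full four-letter types).
    y_true = ["INTJ", "ENFP", "INTP", "ISTJ", "ENTP"]
    y_pred = ["INTP", "ENFP", "INTP", "ESTJ", "INTP"]

    # Exact-match accuracy: all four letters must agree.
    exact_match = accuracy_score(y_true, y_pred)

    # Treat each letter position (I/E, N/S, T/F, J/P) as a separate binary task.
    for pos, name in enumerate(["I/E", "N/S", "T/F", "J/P"]):
        true_letters = [t[pos] for t in y_true]
        pred_letters = [p[pos] for p in y_pred]
        # Macro averaging weights both classes equally, exposing majority-class
        # bias that plain accuracy can hide on imbalanced data.
        f1 = f1_score(true_letters, pred_letters, average="macro")
        print(f"{name}: macro-F1 = {f1:.2f}")

    print(f"Exact-match accuracy = {exact_match:.2f}")

Because macro-F1 averages performance over both classes of each dimension, a model that always predicts the majority class (e.g., Intuition over Sensing) scores poorly on it even when its raw accuracy looks high, which is the bias pattern discussed above.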