Abstract: |
This study evaluates the application of Natural Language Processing (NLP) techniques for analyzing quarterly earnings call transcripts from Brazilian banks, focusing on a comparison between traditional topic modeling methods—Latent Dirichlet Allocation (LDA) and BERTopic—and advanced Large Language Models (LLMs), including GPT-4-turbo, Llama3, and Qwen2. The research is structured into three benchmark tasks: (1) comparing traditional NLP methods with GPT-4-turbo for unstructured topic modeling, (2) benchmarking GPT-4-turbo, Llama3, and Qwen2 in unstructured topic modeling using the "LLM-as-a-Judge" framework, and (3) evaluating LLMs for structured topic modeling and sentiment analysis on labeled datasets. The results reveal the limitations of traditional models in capturing nuanced, domain-specific content, owing to their reliance on bag-of-words representations and clustering techniques, particularly on small, homogeneous datasets. Conversely, LLMs demonstrated superior performance, leveraging pre-trained architectures to generate contextually rich and coherent outputs without requiring dataset-specific training. Among the LLMs, GPT-4-turbo consistently outperformed the others across tasks, achieving higher scores in coherence, accuracy, and contextual relevance. Open-source models such as Qwen2 showed promise as resource-efficient alternatives, though with less consistency than GPT-4-turbo. The study also highlights the evolving methodologies for evaluating modern NLP models, emphasizing the inadequacy of traditional topic-quality metrics, such as coherence measures like the UMass score, for assessing LLM outputs. By adopting a hybrid evaluation approach—combining structured benchmarks, qualitative assessments, and the "LLM-as-a-Judge" framework—this research provides a comprehensive method for model comparison. These findings underline the transformative potential of LLMs in domain-specific applications and suggest pathways for future advances in NLP evaluation techniques and model scalability.
---