Evaluating Large Language Models for Emotion Classification and Contextual Word Prediction in NLP

Murat Eser

doi:10.29329/ijiasr.2026.1442.1

Research Article | Open Access
Uluslararası Fen Araştırmalarında Yenilikçi Yaklaşımlar Dergisi 2026, Vol. 10(1), 1-16


Article Type

Research Article

Publication Date

June 07, 2026

Pages

1-16


Abstract

Large Language Models (LLMs) are deep learning models capable of understanding, interpreting, and generating natural language with high accuracy. They make large-scale, complex data easier to process and improve efficiency across natural language processing (NLP) tasks, including text classification, sentiment analysis, contextual understanding, and automatic content generation. This study evaluates the sentiment analysis and contextual understanding capabilities of LLMs in the NLP domain. It contributes to the literature by providing an integrated evaluation that assesses LLMs not only for their generative and classification capabilities but also for their ability to maintain and predict semantic integrity. A balanced dataset covering five emotion categories was used to test ChatGPT 5.3, Claude Sonnet 4.6, and Gemini 3.1 Pro on classification and fill-in-the-blank tasks in a zero-shot setting. In the classification task, model performance was evaluated using accuracy, precision, recall, and F1-score. Claude Sonnet 4.6 performed best, achieving 99.52% accuracy. In the fill-in-the-blank task, the semantic similarity between predicted and original words, and between completed and original sentences, was measured using SBERT and cosine similarity. Here Gemini 3.1 Pro achieved the highest scores: 0.85 for word similarity and 0.94 for sentence similarity. Overall, the examined LLMs performed well on both emotion classification and contextual word prediction, with Claude Sonnet 4.6 stronger in classification and Gemini 3.1 Pro stronger in semantic fill-in-the-blank prediction. These results highlight the potential of LLMs to understand and complete emotion-bearing texts in everyday language, underscoring their importance in NLP research.
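
To make the evaluation measures named in the abstract concrete, the following Python sketch computes accuracy, precision, recall, and F1-score for a multi-class task, plus cosine similarity between two vectors. It is an illustration only, not the article's code: the macro-averaging scheme, the label names, and the toy vectors are assumptions, and in the study the vectors would be SBERT embeddings rather than hand-written lists.

```python
import math
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here; the abstract does not state
    which averaging scheme the article uses.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted label was wrong
            fn[gold] += 1  # the gold label was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical labels and predictions, for illustration only.
gold_labels = ["joy", "anger", "fear", "joy"]
model_labels = ["joy", "anger", "joy", "joy"]
print(classification_metrics(gold_labels, model_labels))

# Toy stand-ins for the embeddings of an original and a predicted word.
original_vec = [0.2, 0.7, 0.1]
predicted_vec = [0.25, 0.6, 0.2]
print(cosine_similarity(original_vec, predicted_vec))
```

Cosine similarity depends only on the direction of the two vectors, not their length, which is why it is a common choice for comparing sentence embeddings.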

Keywords:

Large Language Models, Natural Language Processing, Classification, Semantic Similarity Analysis, Sentiment Classification

Cite this article

Eser, M. (2026). Evaluating Large Language Models for Emotion Classification and Contextual Word Prediction in NLP. Uluslararası Fen Araştırmalarında Yenilikçi Yaklaşımlar Dergisi, 10(1), 1-16. https://doi.org/10.29329/ijiasr.2026.1442.1

