Evaluation of the Accuracy and Precision of Large Language Models (LLMs) in User Story Generation for Software
DOI:
https://doi.org/10.35316/jimi.v10i1.48-60

Keywords:
Accuracy, Precision, User Story, Software, Large Language Model (LLM)

Abstract
Generating effective user stories is essential yet time-consuming in software development, especially in large-scale Agile projects. This study evaluates the performance of three Large Language Models (LLMs), ChatGPT-4.0, DeepSeek, and Gemini 2.5, in generating user stories automatically. The objective is to compare their accuracy and precision in order to determine the most suitable model for automating requirements documentation. Using seven test prompts drawn from various industry domains, each model generated user stories that were evaluated with the BLEU-4, ROUGE-L F1, and METEOR metrics. The results show that while all models produced structurally valid outputs, Gemini 2.5 achieved the highest average score (0.386), surpassing DeepSeek (0.355) and ChatGPT (0.348). Gemini 2.5 demonstrated superior consistency, clarity, and semantic completeness. This research contributes a performance benchmark for LLMs in software requirement generation and highlights the practical benefits of LLM-based automation over manual methods, including speed, consistency, and adaptability. Gemini 2.5 is recommended as the optimal model for generating user stories in software engineering contexts.
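For readers who want to reproduce this style of evaluation, the following is a minimal sketch (not the authors' actual pipeline) of how a single generated user story can be scored against a reference with BLEU-4, ROUGE-L F1, and METEOR, and how the three scores can be averaged into a per-model summary. It assumes the open-source nltk and rouge-score Python packages; the example sentences are illustrative only.

```python
# Sketch: scoring one generated user story against a reference with
# BLEU-4, ROUGE-L F1, and METEOR, then averaging the three metrics.
# Requires: pip install nltk rouge-score  (METEOR also needs WordNet data).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # lexical resource used by METEOR

# Illustrative reference (ground-truth) and model-generated user story.
reference = "As a customer, I want to track my order status so that I know when it will arrive."
candidate = "As a customer, I want to see the status of my order so that I know its arrival time."

ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU-4: uniform weights over 1- to 4-gram precision; smoothing avoids
# zero scores on short sentences with no 4-gram overlap.
bleu4 = sentence_bleu(
    [ref_tokens], cand_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1: longest-common-subsequence overlap between reference and candidate.
rouge_l = (
    rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    .score(reference, candidate)["rougeL"]
    .fmeasure
)

# METEOR: unigram matching with stemming and WordNet synonym support
# (recent NLTK versions expect pre-tokenized input).
meteor = meteor_score([ref_tokens], cand_tokens)

# One simple way to summarize a model's performance on a prompt.
average = (bleu4 + rouge_l + meteor) / 3
print(f"BLEU-4={bleu4:.3f}  ROUGE-L F1={rouge_l:.3f}  METEOR={meteor:.3f}  avg={average:.3f}")
```

In a full benchmark, these per-story scores would be computed for every prompt and every model, then aggregated per model to obtain comparative averages such as those reported above.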
License
Copyright (c) 2025 Jurnal Ilmiah Informatika

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

