Evaluation of the Accuracy and Precision of Large Language Models (LLMs) in User Story Generation for Software
DOI:
https://doi.org/10.35316/jimi.v10i1.48-60

Keywords:
Accuracy, Precision, User Story, Software, Large Language Model (LLM)

Abstract
Generating effective user stories is essential yet time-consuming in software development, especially in large-scale Agile projects. This study evaluates the performance of three Large Language Models (LLMs), ChatGPT-4.0, DeepSeek, and Gemini 2.5, in generating user stories automatically. The objective is to compare their accuracy and precision in order to determine the most suitable model for automating requirements documentation. Using seven test prompts drawn from various industry domains, each model generated user stories that were evaluated with the BLEU-4, ROUGE-L F1, and METEOR metrics. The results show that while all models produced structurally valid outputs, Gemini 2.5 achieved the highest average score (0.386), surpassing DeepSeek (0.355) and ChatGPT (0.348). Gemini 2.5 demonstrated superior consistency, clarity, and semantic completeness. This research contributes a performance benchmark for LLMs in software requirement generation and highlights the practical benefits of LLM-based automation over manual methods, including speed, consistency, and adaptability. Gemini 2.5 is recommended as the optimal model for generating user stories in software engineering contexts.
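For readers who want to reproduce this style of evaluation, the following is a minimal sketch (not the authors' actual pipeline) of how a single generated user story can be scored against a reference with BLEU-4, ROUGE-L F1, and METEOR, and how the three scores can be averaged into a per-model summary. It assumes the open-source nltk and rouge-score Python packages; the example sentences are illustrative only.

```python
# Sketch: scoring one generated user story against a reference with
# BLEU-4, ROUGE-L F1, and METEOR, then averaging the three metrics.
# Requires: pip install nltk rouge-score  (METEOR also needs WordNet data).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # lexical resource used by METEOR

# Illustrative reference (ground-truth) and model-generated user story.
reference = "As a customer, I want to track my order status so that I know when it will arrive."
candidate = "As a customer, I want to see the status of my order so that I know its arrival time."

ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU-4: uniform weights over 1- to 4-gram precision; smoothing avoids
# zero scores on short sentences with no 4-gram overlap.
bleu4 = sentence_bleu(
    [ref_tokens], cand_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L F1: longest-common-subsequence overlap between reference and candidate.
rouge_l = (
    rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    .score(reference, candidate)["rougeL"]
    .fmeasure
)

# METEOR: unigram matching with stemming and WordNet synonym support
# (recent NLTK versions expect pre-tokenized input).
meteor = meteor_score([ref_tokens], cand_tokens)

# One simple way to summarize a model's performance on a prompt.
average = (bleu4 + rouge_l + meteor) / 3
print(f"BLEU-4={bleu4:.3f}  ROUGE-L F1={rouge_l:.3f}  METEOR={meteor:.3f}  avg={average:.3f}")
```

In a full benchmark, these per-story scores would be computed for every prompt and every model, then aggregated per model to obtain comparative averages such as those reported above.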
License
Copyright (c) 2025 Jurnal Ilmiah Informatika

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

