Evaluation of the quality of intelligent text paraphrasing in Russian

Dagaev A.E., Popov D.I.

UDC 004.89
DOI: 10.26102/2310-6018/2024.47.4.038

This study develops an integral metric for evaluating the quality of text paraphrasing models, addressing the need for comprehensive and objective evaluation methods. Whereas previous research has predominantly relied on English-language datasets, this study emphasizes Russian-language datasets, which remain underexplored. The evaluation uses Gazeta, XL-Sum, and WikiLingua for Russian and CNN/Daily Mail and XSum for English, so the proposed approach is applicable across both languages. The metric combines lexical (ROUGE, BLEU), structural (ROUGE-L), and semantic (BERTScore, METEOR, BLEURT) criteria into a single score, with each component weighted according to its importance. In the results, ChatGPT-4 performs best on the Russian datasets and GigaChat on the English datasets, whereas Gemini and YouChat show limited semantic accuracy regardless of dataset language. The originality of the research lies in integrating multiple metrics into a unified system, which enables more objective and comprehensive comparisons of language models. The study contributes to natural language processing by providing a tool for assessing the quality of language models.
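The weighted combination described in the abstract can be illustrated with a minimal sketch. It assumes each component metric has already been computed and scaled to the range [0, 1]. The weight values, the metric key names, and the function name `integral_score` are hypothetical, chosen only for illustration; the abstract does not state the weights the authors actually used, and their aggregation may differ from a plain weighted average.

```python
from typing import Mapping

# Hypothetical weights for illustration only; the abstract does not
# give the values used in the paper.
EXAMPLE_WEIGHTS = {
    "rouge": 0.15,       # lexical
    "bleu": 0.15,        # lexical
    "rouge_l": 0.15,     # structural
    "bert_score": 0.25,  # semantic
    "meteor": 0.10,      # semantic
    "bleurt": 0.20,      # semantic
}


def integral_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Return the weighted average of per-metric scores.

    Assumes every score is already scaled to [0, 1]. Weights are
    normalized by their sum, so they need not add up to exactly 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up scores for a single hypothetical model, not results from the paper.
    example_scores = {
        "rouge": 0.42,
        "bleu": 0.30,
        "rouge_l": 0.38,
        "bert_score": 0.81,
        "meteor": 0.45,
        "bleurt": 0.55,
    }
    print(f"integral score: {integral_score(example_scores):.3f}")
```

Normalizing by the sum of the weights keeps the result in [0, 1] even if the weights are later retuned, which makes scores comparable across models and datasets.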

Dagaev Alexander Evgenevich

Moscow Polytechnic University

Moscow, Russian Federation

Popov Dmitry Ivanovich
Doctor of Technical Sciences, Professor

Sochi State University

Sochi, Russian Federation

Keywords: natural language processing, text paraphrasing, GigaChat, YandexGPT 2, ChatGPT-3.5, ChatGPT-4, Gemini, Bing AI, YouChat, Mistral Large

For citation: Dagaev A.E., Popov D.I. Evaluation of the quality of intelligent text paraphrasing in Russian. Modeling, Optimization and Information Technology. 2024;12(4). URL: https://moitvivt.ru/ru/journal/pdf?id=1763 DOI: 10.26102/2310-6018/2024.47.4.038 (In Russ.).

Received 05.12.2024

Revised 23.12.2024

Accepted 25.12.2024

Published 31.12.2024