A method for information extraction based on extractive question-answering models and strategies for evaluating and aggregating relevant text fragments

Martynyuk P.A.

UDC 004.89
DOI: 10.26102/2310-6018/2026.54.3.008

Abstract

As the volume of heterogeneous textual data grows rapidly, universal approaches to information extraction that do not depend on the structure or domain of the source texts have become particularly important. Despite the widespread adoption of large generative language models, accurate and resource-efficient information extraction from textual data remains an open problem. Generative models have broad capabilities, but they are often excessive for specialized information retrieval tasks, and their results can be hard to interpret. This study is part of a research effort to develop an alternative method for extracting information from unstructured texts in order to build a structural model of a text document. The proposed approach identifies semantically rich text fragments by analyzing their relevance to given thematic aspects of the text. The paper presents an information extraction method that uses an extractive question-answering model and multi-level answer aggregation, which combines strategies for assessing the relevance of text fragments, semantic clustering, and selecting the final answer to a given question. The method identifies the words in the text that are most relevant to the target thematic aspects; these words can then be used to extract reliable information from the document. Experimental results confirm the effectiveness of the method in identifying semantically relevant elements of a text document. The results are of practical value for developing automated systems that construct the semantic structure of a text and can be applied in document analysis, information retrieval, and intelligent text processing.
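
The abstract names three aggregation stages (relevance assessment, semantic clustering, final answer selection) without specifying how each is implemented. The following is a minimal sketch of one possible reading, not the author's method. It assumes candidate answer spans with confidence scores have already been produced by an extractive question-answering model; the `Candidate` type, the `aggregate` function, both thresholds, and the use of token-overlap (Jaccard) similarity in place of true semantic similarity are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str     # answer span returned by a QA model for one text fragment
    score: float  # model confidence for that span, assumed to lie in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def aggregate(candidates, min_score=0.2, min_sim=0.5):
    """Filter candidates by score, group similar ones, pick a final answer."""
    # Stage 1 (assumed): relevance filtering on the model score.
    kept = sorted(
        (c for c in candidates if c.score >= min_score),
        key=lambda c: c.score,
        reverse=True,
    )
    # Stage 2 (assumed): greedy clustering. Each candidate joins the first
    # cluster whose top-scoring member is similar enough, else starts a new one.
    clusters = []
    for cand in kept:
        for cluster in clusters:
            if jaccard(cand.text, cluster[0].text) >= min_sim:
                cluster.append(cand)
                break
        else:
            clusters.append([cand])
    if not clusters:
        return None
    # Stage 3 (assumed): take the cluster with the largest total score and
    # return its highest-scoring member.
    best = max(clusters, key=lambda cl: sum(c.score for c in cl))
    return best[0].text


candidates = [
    Candidate("question answering model", 0.6),
    Candidate("a question answering model", 0.5),
    Candidate("generative models", 0.7),
    Candidate("the", 0.05),
]
answer = aggregate(candidates)
```

In this sketch a span that several fragments agree on can outrank a single higher-scoring span, which is one plausible motivation for clustering before selection. An embedding-based similarity would be a closer match to "semantic clustering" than the token overlap used here.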

Martynyuk Polina Antonovna

Email: martynyuk.pa@bmstu.ru

Bauman Moscow State Technical University

Moscow, Russian Federation

Keywords: natural language processing, information extraction, unstructured text, question-answering model, self-attention mechanism

For citation: Martynyuk P.A. A method for information extraction based on extractive question-answering models and strategies for evaluating and aggregating relevant text fragments. Modeling, Optimization and Information Technology. 2026;14(3). URL: https://moitvivt.ru/ru/journal/article?id=2207 DOI: 10.26102/2310-6018/2026.54.3.008 (In Russ.).

Received 30.01.2026

Revised 07.03.2026

Accepted 17.03.2026

Published 31.03.2026