Keywords: patent, data extraction, device components, dependency trees, SAO
Extraction of morphological features of technical systems from Russian patents using dependency tree analysis
UDC 004.853
DOI: 10.26102/2310-6018/2022.39.4.006
The article presents a methodology for extracting morphological features of technical systems in the form of device components and the connections between them. The main section of Russian patent claims is chosen as the subject of the study for data extraction. Information about device components is the most fundamental and important part of that text. It can be used in many tasks of computer-aided patent analysis, while the search for effective approaches to extracting such information is still in progress. In this study, computer-aided development of inventions is considered as the area of application for such data. The aim of the study was to assess the quality of data extraction using dependency tree analysis for the Russian language. The dependency tree is the result of markup by natural language processing tools. Several parsers were chosen for comparison: UDPipe, Stanza, DeepPavlov and spaCy. The output data are presented in the form of semantic SAO (Subject-Action-Object) structures. The quality of data extraction was evaluated using the precision, recall and F1 metrics. For this purpose, 20 patent claims containing 252 SAO structures were manually annotated. Under the current methodological constraints, at best 79% of the SAO structures in the dataset were extracted according to the recall metric under non-strict evaluation, i.e. without accounting for the completeness of noun groups. The F1 score is lower and ranges from 48% to 66% depending on the evaluation type. Conclusions are drawn about the current level of syntactic parser performance within the field of application under review. The results can be useful for developing efficient approaches to extracting structured data from Russian patent collections.
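The difference between strict and non-strict evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `SAO` type, the `head` and `evaluate` helpers, the English toy triples, and the rule that the last word of a noun group is its head are all assumptions made purely for illustration.

```python
from typing import Iterable, NamedTuple, Tuple


class SAO(NamedTuple):
    """One Subject-Action-Object structure (hypothetical container)."""
    subject: str
    action: str
    obj: str


def head(noun_group: str) -> str:
    # Toy assumption: the last word of a noun group is its head noun.
    return noun_group.split()[-1]


def relax(t: SAO) -> SAO:
    # Non-strict view: compare head nouns only, ignoring modifiers.
    return SAO(head(t.subject), t.action, head(t.obj))


def evaluate(predicted: Iterable[SAO], gold: Iterable[SAO],
             strict: bool = True) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over sets of SAO triples."""
    pred_set, gold_set = set(predicted), set(gold)
    if not strict:
        # Note: relaxing can merge triples that differ only in modifiers.
        pred_set = {relax(t) for t in pred_set}
        gold_set = {relax(t) for t in gold_set}
    tp = len(pred_set & gold_set)  # triples found in both sets
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example data, not taken from the annotated patent claims.
gold = [
    SAO("housing", "contains", "rotary shaft"),
    SAO("rotary shaft", "connected to", "electric motor"),
    SAO("cover", "attached to", "housing"),
]
predicted = [
    SAO("housing", "contains", "shaft"),  # incomplete noun group
    SAO("rotary shaft", "connected to", "electric motor"),
    SAO("housing", "connected to", "motor"),  # wrong subject
]

print("strict:    ", evaluate(predicted, gold, strict=True))
print("non-strict:", evaluate(predicted, gold, strict=False))
```

In this toy data the triple with the truncated noun group counts as an error under strict matching and as a hit under non-strict matching, which is why non-strict scores come out higher.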
For citation: Vasiliev S.S., Korobkin D.M., Fomenkov S.A. Extraction of morphological features of technical systems from Russian patents using dependency tree analysis. Modeling, Optimization and Information Technology. 2022;10(4). URL: https://moitvivt.ru/ru/journal/pdf?id=1246. DOI: 10.26102/2310-6018/2022.39.4.006. (In Russ.).
Received 20.10.2022
Revised 15.11.2022
Accepted 25.11.2022
Published 31.12.2022