The information is prepared on the basis of data from the RSF information-analytical system; the content is presented as written by the authors. All rights belong to the authors; the use or reprinting of materials is permitted only with the prior consent of the authors.

Project title: Methodology and software framework for developing spreadsheet data extraction systems

Research area 01 - MATHEMATICS, INFORMATICS, AND SYSTEM SCIENCES, 01-509 - Data-mining, databases and knowledge bases

Keywords: Information extraction, data integration, table understanding, unstructured data management, rule-based programming, generative programming

Annotation
A large volume of arbitrary tables presented in spreadsheet-like formats (Excel, CSV, HTML) circulates in the world. Recent estimates (e.g. the Web Data Commons Web Table Corpora or the Dresden Web Table Corpus) show that the number of genuine tables on the Web reaches hundreds of millions (http://webdatacommons.org/webtables). They can contain hundreds of billions of facts. Arbitrary tables are characterized by a great variety and heterogeneity of layouts, styles, and content, as well as by a high rate of growth in volume. This information can be considered Big Data. Arbitrary tables can be a valuable data source in business intelligence and data-driven research. However, the difficulties that inevitably arise in extracting and integrating tabular data often hinder their intensive use in these areas. Typically, the tables are not accompanied by the explicit semantics necessary for machine interpretation of their content as conceived by their authors. Their information is often unstructured and not standardized. Analyzing these data requires their preliminary extraction and transformation into a structured representation with a formal model. Today, researchers and developers who face these tasks resort to general-purpose tools, and they often build their own implementations of the same tasks. In comparison with such approaches, specialized tools can shorten the development time of the target software by hiding inessential details and focusing on the domain. This is especially important when custom or research software for the mass processing of weakly structured data from various types of arbitrary tables must be developed in a short time and with limited resources. The project aims to develop a framework for creating systems that extract data from arbitrary spreadsheet tables.
The problem covers the tasks of automatically recovering the semantic markup of tables, conceptualizing their natural-language content, data cleaning and lineage, generating relational and linked data, and synthesizing tabular data transformation systems based on table analysis and interpretation rules. The novelty of the project consists in developing a theoretical basis and software framework for transforming spreadsheet data from an arbitrary to a relational form based on rule-based and generative programming. The project includes the development of a fundamentally new formal language for table analysis and interpretation in which table transformation rules can be expressed. We also plan to study and implement novel techniques for automatically recovering the semantic markup of short texts presented in tables, for binding extracted data to external conceptual ontologies, and for generating linked data from arbitrary tables. Our framework should organize this process as consecutive steps: role analysis (extracting functional data items), structural analysis (recovering relationships between functional data items), and interpretation (binding recovered labels to external dictionaries). Unlike competing methodologies and tools, we are not limited to a typical table layout; instead, we develop a toolset for generating data transformation programs for different table types.
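The three consecutive steps named above (role analysis, structural analysis, interpretation) can be illustrated with a minimal Python sketch. It is not the project's implementation: the table layout, the `DICTIONARY` mapping, and all function names are assumptions made for illustration only.

```python
import re

# A toy table (assumed layout): the first row holds column labels, the first
# column holds row labels, and the remaining cells hold numeric entries.
TABLE = [
    ["",       "2019", "2020"],
    ["Apples", "10",   "12"],
    ["Pears",  "7",    "9"],
]

# Hypothetical external dictionary used in the interpretation step.
DICTIONARY = {"Apples": "Fruit", "Pears": "Fruit", "2019": "Year", "2020": "Year"}

NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def role_analysis(table):
    """Assign a role ('entry' or 'label') to every non-empty cell."""
    roles = {}
    for r, row in enumerate(table):
        for c, text in enumerate(row):
            if not text:
                continue
            in_body = r > 0 and c > 0
            roles[(r, c)] = "entry" if in_body and NUMERIC.match(text) else "label"
    return roles


def structural_analysis(table, roles):
    """Pair each entry with the labels found in its row and its column."""
    records = []
    for (r, c), role in sorted(roles.items()):
        if role == "entry":
            records.append(
                {"row": table[r][0], "column": table[0][c], "value": table[r][c]}
            )
    return records


def interpretation(records, dictionary):
    """Bind each label to a category taken from the external dictionary."""
    return [
        {
            dictionary.get(rec["row"], "Unknown"): rec["row"],
            dictionary.get(rec["column"], "Unknown"): rec["column"],
            "value": rec["value"],
        }
        for rec in records
    ]


roles = role_analysis(TABLE)
records = interpretation(structural_analysis(TABLE, roles), DICTIONARY)
```

Each resulting record is one relational row, for example the entry 10 paired with the labels Apples (category Fruit) and 2019 (category Year).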

Expected results
The main expected result is a methodology and software framework for creating systems of data extraction from arbitrary spreadsheets. It consists of novel methods and tools for extracting tabular data from heterogeneous unstructured sources and transforming them into a structured form. The results correspond to the state of the art in the area of information extraction. They rely on modern techniques of rule-based and generative programming, linked open data, and table understanding. They make a significant contribution to the current state of research in the area of unstructured data management. For the first time, we propose to develop methods and tools for the synthesis of tabular data transformation systems based on table analysis and interpretation rules. The expected results open new opportunities for intelligent automation of software engineering in scientific and industrial data-intensive applications. The project expands the theoretical knowledge of integrating heterogeneous tabular data. The developed software can be used in practice for data science and business intelligence.

Annotation of the results obtained in 2020
BLOCK 1. TABLE EXTRACTION. An ML-based model for the verification of table candidates was introduced. It is useful at the post-processing stage when there are many false-positive predictions. For example, even a well-trained DNN-model for table detection can have a high recall but a low precision. On "ICDAR 2013 Table Competition", the verification phase allowed us to increase the precision of table detection by 10% when used jointly with our DNN-model. It also added 19% to the precision in conjunction with TableBank, one of the best third-party DNN-models. We improved the scripts for developing DNN-models for table detection (DL4TD) by extending the collection of training and testing samples. Currently, there are 18900 samples collected from 5 different sources (Marmot, ICDAR-2017, UNLV, SciTSR, "ICDAR2019 cTDaR"). We updated the tool for table extraction from untagged PDF documents (TabbyPDF) by integrating it with the final version of our DNN-model and with the verification model for table detection. The implemented solution was evaluated on "ICDAR 2013 Table Competition": the precision was 97% and the recall reached 98%. This is one of the best results among the state-of-the-art academic solutions.
BLOCK 2. TABLE ANALYSIS. We developed software (HeadRecog) implementing algorithms for correcting the physical structure of a spreadsheet table. A study of real-world tables from the SAUS corpora confirmed that, in headers, several physical cells often correspond to one logical cell. Our software matches the machine-readable cell structure against the visual one. As a result, the two structures can be brought closer together by merging physical cells into logical ones. It has been experimentally demonstrated that the correctness of the header cell structure significantly affects the effectiveness of table analysis and interpretation.
The software platform for tabular data extraction and transformation (TabbyXL) was advanced by adding new means of preprocessing, including the implemented HeadRecog algorithms. The effectiveness of the proposed solution was demonstrated on the task of extracting data from statistical tables of the SAUS corpora. A set of 18 user-defined CRL rules was proposed to solve this task. TabbyXL tools allowed us to generate source code that is ready to be built as an executable Java application without any additional modifications. Applying this application to 200 randomly selected SAUS tables led to the following results (in the case of corrected tables): the F1-measure for the extraction of entries and labels was 96.3%, and the F1-measure for the extraction of their relations was 93.7%.
BLOCK 3. TABLE INTERPRETATION. A tool (TabbyLD) was developed for the semantic interpretation of spreadsheets based on a general-purpose knowledge graph (DBpedia). We implemented the generation of linked data as RDF triples from spreadsheets transformed into a canonicalized form. The F1-score for linking cell values to entities of the target knowledge graph was 58% on T2Dv2, a well-known test dataset. We studied the application of the proposed approach and tool (TabbyLD) to the task of domain ontology engineering in the field of industrial safety inspection (ISI). An extension module for PKBD, a knowledge base management system, was implemented. This module supports the construction of domain ontologies at the terminological level by using linked data generated automatically from spreadsheets. The experimental results showed that the proposed approach can be applied to form prototypes of domain ontologies from tabular data extracted from ISI reports.
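The idea of merging several physical header cells into one logical cell can be sketched as follows. This is only an illustration under a simplifying assumption (cells in a header row belong together unless a visible border separates them); it is not the HeadRecog algorithm, and the `Cell` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    right_border: bool  # True if a visible border separates this cell from the next


def _join(run):
    """Combine a run of physical cells into one logical cell."""
    text = " ".join(c.text for c in run if c.text)
    return {"text": text, "span": len(run)}


def merge_header_row(cells):
    """Merge runs of physical cells that are not separated by a visible border."""
    logical, run = [], []
    for cell in cells:
        run.append(cell)
        if cell.right_border:
            logical.append(_join(run))
            run = []
    if run:  # trailing cells without a closing border
        logical.append(_join(run))
    return logical


row = [
    Cell("Population", False),
    Cell("(thousands)", True),
    Cell("Area", True),
]
merged = merge_header_row(row)
```

Here the first two physical cells become a single logical cell with the text "Population (thousands)", while "Area" stays a cell of its own.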
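Generating RDF triples from a table in canonicalized form can be sketched in plain Python by serializing each row as N-Triples lines. The namespace `http://example.org/table/`, the column names, and the function names are assumptions for illustration; this is not the TabbyLD output format, and the literal escaping covers only backslashes and double quotes in single-line values.

```python
BASE = "http://example.org/table/"  # hypothetical namespace, not from the project


def escape(literal):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(records):
    """Serialize canonical records (one dict per row) as N-Triples lines."""
    lines = []
    for i, record in enumerate(records, start=1):
        subject = f"<{BASE}row{i}>"
        for column, value in record.items():
            predicate = f"<{BASE}{column.replace(' ', '_')}>"
            lines.append(f'{subject} {predicate} "{escape(value)}" .')
    return lines


canonical = [
    {"DATA": "10", "Fruit": "Apples", "Year": "2019"},
    {"DATA": "7", "Fruit": "Pears", "Year": "2019"},
]
triples = to_ntriples(canonical)
```

Each row of the canonical table becomes one subject, and each column contributes one triple with the cell value as a plain literal. Linking values to knowledge graph entities would replace some of these literals with IRIs.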

Publications

1. Cherepanov I., Mikhailov A., Shigarov A., Paramonov V. On automated workflow for fine-tuning deep neural network models for table detection in document images 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO), 1130-1133 (year - 2020) https://doi.org/10.23919/MIPRO48935.2020.9245241

2. Dorodnykh N., Yurin A. Towards a universal approach for semantic interpretation of spreadsheets data Proceedings of the 24th Symposium on International Database Engineering & Applications, Article No. 22, 1-9 (year - 2020) https://doi.org/10.1145/3410566.3410609

3. Dorodnykh N., Yurin A. TabbyLD: A tool for semantic interpretation of spreadsheets data Communications in Computer and Information Science, 1341, 315-333 (year - 2021) https://doi.org/10.1007/978-3-030-68527-0_20

4. Mikhailov A., Shigarov A., Rozhkov E., Cherepanov I. On graph-based verification for PDF table detection 2020 Ivannikov ISPRAS Open Conference, 91-95 (year - 2021) https://doi.org/10.1109/ISPRAS51486.2020.00020

5. Paramonov V., Shigarov A., Vetrova V. Table header correction algorithm based on heuristics for improving spreadsheet data extraction Communications in Computer and Information Science, 1283, 147-158 (year - 2020) https://doi.org/10.1007/978-3-030-59506-7_13

6. Yurin A., Dorodnykh N. Personal knowledge base designer: Software for expert systems prototyping SoftwareX, 11, 100411 (year - 2020) https://doi.org/10.1016/j.softx.2020.100411

7. Yurin A., Dorodnykh N. Experimental evaluation of a spreadsheets transformation in the context of domain model engineering 2020 Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT), 0388-0391 (year - 2020) https://doi.org/10.1109/USBEREIT48449.2020.9117674

Annotation of the results obtained in 2018
We developed a heuristics-based algorithm for recovering the physical structure of cells in arbitrary spreadsheet tables. The algorithm examines the text alignment and visual borders of physical cells (tiles) to produce combined logical cells. It thus transforms the structure of physical (syntactic) cells into logical (semantic) cells. Note that the state-of-the-art methods for cell structure recognition deal with low-level documents (bitmaps or print instructions). Unlike them, our approach is intended to correct the high-level physical structure of spreadsheet tables. This allows us to avoid some errors caused by the low-level representation of documents. We developed a prototype of a software library for processing the natural language (NL) content of tables. It relies on free software developed by the "Stanford NLP Group". The prototype extracts some regular named entities from tabular data. The named entities can be involved in the process of table analysis and interpretation. In particular, they enable separating numeric entries from non-numeric labels, grouping labels by their types (entities), and associating extracted data items with external vocabularies (ontologies). We created an artificial neural network (ANN) model for table detection in document images. It is based on fine-tuning of pre-trained models. We used "Faster R-CNN", an object detection architecture. Note that the document engineering community has actively developed deep-learning-based methods for table detection over the last two years. The novelty of our approach consists in studying and applying augmentation of the training dataset for the first time. The augmented dataset allowed us to improve the accuracy of table detection by 5%. The performance evaluation of the model showed a high recall and precision for table detection on a competition dataset. The results are comparable with the best academic solutions.
We advanced CRL, our domain-specific language for table analysis and interpretation rules. This language defines the queries (conditions) and operations (actions) that are necessary to develop programs for transforming spreadsheet data from an arbitrary to a relational form. CRL rules, expressed as productions, map a physical structure of cells (layout, style, and text features) to a logical structure (linked functional data items such as entries, labels, and categories). In comparison with general-purpose rule languages (such as DRL, Jess, RuleML), the advanced version of the language enables expressing rulesets without any instructions for managing the working memory (such as updates of modified facts or blocks on rule re-activation). This syntactically simplifies the declaration of the right-hand side of CRL rules. Our advanced language allows end users to focus more on the logic of table analysis and interpretation than on the logic of rule management and execution. We developed an interpreter of CRL rules. It translates CRL rulesets (declarative programs) into Java source code (imperative programs). The generated source code is ready to be compiled and built into executable programs for domain-specific spreadsheet data extraction and transformation. Many of the existing solutions with similar goals use predefined table models embedded in their internal algorithms. Such systems usually support only a few widespread layout types of tables. Unlike them, our software platform defines a general-purpose table model that does not restrict layout types. It allows user-defined layout, style, and text features of arbitrary tables to be expressed in external CRL rules. In comparison to our competitors, we support not only widespread layout types of arbitrary tables but also specific ones. The empirical results show the applicability of our software platform to the development of programs for spreadsheet data extraction and transformation.
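The notion of a production rule as a condition (query over cell features) paired with an action (operation on the table) can be sketched in Python. This is not CRL syntax and not the generated Java code; the `Cell` and `Rule` types, the two example rules, and `apply_rules` are hypothetical and only show the condition/action split without any explicit working-memory management.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Cell:
    row: int
    col: int
    text: str
    bold: bool = False
    tags: set = field(default_factory=set)


@dataclass
class Rule:
    name: str
    condition: Callable[[Cell], bool]  # left-hand side: query over cell features
    action: Callable[[Cell], None]     # right-hand side: operation on the cell


RULES = [
    Rule(
        "bold cells in the first row are labels",
        lambda c: c.row == 0 and c.bold,
        lambda c: c.tags.add("label"),
    ),
    Rule(
        "cells with digits only are entries",
        lambda c: c.text.isdigit(),
        lambda c: c.tags.add("entry"),
    ),
]


def apply_rules(cells, rules):
    """Fire every rule once for every cell that satisfies its condition."""
    for rule in rules:
        for cell in cells:
            if rule.condition(cell):
                rule.action(cell)
    return cells


cells = apply_rules(
    [Cell(0, 0, "Year", bold=True), Cell(1, 0, "2019"), Cell(1, 1, "42")],
    RULES,
)
```

In this sketch the rule author writes only conditions and actions; the loop in `apply_rules` decides when rules fire, which mirrors the separation between rule logic and rule execution described above.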
We designed an architecture and component model of the software platform for developing programs for spreadsheet data extraction and transformation. As an important part of the architecture, our two-layered table object model was extended to support named entities in both the physical and the logical layer. Note that the existing table models usually rely on predefined functional regions of cells. Unlike them, our model does not associate functions (roles) with cells. Instead, we bind functions to data items (entries and labels) originating from cells. This makes it possible to support a table layout where one cell contains two or more data items. Such a layout feature can be found in bilingual or pivot tables. The novelty of the software platform consists in providing two rule-based ways to implement workflows of spreadsheet data extraction and transformation. In the first case, a ruleset for table analysis and interpretation is expressed in a general-purpose rule language and executed by a JSR-94-compatible rule engine (e.g. Drools or Jess). In the second case, our interpreter translates a ruleset expressed in CRL into Java source code that is compiled and executed using the Java Development Kit. We suggested a novel method of semantic table interpretation based on natural language processing (NLP) techniques and external vocabularies (domain-specific and general-purpose ontologies). It uses an NLP-based algorithm to separate the extracted data items of a table into two types: numeric and non-numeric. Then it links non-numeric data items with concepts (classes, objects, and properties) of an external vocabulary by using semantic similarity. The current version of the method supports only DBpedia as the external ontology, but it can be extended with other vocabularies in the future. On the basis of the method, we developed a prototype of a software tool for generating RDF/OWL linked data from extracted tabular data.
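The two-layered model described above, where roles are bound to data items rather than to cells, can be illustrated with a short Python sketch. The class names, the `/` separator, and the numeric test are assumptions for illustration; the platform's actual object model is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:  # physical layer
    row: int
    col: int
    text: str


@dataclass(frozen=True)
class DataItem:  # logical layer
    role: str     # "entry" or "label"
    value: str
    origin: Cell  # provenance: the cell this item came from


def items_from_cell(cell, separator="/"):
    """Split a cell into one or more data items and assign each of them a role."""
    items = []
    for part in cell.text.split(separator):
        part = part.strip()
        if part:
            is_number = part.replace(".", "", 1).isdigit()
            items.append(DataItem("entry" if is_number else "label", part, cell))
    return items


bilingual = Cell(0, 1, "Total / Gesamt")
items = items_from_cell(bilingual)
```

A single bilingual cell yields two labels that share the same origin, so a layout with several data items per cell needs no special cell-level role.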
The prototype implements the following algorithms: (i) cleansing and formatting tabular data in accordance with DBpedia naming conventions; (ii) creating SPARQL queries to DBpedia; and (iii) linking extracted data items of a table with DBpedia classes, objects, and properties. The prototype was implemented as a web-based application on the Yii2 framework. Linking extracted tabular data with the global structure of linked open data (the LOD cloud) allows third-party software applications to interpret them in terms of external ontologies. Our solution is designed to be a part of the end-to-end process of table understanding implemented by our software platform. In this environment, it can be applied to the semantic interpretation of tables with an arbitrary layout. This is the main difference between our solution and existing methods. The developed components of our platform are published as free software. The source code repositories are listed at https://tabbydoc.github.io.
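Steps (i) and (ii) above can be sketched as follows: a cell value is normalized to a DBpedia-style resource name (first letter capitalized, underscores instead of spaces), and a SPARQL query string is built for the candidate resource. The function names and the choice of an `rdf:type` lookup are assumptions for illustration, not the prototype's actual queries; the sketch only builds the query text and does not send it, and it does not percent-encode special characters.

```python
import re

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def clean_label(text):
    """Normalize a cell value: trim it, collapse whitespace to single
    underscores, and capitalize the first letter."""
    words = re.sub(r"\s+", " ", text.strip())
    if not words:
        return ""
    return (words[0].upper() + words[1:]).replace(" ", "_")


def candidate_uri(text):
    """Build the candidate DBpedia resource IRI for a cell value."""
    return DBPEDIA_RESOURCE + clean_label(text)


def class_query(text, limit=5):
    """Build a SPARQL query asking for the classes (rdf:type) of the candidate."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        f"SELECT ?type WHERE {{ <{candidate_uri(text)}> rdf:type ?type }} "
        f"LIMIT {limit}"
    )


query = class_query("  steam   boiler ")
```

The resulting query text could then be submitted to a SPARQL endpoint; an empty result would indicate that the cell value has no directly matching resource and needs a fuzzier lookup.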

Publications

1. Shigarov A.O., Khristyuk V.V., Paramonov V.V., Yurin A.Yu., Dorodnykh N.O. Toward Framework for Development of Spreadsheet Data Extraction Systems Proc. 1st Workshop on Information Technologies: Algorithms, Models, Systems. CEUR Workshop Proceedings, vol. 2221, pp. 90-96 (year - 2018)

2. - Siberian scientists are among the winners of the "youth" competitions of the 2018 Presidential Program of Research Projects Наука в Сибири (Science in Siberia), No. 26 (3137), p. 7 (year - )

Annotation of the results obtained in 2019
The research in this project year was aimed at developing the theoretical foundations laid down in the first year and at implementing them as software. The research work plan was completed successfully. The research results were published in a number of scientific articles together with supplementary software and data, including artificial neural network models, deep learning pipelines, experimental methodologies, wiki documentation, a demo Docker container, and a CodeOcean capsule. The scientific contribution of the project is that we advanced the theoretical foundations and developed software tools intended for the full cycle of machine understanding of tabular data, including the table extraction, analysis, and interpretation stages, based on contemporary techniques (deep learning, rule engines, natural language processing, and open linked data). The practical significance lies in providing new possibilities for developing software for data extraction and transformation from arbitrary tables represented in print-oriented document (PDF) and spreadsheet (Excel) formats, which are popular for representing scientific, government, and business information. The research results obtained in this project year are briefly described below in the 3 declared blocks.
BLOCK 1. Table extraction. We designed a common workflow for developing deep neural network (DNN) models for table detection in document images based on Faster R-CNN, a well-known DNN architecture. The workflow covers the preparation of training data, the configuration of the training parameters, and the performance evaluation of the resulting models on competition collections. We implemented a set of Python scripts to automate this workflow (https://github.com/tabbydoc/dl4td).
These scripts have allowed us to reduce the effort experts need to study various options for preparing training samples and configuring training parameters, as well as to search for the best of them. Using the proposed automated workflow, we created a new deep neural network model for detecting tables in document images. The performance evaluation of this model showed a high accuracy of predictions of table positions (bounding boxes). Note that the accuracy increased in comparison with the preliminary models obtained in the previous project year. We developed rule-based algorithms for page layout analysis in untagged PDF documents. Our algorithms adapt T-Recs (Kieninger & Dengel, 1998), a well-known technique for clustering text blocks originally intended for processing document images. We extended the original rules of text block composition with new ones that use PDF-specific information (the order of rendering and the formatting of text and graphics). We created a model for the verification of table predictions. Using decision trees, the verifier examines a graph-based representation of the text arrangement inside the bounding box of a predicted table. This allowed us to improve the accuracy of the proposed table detection in untagged PDF documents by reducing errors among predictions. The performance evaluation of our DNN model with the verifier showed a high accuracy of table detection on the competition document collection: the precision is 97% and the recall is 98.4%. These results are comparable with the best contemporary academic solutions.
BLOCK 2. Table analysis. We developed a software library of preprocessing algorithms to assist table analysis and interpretation. Its current version includes NER algorithms for annotating the natural language text of cells, as well as algorithms for cleaning the physical structure of the table header. We expect that this solution can be useful for the main phases of table understanding.
In particular, the recovered NER annotation can be used as additional data to separate table content into functional parts, as well as to categorize extracted data items. The novel algorithms were created for correcting the physical structure of table header cells in accordance with the visual representation of their borders. They allow us to reduce the errors of table analysis and interpretation that arise when the header structure is inappropriate. We refined the translator that converts rules for table analysis and interpretation written in CRL, our domain-specific language, into compiled programs in Java, a general-purpose programming language. The enhancement consists of a simplified grammar of the CRL language, a new object model of CRL rules, and the generation of Java source code. Additionally, we developed a generator of Maven projects to build executable applications for transforming spreadsheet tables to a canonical form. The generated applications have the basic functionality of spreadsheet data canonicalization and can be applied without additional programming effort. Our solution can help speed up software development for rule-driven data extraction and transformation from spreadsheet tables with a complex layout. The implemented tools were integrated with TabbyXL, the software platform for table understanding that we developed in this project. Its source code was published in open access (https://github.com/tabbydoc/tabbyxl), including the accompanying wiki documentation (https://github.com/tabbydoc/tabbyxl/wiki).
BLOCK 3. Table interpretation. We developed algorithms for tabular data conceptualization by using techniques of natural language processing and open linked data. They cover data cleaning, named entity recognition and linking, and the generation of target tables annotated with URL links to concepts of an external knowledge graph (DBpedia).
Using these algorithms, we developed a web-based tool for generating linked data from extracted tabular data. The tool provides semantic annotation for tables in a canonical form with recovered named entities, as well as the synthesis of semantically annotated spreadsheets. We developed a tool for constructing conceptual models (domain ontologies) from linked data extracted from spreadsheet tables. This tool is an extension module (plug-in) for the knowledge management software «Personal Knowledge Base Designer». The preliminary empirical results showed the applicability of this solution to automating the formation of conceptual models from tabular reports produced by industrial safety inspection. Additional information can be found on the site of this research project (http://td.icc.ru). The developed software is published on GitHub (https://tabbydoc.github.io).
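Returning to the verification of table predictions in Block 1, the effect of filtering candidates on precision and recall can be shown with a toy Python sketch. All data here are made up for illustration and are not project results; the stand-in verifier is a plain predicate rather than a decision-tree model, and predictions are matched to ground truth by identifier, whereas a real evaluation would match bounding boxes.

```python
def precision_recall(predictions, ground_truth):
    """Compute precision and recall for sets of table identifiers."""
    true_positives = len(predictions & ground_truth)
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


def verify(predictions, looks_like_table):
    """Keep only the candidates accepted by the verifier."""
    return {p for p in predictions if looks_like_table(p)}


# Toy data (made up): 4 real tables and 6 candidates from a detector.
ground_truth = {"t1", "t2", "t3", "t4"}
candidates = {"t1", "t2", "t3", "t4", "x1", "x2"}

# Stand-in verifier that rejects one false positive; a real verifier would
# inspect the arrangement of text inside the predicted bounding box.
accepted = verify(candidates, lambda p: p != "x1")

before = precision_recall(candidates, ground_truth)
after = precision_recall(accepted, ground_truth)
```

In this toy case the recall stays the same while the precision rises, because the verifier removes a false positive without discarding any true table. A verifier that also rejected true tables would trade recall for precision.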

Publications

1. Cherkashin E., Shigarov A., Paramonov V., Mikhailov A. Digital archives supporting document content inference 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2019 - Proceedings, 1037-1042 (year - 2019) https://doi.org/10.23919/MIPRO.2019.8757196

2. Dorodnykh N., Yurin A. Towards ontology engineering based on transformation of conceptual models and spreadsheet data: a case study Advances in Intelligent Systems and Computing, 1046, 233-247 (year - 2019) https://doi.org/10.1007/978-3-030-30329-7_22

3. Dorodnykh N., Yurin A. Software conception for semantic interpretation CEUR Workshop Proceedings, 2463, 76-83 (year - 2019)

4. Dorodnykh N., Yurin A., Shigarov A. Conceptual model engineering for industrial safety inspection based on spreadsheet data analysis Communications in Computer and Information Science, 1126, 51-65 (year - 2020) https://doi.org/10.1007/978-3-030-39237-6_4

5. Paramonov V., Shigarov A., Vetrova V., Mikhailov A. Heuristic algorithm for recovering a physical structure of spreadsheet header Advances in Intelligent Systems and Computing, 1050, 140-149 (year - 2020) https://doi.org/10.1007/978-3-030-30440-9_14

6. Shigarov A., Cherepanov I., Cherkashin E., Dorodnykh N., Khristyuk V., Mikhailov A., Paramonov V., Rozhkow E., Yurin A. Towards end-to-end transformation of arbitrary tables from untagged portable documents (PDF) to linked data CEUR Workshop Proceedings, 2463, 1-12 (year - 2019)

7. Shigarov A., Khristyuk V., Mikhailov A. TabbyXL: software platform for rule-based spreadsheet data extraction and transformation SoftwareX, 10 (year - 2019) https://doi.org/10.1016/j.softx.2019.100270

8. Shigarov A., Khristyuk V., Mikhailov A., Paramonov V. TabbyXL: rule-based spreadsheet data extraction and transformation Communications in Computer and Information Science, 1078, 59-75 (year - 2019) https://doi.org/10.1007/978-3-030-30275-7_6

9. Shigarov A., Khristyuk V., Mikhailov A., Paramonov V. Software development for rule-based spreadsheet data extraction and transformation 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics, MIPRO 2019 - Proceedings, 1132-1137 (year - 2019) https://doi.org/10.23919/MIPRO.2019.8756829

10. Yurin A., Dorodnykh O. A reverse engineering process for inferring conceptual models from canonicalized tables 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), 0485-0490 (year - 2020) https://doi.org/10.1109/SIBIRCON48586.2019.8958458