INFORMATION ABOUT PROJECT,
SUPPORTED BY RUSSIAN SCIENCE FOUNDATION
The information is prepared on the basis of data from the information-analytical system RSF, informative part is represented in the author's edition. All rights belong to the authors, the use or reprinting of materials is permitted only with the prior consent of the authors.
Project title: Methodology and software framework for developing spreadsheet data extraction systems
Project Lead: Shigarov Alexey
Affiliation: Matrosov Institute for System Dynamics and Control Theory SB RAS
Implementation period: 2018 - 2020
Research area: 01 - MATHEMATICS, INFORMATICS, AND SYSTEM SCIENCES, 01-509 - Data-mining, databases and knowledge bases
Keywords: Information extraction, data integration, table understanding, unstructured data management, rule-based programming, generative programming
A large volume of arbitrary tables in spreadsheet-like formats (Excel, CSV, HTML) circulates in the world. Recent estimates (e.g. the Web Data Commons Web Table Corpora or the Dresden Web Table Corpus) show that the number of genuine tables on the Web reaches hundreds of millions (http://webdatacommons.org/webtables). They can contain hundreds of billions of facts. Arbitrary tables are characterized by a wide variety and heterogeneity of layouts, styles, and content, as well as a high rate of growth in volume. This information can be considered Big Data. Arbitrary tables can be a valuable data source in business intelligence and data-driven research. However, the difficulties that inevitably arise in extracting and integrating tabular data often hinder their intensive use in these areas. Typically, such tables are not accompanied by the explicit semantics necessary for machine interpretation of their content as conceived by their author. Their information is often unstructured and not standardized. Analyzing these data requires their preliminary extraction and transformation to a structured representation with a formal model. Today, researchers and developers who face these tasks resort to general-purpose tools and often write their own implementations of the same tasks. In comparison, specialized tools can shorten the development time of the target software by hiding inessential details and focusing on the domain. This is especially important when custom or research software for mass processing of weakly structured data from various types of arbitrary tables must be developed in a short time and with limited resources. The project aims to develop a framework for creating systems for data extraction from arbitrary spreadsheet tables.
The problem covers the tasks of automatically recovering the semantic markup of tables, conceptualizing their natural-language content, data cleaning and lineage, generating relational and linked data, and synthesizing tabular data transformation systems based on table analysis and interpretation rules. The novelty of the project consists in developing a theoretical basis and software framework for transforming spreadsheet data from an arbitrary to a relational form based on rule-based and generative programming. The project includes the development of a fundamentally new formal language for table analysis and interpretation that should allow table transformation rules to be expressed. We also plan to study and implement novel techniques for automatically recovering the semantic markup of short texts presented in tables, for binding extracted data to external conceptual ontologies, and for generating linked data from arbitrary tables. Our framework should organize this process as consecutive steps: role analysis (extracting functional data items), structural analysis (recovering relationships between functional data items), and interpretation (binding recovered labels to external dictionaries). Compared to competing methodologies and tools, we are not limited to a typical table layout; instead, we develop a toolset for generating data transformation programs for different table types.
The main expected result is a methodology and software framework for creating systems for data extraction from arbitrary spreadsheets. It consists of novel methods and tools for extracting tabular data from heterogeneous unstructured sources and transforming them into a structured form. The results correspond to the state of the art in the area of information extraction. They rely on modern techniques of rule-based and generative programming, linked open data, and table understanding. They make a significant contribution to the current state of research in the area of unstructured data management. For the first time, we propose to develop methods and tools for the synthesis of tabular data transformation systems based on table analysis and interpretation rules. The expected results open up new opportunities for the intellectualization of software engineering in scientific and industrial data-intensive applications. The project expands theoretical knowledge of the integration of heterogeneous tabular data. The developed software can be used in practice for data science and business intelligence.
Annotation of the results obtained in 2018
We developed a heuristic-based algorithm for recovering the physical structure of cells in arbitrary spreadsheet tables. The algorithm examines the text alignment and visual borders of physical cells (tiles) to produce combined logical cells. It thus transforms the structure of physical (syntactic) cells into logical (semantic) cells. Note that state-of-the-art methods for cell structure recognition deal with low-level documents (bitmaps or print instructions). Unlike them, our approach is intended to correct the high-level physical structure of spreadsheet tables. This allows us to avoid some errors caused by the low-level representation of documents. We developed a prototype of a software library for processing the natural-language (NL) content of tables. It relies on free software developed by the Stanford NLP Group. The prototype extracts some regular named entities from tabular data. The named entities can be used in table analysis and interpretation. In particular, they enable separating numeric entries from non-numeric labels, grouping labels by their types (entities), and associating extracted data items with external vocabularies (ontologies). We created an artificial neural network (ANN) model for table detection in document images. It is based on fine-tuning pre-trained models. We used Faster R-CNN, an object detection architecture. Note that the document engineering community has actively developed deep-learning-based methods for table detection over the last two years. The novelty of our approach consists in studying and applying augmentation of the training dataset for the first time. The augmented dataset allowed us to improve the accuracy of table detection by 5%. The performance evaluation of the model showed high recall and precision for table detection on a competition dataset. The results are comparable with the best academic solutions.
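To make the cell-structure recovery step at the start of this annotation more concrete, the following is a minimal sketch of one border-based heuristic, not the project's actual algorithm. It assumes that a tile without a visible bottom border continues into the tile directly below it. The names Tile, merge_column, and combine are hypothetical, and text alignment, which the real algorithm also examines, is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """A physical cell as stored in the spreadsheet (hypothetical model)."""
    row: int
    col: int
    text: str
    border_bottom: bool  # True if a visible line is drawn under the tile


def combine(group):
    """Build one logical cell from a run of vertically adjacent tiles."""
    text = " ".join(t.text for t in group if t.text)
    return {"col": group[0].col, "top": group[0].row,
            "bottom": group[-1].row, "text": text}


def merge_column(tiles):
    """Merge the tiles of one column into logical cells.

    Assumed heuristic (illustration only): a tile with no bottom border
    belongs to the same logical cell as the tile below it.
    """
    logical, current = [], []
    for tile in sorted(tiles, key=lambda t: t.row):
        current.append(tile)
        if tile.border_bottom:  # a border closes the current logical cell
            logical.append(combine(current))
            current = []
    if current:  # trailing tiles that were never closed by a border
        logical.append(combine(current))
    return logical


# Illustrative input: a two-line header split across two physical cells.
column = [
    Tile(0, 0, "Total", border_bottom=False),
    Tile(1, 0, "revenue", border_bottom=True),
    Tile(2, 0, "2018", border_bottom=True),
]
cells = merge_column(column)
```

Here the first two tiles are combined into a single logical cell spanning rows 0 to 1, while the third tile stays a cell of its own.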
We advanced CRL, our domain-specific language for table analysis and interpretation rules. The language defines the queries (conditions) and operations (actions) needed to develop programs for transforming spreadsheet data from an arbitrary to a relational form. CRL rules, expressed as productions, map a physical structure of cells (layout, style, and text features) to a logical structure (linked functional data items such as entries, labels, and categories). In comparison with general-purpose rule languages (such as DRL, Jess, or RuleML), the advanced version of the language allows rulesets to be expressed without any instructions for managing the working memory (such as updates of modified facts or blocks on rule re-activation). This syntactically simplifies the declaration of the right-hand side of CRL rules. Our advanced language allows end users to focus more on the logic of table analysis and interpretation than on the logic of rule management and execution. We developed an interpreter of CRL rules. It translates CRL rulesets (declarative programs) to Java source code (imperative programs). The generated source code is ready for compilation and building of executable programs for domain-specific spreadsheet data extraction and transformation. Many existing solutions with similar goals use predefined table models embedded in their internal algorithms. Such systems usually support only a few widespread table layout types. Unlike them, our software platform defines a general-purpose table model that does not restrict layout types. It allows user-defined layout, style, and text features of arbitrary tables to be expressed in external CRL rules. In comparison to our competitors, we support not only widespread layout types of arbitrary tables but also specific ones. The empirical results show the applicability of our software platform to the development of programs for spreadsheet data extraction and transformation.
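The idea of productions that map cell features to functional data items can be illustrated with a small sketch. This is not CRL syntax and not the code that the interpreter generates; it is a plain Python analogue under simplifying assumptions: labels sit in the first row, entries are numeric cells below them, and rules fire once per cell in declaration order. All names (Cell, Rule, run_rules, and the two example rules) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Cell:
    row: int
    col: int
    text: str


@dataclass
class Rule:
    """A production: when the condition holds for a cell, run the action."""
    name: str
    condition: Callable[[Cell], bool]
    action: Callable[[Cell, dict], None]


def parse_number(text):
    """Return the numeric value of a cell text, or None if it is not numeric."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def add_label(cell, facts):
    facts["labels"][cell.col] = cell.text


def add_entry(cell, facts):
    # Link the entry to the label found in the same column, if any.
    facts["entries"].append({"value": parse_number(cell.text),
                             "label": facts["labels"].get(cell.col)})


RULES = [
    Rule("header-row-labels", lambda c: c.row == 0, add_label),
    Rule("numeric-entries",
         lambda c: c.row > 0 and parse_number(c.text) is not None,
         add_entry),
]


def run_rules(cells, rules=RULES):
    """Apply each rule to every cell; no working-memory bookkeeping is needed."""
    facts = {"labels": {}, "entries": []}
    for rule in rules:
        for cell in cells:
            if rule.condition(cell):
                rule.action(cell, facts)
    return facts


# Illustrative table with made-up values.
table = [Cell(0, 0, "Region"), Cell(0, 1, "Revenue"),
         Cell(1, 0, "North"), Cell(1, 1, "1,250")]
facts = run_rules(table)
```

The rule bodies contain only the analysis logic; the loop in run_rules plays the role that rule management plays in a general-purpose engine, which is the separation of concerns the paragraph above describes.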
We designed an architecture and component model of the software platform for developing programs for spreadsheet data extraction and transformation. As an important part of the architecture, our two-layered table object model was extended to support named entities in both the physical and logical layers. Note that existing table models usually rely on predefined functional regions of cells. Unlike them, our model does not associate functions (roles) with cells. Instead, we bind functions to data items (entries and labels) originating from cells. This makes it possible to support a table layout where one cell contains two or more data items. Such a layout feature can be found in bilingual or pivot tables. The novelty of the software platform consists in providing two rule-based ways to implement workflows of spreadsheet data extraction and transformation. In the first case, a ruleset for table analysis and interpretation is expressed in a general-purpose rule language and executed by a JSR-94-compatible rule engine (e.g. Drools or Jess). In the second case, our interpreter translates a ruleset expressed in CRL to Java source code that is compiled and executed by the Java development kit. We suggested a novel method of semantic table interpretation based on natural language processing (NLP) techniques and external vocabularies (domain-specific and general-purpose ontologies). It uses an NLP-based algorithm to separate the extracted data items of a table into two types: numeric and non-numeric. Then it links non-numeric data items with concepts (classes, objects, and properties) of an external vocabulary by using semantic similarity. The current version of the method supports only DBpedia as such an external ontology, but it can be extended with other vocabularies in the future. On the basis of the method, we developed a prototype of a software tool for generating RDF/OWL linked data from extracted tabular data.
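The item-level binding of roles in the two-layered table model described above can be sketched as a small data structure. This is an illustration under assumptions, not the platform's object model: the class names PhysicalCell and DataItem, the function items_from_cell, and the use of "/" as a separator between the two languages of a bilingual cell are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalCell:
    """Physical layer: a cell with its position and raw text."""
    row: int
    col: int
    text: str


@dataclass(frozen=True)
class DataItem:
    """Logical layer: a value with a role, traced back to its source cell."""
    role: str  # "entry" or "label"
    value: str
    source: PhysicalCell


def items_from_cell(cell, role, separator="/"):
    """Split one cell into several data items that share the same role.

    The separator is an assumption for this example; a real ruleset would
    decide how to split a cell from its layout, style, and text features.
    """
    parts = [part.strip() for part in cell.text.split(separator)]
    return [DataItem(role, part, cell) for part in parts if part]


# A bilingual header cell yields two labels from a single physical cell.
header = PhysicalCell(0, 1, "Population / Население")
labels = items_from_cell(header, "label")
```

Because the role lives on DataItem rather than on PhysicalCell, both labels keep a reference to the same source cell, which a region-based model that assigns one function per cell cannot express.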
The prototype implements the following algorithms: (i) cleansing and formatting tabular data in accordance with DBpedia naming conventions; (ii) creating queries to DBpedia in the SPARQL language; and (iii) linking extracted data items of a table with DBpedia classes, objects, and properties. The prototype was implemented as a web-based application on the Yii2 framework. Linking extracted tabular data with the global structure of linked open data (the LOD cloud) allows them to be interpreted in terms of external ontologies in third-party software applications. Our solution is designed to be part of an end-to-end table understanding process implemented by our software platform. In this environment, it can be applied to the semantic interpretation of tables with an arbitrary layout. This is the main difference between our solution and existing methods. The developed components of our platform are published as free software. The source code repositories are listed at https://tabbydoc.github.io.
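Steps (i) and (ii) above can be sketched as follows. This is a simplified illustration, not the prototype's code (which is a Yii2 web application): it assumes that a label is normalized to UpperCamelCase for a DBpedia ontology class and to lowerCamelCase for a property, and it only builds the SPARQL query text without sending it anywhere. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Split a raw label into alphanumeric words, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def to_class_name(text):
    """Format a label like a DBpedia ontology class, e.g. 'PopulatedPlace'."""
    return "".join(w[:1].upper() + w[1:].lower() for w in tokenize(text))


def to_property_name(text):
    """Format a label like a DBpedia ontology property, e.g. 'birthPlace'."""
    name = to_class_name(text)
    return name[:1].lower() + name[1:]


def build_label_query(local_name):
    """Build a SPARQL query asking for the English label of a dbo: term.

    An empty result for this query would suggest that the candidate term
    does not exist in the ontology (assumed matching strategy).
    """
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?label WHERE {{ dbo:{local_name} rdfs:label ?label . "
        'FILTER (lang(?label) = "en") } LIMIT 1'
    )


class_name = to_class_name(" populated  place ")
property_name = to_property_name("Birth place")
query = build_label_query(class_name)
```

The resulting query string could then be submitted to a SPARQL endpoint by whatever HTTP client the application uses; step (iii), choosing the best match by semantic similarity, is outside the scope of this sketch.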
1. Shigarov A.O., Khristyuk V.V., Paramonov V.V., Yurin A.Yu., Dorodnykh N.O. Toward Framework for Development of Spreadsheet Data Extraction Systems. Proc. 1st Workshop on Information Technologies: Algorithms, Models, Systems. CEUR Workshop Proceedings, vol. 2221, pp. 90-96 (year - 2018).
2. - Siberian scientists among the winners of the "youth" competitions of the 2018 Presidential Program of Research Projects. Nauka v Sibiri, No. 26 (3137), p. 7 (year - ).