AN INTERACTIVE TOOL FOR EXTRACTING LOW-QUALITY SPREADSHEET TABLES AND CONVERTING INTO RELATIONAL DATABASE

Awad, Arwa; Roushdy, Mohamed Ismail; ElGohary, Rania Abd ElRahman; Moawad, Ibrahim Fathy

doi:10.21608/ijicis.2021.51197.1045

Document Type : Original Article

Authors

¹ Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

² Faculty of Computer and Information Technology, Future University in Egypt, Cairo, Egypt

Abstract

Spreadsheets contain critical information on a wide range of topics and are among the most widely used tools in many domains, with a very large number of users around the world. They are popular because they are convenient and easy to use, support reporting and the presentation of data as charts and graphs, and give their creators a great deal of freedom in how they encode data. Tables account for a large share of spreadsheet data. The growth in the volume and complexity of these tables has increased the need to preserve this data and reuse it. However, spreadsheets are hard to integrate with other data sources. As a result, the data stored in spreadsheets tends to be of low quality.
We present an automated extraction tool that lets ordinary users extract high-quality relational tables from spreadsheets without experience in any programming language. The paper implements novel algorithms based on a heuristic approach for extracting tables from a spreadsheet, and applies data-improvement and quality rules using a domain ontology to convert low-quality semi-structured data into high-quality relational data for reuse and integration. The tool is implemented as a Java program that interfaces with a SQL Server database. Experiments were conducted on two real public datasets. With the proposed approach, the performance improvement in extracting duplicated records is 100% on both datasets, and the percentage of tables successfully identified is 100% and 85%, respectively.

Keywords