Novel Ontologies-based Optical Character Recognition-error Correction Cooperating with Graph Component Extraction
Abstract
literature. Extracting the information contained in a graph benefits readers who need to interpret it, because the significant information presented in the graph becomes accessible. Optical character recognition (OCR) is the typical tool for transforming image-based characters into computer-editable characters. Unfortunately, OCR cannot guarantee perfect results, because it is sensitive to noise and input quality. This is a serious problem, because misrecognized text gives readers incorrect information and leads to miscommunication. In this study, we present a novel method for correcting OCR errors in bar graphs using semantics, namely ontologies and dependency parsing. In addition, we apply the graph component extraction proposed in our previous study to remove irrelevant parts of the graph components, which cleans and prepares the input data for OCR-error correction. The main objectives of this paper are to extract significant information from graphs using OCR and to correct OCR errors using semantics. Our method performed well, achieving the highest accuracies and F-measures. Moreover, we observed that the input data contained less noise owing to the effectiveness of our graph component extraction. Based on this evidence, we conclude that our solution to the OCR problem achieves these objectives.
Keywords
OCR-error correction, post-processing, dependency parsing, ontology, graph-component extraction
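As a rough illustration of vocabulary-driven OCR post-processing, the sketch below replaces a misrecognized token with the closest term from a small word list. It is not the method proposed in this paper, which relies on ontologies and dependency parsing; it only shows the simpler idea of matching OCR output against known terms. The list `ONTOLOGY_TERMS`, the function names, and the similarity cutoff of 0.8 are all assumptions made for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabulary standing in for labels taken from a domain ontology.
ONTOLOGY_TERMS = ["population", "temperature", "rainfall", "year", "country"]


def correct_token(token, vocabulary=ONTOLOGY_TERMS, cutoff=0.8):
    """Return the closest vocabulary term, or the token unchanged if none is close enough."""
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops any candidate scoring below the cutoff.
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(ocr_text):
    """Correct each whitespace-separated token of an OCR string independently."""
    return " ".join(correct_token(token) for token in ocr_text.split())
```

For instance, `correct_token("popu1ation")` maps the token to the vocabulary entry `population`, while a token with no close neighbour in the list is left untouched. The cutoff controls the trade-off: a lower value corrects more errors but risks replacing words that were recognized correctly.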