What do you do if you want a computer to read the text in images and videos? At CentERdata we use Optical Character Recognition (OCR). OCR applies wherever text has to be extracted from media that are not digitally readable. Examples include automatic number-plate recognition, automatic processing of PDFs, and digitizing scanned or photographed documents.
OCR is widely used to digitize collections of written works, such as books, reports, and old archives. Libraries, national archives, and museums in particular use OCR and similar techniques for this purpose. Google is also digitizing books on a massive scale in collaboration with publishers. However, OCR should not be equated with Handwritten Text Recognition (HTR), which is much more complicated than recognizing printed text because handwriting varies so widely, especially in ancient writings.
In OCR, the locations of text in an image are detected first. The detected text is then segmented into individual characters. Finally, the characters are classified, which is the heart of text recognition. At least two stages of deep learning are therefore involved: one for detecting text and one for classifying characters. Deep learning can also be applied to character segmentation, but this step is usually done with projection segmentation, which splits a line of text at the gaps in its pixel projection profile.
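To make the segmentation step concrete, here is a minimal sketch of projection segmentation in Python with NumPy. It assumes the text line has already been binarized (1 for ink, 0 for background) and that characters are separated by at least one empty pixel column. The function name `segment_columns` is chosen for this illustration only and is not part of any tool described here.

```python
import numpy as np

def segment_columns(binary_line):
    """Split a binarized text line into character spans using a vertical projection.

    binary_line: 2D array, 1 where there is ink, 0 for background.
    Returns a list of (start, end) column indices, with end exclusive.
    """
    # Count ink pixels in every column; empty columns are gaps between characters.
    profile = np.asarray(binary_line).sum(axis=0)
    has_ink = profile > 0

    spans = []
    start = None
    for col, ink in enumerate(has_ink):
        if ink and start is None:
            start = col                      # a character begins here
        elif not ink and start is not None:
            spans.append((start, col))       # the character ends at the first empty column
            start = None
    if start is not None:
        spans.append((start, len(has_ink)))  # character touching the right edge
    return spans

# Toy line: a two-column "character", a blank column, and a one-column "character".
line = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
])
spans = segment_columns(line)
```

For the toy line, `spans` is `[(0, 2), (3, 4)]`. Touching or overlapping characters produce no empty column, which is one reason a learned segmentation model may be preferred for difficult material.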
CentERdata is experienced in applying OCR techniques. Besides character recognition, the complete OCR process involves many other techniques. An important one is preprocessing a photo or document so that the text becomes properly machine-readable. The image is normalized by correcting brightness, noise, contrast, and orientation.
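The sketch below illustrates three common normalization steps on a grayscale image held in a NumPy array: contrast stretching, noise suppression with a 3x3 median filter, and binarization. It is an illustration under simple assumptions (8-bit grayscale input, a fixed threshold of 128), not the pipeline used in practice, and it leaves out orientation correction. All function names are hypothetical.

```python
import numpy as np

def stretch_contrast(gray):
    """Rescale grayscale values linearly to the full 0..255 range."""
    gray = np.asarray(gray, dtype=np.float64)
    lo, hi = gray.min(), gray.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros_like(gray, dtype=np.uint8)
    return np.round((gray - lo) / (hi - lo) * 255).astype(np.uint8)

def median_denoise(gray):
    """Apply a 3x3 median filter to suppress isolated noise pixels."""
    gray = np.asarray(gray)
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the nine shifted views of the image and take the per-pixel median.
    windows = [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0).astype(gray.dtype)

def binarize(gray, threshold=128):
    """Return 1 for dark (ink) pixels and 0 for light (background) pixels."""
    return (np.asarray(gray) < threshold).astype(np.uint8)

# A dim, low-contrast patch with a single darker pixel in the middle.
patch = np.array([[100, 100, 100],
                  [100,  60, 100],
                  [100, 100, 100]], dtype=np.uint8)
stretched = stretch_contrast(patch)
```

After stretching, the patch spans the full range from 0 to 255, so a fixed threshold separates ink from background. Note that the median filter treats the single dark pixel as noise and removes it. This is the trade-off with any denoising step: thin strokes and small punctuation marks can disappear along with the noise.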
After the text has been read, post-processing takes place, for example automatic spelling correction and an evaluation of the recognition accuracy.
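As an illustration of both post-processing steps, the sketch below corrects words against a small vocabulary with Python's standard `difflib` module and measures accuracy as a character error rate (edit distance divided by the length of a reference transcription). The vocabulary, the similarity cutoff of 0.8, and the function names are assumptions made for this example. A production system would typically use a much larger dictionary or a language model.

```python
import difflib

def correct_word(word, vocabulary, cutoff=0.8):
    """Replace a word by its closest vocabulary entry, if one is similar enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def character_error_rate(recognized, reference):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(recognized, reference) / len(reference)

# Typical OCR confusions: the digit 1 for the letter l, and 0 for o.
vocabulary = ["annual", "report", "revenue"]
raw = "annua1 rep0rt"
corrected = " ".join(correct_word(w, vocabulary) for w in raw.split())
```

Here `corrected` becomes "annual report". Against that reference, the raw output has a character error rate of 2/13 and the corrected output a rate of 0. The error rate can only be computed where a manually verified transcription is available, so in practice it is usually estimated on a sample of documents.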
Information extraction from thousands of PDFs
Companies are required to submit annual reports. These are available online and sometimes date back several decades. Extracting information from these reports, especially from scanned PDFs, can be a daunting task. The information has to be extracted quickly and with high accuracy. This takes considerable technical expertise that goes beyond knowledge of and skill in OCR.
For this project we work together with the Foundation for Auditing Research (FAR). As data science consultants, we are developing and improving a tailor-made OCR tool for automatically reading and digitizing thousands of PDFs (annual reports, figures, images, board reports, etc.). The aim is for the tool to be applied each year to newly published reports. After the reports have been read accurately, the next step is to extract specific information from these documents using text mining techniques.
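What such an extraction step might look like in its simplest form is sketched below with a regular expression. The labels ("total assets", "net revenue"), the currency markers, and the number format (thousands separated by commas or periods, no decimals) are hypothetical and chosen only for illustration. They do not describe the fields or methods used in the FAR project.

```python
import re

# Hypothetical pattern: a label, optional punctuation and currency marker,
# then an amount with thousands separators such as "1,234,567" or "98.765".
AMOUNT_PATTERN = re.compile(
    r"(?P<label>total assets|net revenue)\s*[:\-]?\s*(?:EUR|€)?\s*"
    r"(?P<amount>\d{1,3}(?:[.,]\d{3})*)",
    re.IGNORECASE,
)

def extract_amounts(text):
    """Return {label: integer amount} for every labelled amount found in the text."""
    results = {}
    for match in AMOUNT_PATTERN.finditer(text):
        label = match.group("label").lower()
        digits = re.sub(r"[.,]", "", match.group("amount"))  # drop thousands separators
        results[label] = int(digits)
    return results

# Two lines as they might appear in recognized report text.
sample = "Total assets: EUR 1,234,567\nNet revenue 98.765"
amounts = extract_amounts(sample)
```

For the sample text, `amounts` maps "total assets" to 1234567 and "net revenue" to 98765. Fixed patterns like this break easily when layouts and wording differ between reports, which is why real extraction usually combines several text mining techniques.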