Introduction

The client, a government-operated digital preservation library, is well known for its vast collection of historical newspapers. However, it has struggled with the quality of text recognition across more than 30 million newspaper records, which were originally processed with OCR technology dating from 2005. To improve its digital archives, the library sought outside expertise.

Problem/Goal

The library's existing OCR data was of poor quality, which left many pages difficult to read and undermined the accuracy and accessibility of the archives. The goal was to improve OCR quality while also reducing processing time and cost, both of which were high.

Solution

Our team at StandardData implemented an advanced, open-source OCR model that drastically improved text recognition quality. The model was particularly effective at turning previously unreadable pages into clear, searchable text. An open-source model also offers more flexibility than a proprietary one, which may become obsolete in the near future.

To improve processing efficiency, we migrated the library's system to Amazon Web Services (AWS) and adopted a serverless architecture. This allowed us to distribute the work across hundreds or even thousands of machines running in parallel, which significantly accelerated the OCR process.
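The fan-out idea behind this kind of parallel processing can be illustrated with a small Python sketch. It is not the project's actual code: the names `chunked`, `fake_ocr`, `process_batch`, and `fan_out` are hypothetical, the OCR step is a placeholder, and a local thread pool stands in for the serverless workers that would each handle one batch on AWS.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_ocr(page_id: str) -> str:
    """Hypothetical stand-in for the real OCR call on a single page."""
    return f"text of {page_id}"


def process_batch(batch: list[str], ocr: Callable[[str], str] = fake_ocr) -> dict[str, str]:
    """Work done by one worker: run OCR on every page in its batch."""
    return {page_id: ocr(page_id) for page_id in batch}


def fan_out(page_ids: list[str], batch_size: int, max_workers: int) -> dict[str, str]:
    """Process batches concurrently and merge the per-batch results."""
    results: dict[str, str] = {}
    # Assumption: a thread pool simulates many independent serverless workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(process_batch, chunked(page_ids, batch_size)):
            results.update(partial)
    return results


# Ten illustrative page identifiers, split into batches of three.
pages = [f"page-{n:04d}" for n in range(10)]
texts = fan_out(pages, batch_size=3, max_workers=4)
```

Because each batch is independent of the others, throughput in a setup like this grows roughly with the number of workers until some shared resource, such as storage or the job queue, becomes the bottleneck.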

Revolutionizing Document Management: A Case Study on the Value of Document Extraction and Optical Character Recognition (OCR)
