REAL-WORLD SUCCESS

Revolutionizing Document Management: A Case Study on the Value of Document Extraction and Optical Character Recognition (OCR)

Introduction

The client, a government-operated digital preservation library, is well known for its vast collection of historical newspapers. However, they had been struggling with the quality of text recognition in over 30 million newspaper records, which were originally processed using outdated OCR technology from 2005. To improve their digital archives, they sought assistance from experts.

Problem/Goal

The library's existing OCR data was of poor quality, which left many pages difficult to read and greatly reduced the accuracy and accessibility of the archives. The goal was to enhance OCR quality while also reducing processing time and cost, both of which were high.

Solution

Our team at StandardData chose to implement an advanced, open-source OCR model, drastically improving the text recognition quality. This model was particularly effective in transforming previously unreadable pages into clear, searchable text. Additionally, using open-source OCR provides more flexibility compared to a proprietary model that may become obsolete in the near future.
 
To improve processing efficiency, we migrated their system to Amazon Web Services (AWS), using a serverless architecture. This allowed us to distribute the processing across hundreds or even thousands of machines working in parallel, which significantly accelerated the OCR process.
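
The fan-out idea behind this kind of parallel processing can be illustrated with a minimal Python sketch. It is not the code used in the project: a local thread pool stands in for the serverless workers, and every name in it (ocr_page, process_batch, run_parallel, the batch size) is a hypothetical placeholder chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_id: str) -> str:
    """Hypothetical stand-in for running OCR on a single page image."""
    return f"text for {page_id}"


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(batch):
    """Work done by one worker. In a serverless setup this would
    correspond to a single function invocation (an assumption here)."""
    return {page_id: ocr_page(page_id) for page_id in batch}


def run_parallel(page_ids, batch_size=100, max_workers=8):
    """Fan batches out to parallel workers and merge their results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each batch is independent, so batches can run concurrently.
        for partial in pool.map(process_batch, chunk(page_ids, batch_size)):
            results.update(partial)
    return results


if __name__ == "__main__":
    pages = [f"page-{i}" for i in range(250)]
    texts = run_parallel(pages, batch_size=100, max_workers=4)
    print(len(texts))
```

Because each batch is independent of the others, total processing time in a setup like this is bounded mainly by the slowest batch rather than by the size of the whole collection, which is why adding workers shortens the run.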

Results

10X improvement in OCR quality:
Successfully converted previously unreadable records into fully legible text.

About 93% reduction in cost:
Reduced from $300 per batch to $20 per batch of records.

Over 99% reduction in processing time:
Improved batch processing time from 3 weeks to 1 hour.