REAL-WORLD SUCCESS

OCR Processing Dataflow at Tremendous Scale

Introduction

A federal agency required an efficient solution for large-scale Optical Character Recognition (OCR) processing of historical newspapers. The existing OCR text for these newspapers was of poor quality, which caused significant performance and accessibility issues within the data set. Academic researchers use this data set frequently and rely on its accuracy for in-depth analysis of historical material. Benefits that StandardData provided to the agency included:
 
  • Accuracy: The OCR implementation handled complex layouts and made text more searchable with fewer errors
  • Speed: Over 350 GB of images was processed in hours instead of weeks
  • Savings: The cost of running OCR went down, as did storage and infrastructure costs

Problem/Goal

Running OCR on historical newspapers at very large scale is a formidable challenge. The complex layouts and intricate formatting inherent to newspapers make OCR particularly difficult. In response, our team selected the OCR technology best suited to the task and designed a parallelized processing pipeline. The solution was built on services from a leading government cloud products vendor.

Solution

Our team's expertise enabled us to draw on relevant open-source technologies while meeting our client's technical and organizational requirements. By architecting a robust distributed processing pipeline, we parallelized the OCR workflow and delivered results for the federal agency at a reduced cost. The solution reflects our team's dedication to careful stewardship and its confidence in the face of complex challenges.
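
As a rough illustration of what parallelizing an OCR workflow can look like, the minimal Python sketch below fans a list of page identifiers out to a pool of workers and collects the recognized text per page. It is not the pipeline delivered to the agency: the function names, the placeholder OCR step, and the use of a local thread pool are all assumptions made for illustration, and the case study does not name the OCR engine or the distributed framework that was used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def placeholder_ocr(page_id: str) -> str:
    """Hypothetical stand-in for a real OCR call; returns dummy text."""
    return f"text of {page_id}"


def ocr_pages_in_parallel(
    page_ids: Iterable[str],
    ocr_fn: Callable[[str], str] = placeholder_ocr,
    max_workers: int = 4,
) -> Dict[str, str]:
    """Run ocr_fn over every page concurrently and collect results by page id."""
    page_ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with page_ids.
        texts = list(pool.map(ocr_fn, page_ids))
    return dict(zip(page_ids, texts))


if __name__ == "__main__":
    # Illustrative page identifiers only; no real images are read.
    print(ocr_pages_in_parallel(["page-001", "page-002", "page-003"]))
```

Because each page is processed independently, the same pattern extends from a local worker pool to many machines, which is what makes this kind of workload a good fit for a distributed pipeline.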

Results

StandardData’s dataflow processor reduced the time needed to run OCR on over 350 GB of data from weeks to hours. In addition to yielding the lowest processing times and the lowest cost, StandardData’s solution created an environment of minimal IT overhead for the client.