OCR Processing Dataflow at Tremendous Scale


The Need

The federal agency in question required a comprehensive and efficient solution for large-scale Optical Character Recognition (OCR) processing of historical newspapers. The poor-quality OCR output associated with these newspapers was causing significant performance and accessibility issues within the data set. This vital data set is frequently used by academic researchers who rely on its accuracy to perform in-depth analysis of historical material, ultimately uncovering new and enlightening pieces of history.

The Challenge

Executing OCR processing on historical newspapers at a vast scale presents a formidable challenge. The complex layouts and intricate formatting inherent to newspapers make OCR implementation particularly difficult. In response, our team carefully selected the most appropriate OCR technology for the task and designed a parallelized processing pipeline, deployed on a leading government cloud platform.

The Solution

Our team's expertise and commitment to excellence enabled us to harness the power of relevant open-source technologies, ensuring that our client's technical and organizational requirements were met. By architecting a robust distributed processing pipeline, we successfully parallelized the OCR workflow, driving impressive results for the federal agency at a reduced cost. This solution reflects our team's dedication to careful stewardship and confidence in the face of complex challenges.
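The case study does not describe the pipeline's internals, but the core pattern, fanning page images out to a pool of OCR workers and collecting transcripts, can be sketched as follows. Here `ocr_page` is a hypothetical stand-in for a real OCR engine call (e.g., a Tesseract invocation), and the file paths are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_path):
    # Hypothetical placeholder for a real OCR call; a production
    # pipeline would invoke an OCR engine here. We return a fake
    # transcript so the sketch stays self-contained and runnable.
    return f"transcript of {page_path}"

def run_parallel_ocr(page_paths, max_workers=8):
    # Fan page images out to a worker pool; pool.map preserves
    # input order, so transcripts line up with their source pages.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_paths))

if __name__ == "__main__":
    pages = [f"newspaper/page-{i:04d}.tif" for i in range(1, 6)]
    transcripts = run_parallel_ocr(pages, max_workers=4)
    print(len(transcripts))
```

At real scale, the same fan-out shape would typically run across many machines via a distributed dataflow framework rather than a single in-process pool, but the unit of work, one page image in, one transcript out, stays the same.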

The Result

StandardData’s dataflow processor reduced the time to run OCR on over 350 GB of data from weeks to hours. Not only did StandardData’s solution deliver the fastest processing times at the lowest cost, it also created an environment of minimal overhead for the client.