Imagine a world where the history of your data is at your fingertips, clearer and more accessible than ever before. With the advent of open-source Optical Character Recognition (OCR) technologies, organizations are quickly realizing historical datasets into accessible, structured data readily accessible at their fingertips. StandardData is at the forefront of this movement, empowering organizations to simplify complex data interactions and make valuable insights more attainable than ever. Most recently, a major federal government institution was concerned with the quality of text recognition and, thus, the quality of the data of its 30 million newspaper records. By implementing advanced, open-source OCR technology, we didn't just improve the text recognition quality; we revolutionized it. We drastically improved the text recognition quality by 10x and cut the cost of running their data processing pipeline by 50%. In addition, we moved the data to Amazon Web Services (AWS), which increased speed by 99%. The best part? It was all based on a serverless infrastructure, so now they pay for what they use and nothing more.
There is a lot of talk about utilizing Artificial Intelligence (AI) to optimize and transform your data. However, your data needs to be cleaned and prepared first. As part of that, OCR technology has emerged as a valuable business solution that enables organizations to automate data extraction from printed or handwritten text found in scanned documents or image files. This technology is designed to recognize and convert text characters into machine-readable data that computers can easily process. OCR technology can convert a wide range of printed and handwritten documents, including books, newspapers, invoices, receipts, and forms, among others.
In the digital age, preserving historical records is a critical mission, and OCR is key to its success. Many governments and organizations worldwide are taking the necessary steps to digitize their archives -- some make them accessible to the public, and some do it to improve internal process efficiency. Our client has been at the forefront of this effort, managing government-operated digital preservation, and is widely recognized for its massive collection of historical newspapers. Despite their impressive collection, their digital collections faced a significant challenge due to the poor quality of their current OCR data. The existing OCR technology, which dates back to 2005, was no longer sufficient for their needs, and many of the pages were difficult to read. This, in turn, impacted the accuracy and accessibility of their archives, which was a major concern. To overcome this challenge, our client sought the assistance of our experts with a clear goal: to enhance OCR quality while reducing the processing time and cost, which were currently quite high.
After conducting research and analysis, our team at StandardData decided to implement an advanced, open-source OCR model that would significantly improve text recognition quality. This model was specifically designed to convert previously unreadable pages into clear, searchable text, making it ideal for our client's needs. Using an open-source OCR model provided more flexibility compared to a proprietary model that may become obsolete in the near future.
To show the impact that a 10X improvement in OCR quality can make, we've included an example of one document, an old OCR transcript, and our new, improved version below.
ORIGINAL DOCUMENT:
Old OCR Transcript:
New OCR Transcript:
The AI industry is pushing hard to advance the capabilities of model architectures. Still, problems arise when AI consulting firms spend their customers' money developing proprietary algorithms/models that simply will not compete with open-source counterparts in the long term. StandardData stands in contrast, maintaining that firms' competitive advantages lay in their data, NOT whatever AI model is used to consume that data into actionable intelligence.
Therefore, we do not recommend our clients invest in proprietary models; instead, we recommend leveraging open-source AI models that improve year over year through the sheer will of the open-source workforce alone. By doing so, organizations can invest in their data and its supportive infrastructure by taking advantage of the great work and models produced by the open-source community!
Previously, our client utilized an on-site server for OCR tasks with a single instance setup. This meant the client could only process one image at a time, leading to significant overhead costs for hardware maintenance, electricity, and permissions management. The process was time-consuming and costly. In addition, the server's CPU was nearly a decade old, rendering it inefficient and slow. Consequently, the cost of the server's electricity expenses alone was actually more expensive than opting for a faster, serverless, and parallelized solution offered by AWS.
In response to these challenges, we transitioned the digital collections's system to AWS, embracing a serverless architecture that vastly improved processing efficiency. AWS was selected for its capability to simultaneously distribute the processing workload across hundreds or thousands of machines, thereby significantly speeding up the OCR process. This move to the cloud introduced numerous advantages, such as enhanced flexibility, scalability, cost-effectiveness, and heightened security. AWS's comprehensive suite of services was customized to address the client's unique requirements, offering the flexibility to adjust resources promptly by demand fluctuations, eliminating the financial burden associated with physical server upkeep. Finally, AWS's advanced security protocols safeguard sensitive information and applications, underscoring the benefits of migrating to a cloud-based solution for our client.
The outcome of the project was nothing short of remarkable. Our team achieved a 10X improvement in OCR quality, successfully converting previously unreadable pages into fully legible text. Moreover, there was a 94% reduction in cost, from $300 per batch to $20 per batch. The time taken to complete the implementation also significantly improved, from three weeks to just one hour.
To summarize, by utilizing open-source OCR and shifting to the cloud, the project exceeded its goals and enabled our client to offer precise and easily accessible archives to its patrons. Our team at StandardData is proud to have contributed to this digital preservation effort, and we look forward to more such opportunities in the future.
If your organization is looking to leverage historical datasets, please contact us for more information.