AI Trends in OCR for the Localization Industry

25.01.2024

Pavel Safonov Head of Desktop Publishing

This article was originally published on multilingual.com.

Optical character recognition, or OCR, is an important step in preparing documents for localization. It requires thorough text processing before submitting a document to the translator. The reason why one should pay attention to document preparation is that not all documents can be edited directly. OCR tools may be required to convert files into an editable format.

In this article, we delve into the intricacies of preparing documents for translation and ascertain how AI can improve the quality, speed, and cost-effectiveness of this process.

What is OCR?

Figure 1. A schematic illustration of OCR process

In the context of this article, optical character recognition is a step in preparing documents for translation. It involves converting non-editable versions of documents (such as PDF, JPG, or TIFF files) into editable versions compatible with CAT tools for subsequent translation.

Various automated image and text recognition tools such as ABBYY FineReader, Adobe Acrobat, and Expert PDF are used for this purpose. To ensure compatibility with CAT tools:

All text to be translated must be editable
Linguistic units should not be broken into parts (no incorrect segmentation)
There should be no unnecessary formatting tags

Figure 2. CAT tool compatibility criteria

The document structure is also crucial. Pay attention to:

Avoiding unnecessary gaps between sections
Customized headers and footers
Automatically generated tables of contents
Automatically created lists
Customized styles
Etc.

All of these factors ensure high quality and faster processing of the document both before and after its translation.

OCR Stages and Their Automation

Figure 3. The stages of OCR

Detecting areas in FineReader

In this stage, recognition areas are detected in accordance with the purpose of the element: table, main text, image, background image with captions.

This process can be performed either automatically or manually. Note that a fully automated process may leave some text unrecognized or the destination incorrectly set. You can speed up the work by using area templates for identical layouts.

Checking text in FineReader

FineReader contains a built-in module for checking dubiously recognized characters. This step is performed by an operator who visually matches fragments that FineReader has detected as incorrectly recognized.

Clearing all text formatting in Word

After transferring the recognized document to Word, it is essential to clear all unnecessary formatting. This helps to prevent excessive tags in CAT tools that can hinder the translator’s work and pollute translation memory. You can use macros to automate this process.

Formatting in Word

Adjust page settings, headers, footers, styles, and lists, and make the document as a whole look as similar as possible to the original. This is a mostly manual task, with hardly any room for automation.

Checking the text in Word

Check spelling and numbers. You can optimize this step by using macros to highlight individual digits, periods, commas, and other special characters, considerably speeding up the checking process.

Where can AI be applied?

AI should target the most time-consuming step or steps with a high error probability.

The rough ranking of steps by duration is the following, starting from the most labor-intensive:

Formatting in Word
Detecting areas in FineReader
Checking text in FineReader
Checking text in Word
Clearing all text formatting in Word

Figure 4. Rough ranking of stages by duration

Let’s have a look at how to use AI in each tool.

FineReader

During the recognition process, FineReader uses AI-based recognition technologies. There is currently no way to improve the automatic area detection feature to speed up this step with the use of existing software. At the same time, FineReader continues to evolve, and we expect that future functionality will allow us to apply AI ourselves.

Word

The most time-consuming formatting step is completely manual and requires a visual comparison to the original.

One way to use AI with an already recognized document would be to incorporate it in the checking process for any spelling, number, and single-character errors. Computer vision technologies are already available. However, it remains uncertain whether an AI tool capable of comparing the original image with recognized text will evolve in the near future.

Using artificial intelligence to verify recognized text without an original image is not very effective. Since there is no access to the original text, the AI has no information about what the recognized text or numbers should be. A true AI gamechanger in this area will be able to compare the original text to the recognized version.

The opportunity to apply AI to OCR continues to evolve

While artificial intelligence can potentially improve some aspects of document preparation for translation, such as spelling and character recognition, current software functionality limits its ability to fully replace the verification stage done by humans.

That said, the opportunity to apply AI to OCR exists and continues to evolve. The underlying question is whether there will be sufficient long-term demand. Assuming that the volume of content requiring OCR decreases as digitalization progresses, the need for OCR will diminish as well.

However, demand for OCR is still strong owing to widespread use of the PDF format. And currently, all indications are that this file format will continue to be regularly used by people all over the world.

Want to discuss a topic or share your experience or opinion? Reach out to us at localize@intext.com or dtp.intext.com!