2012-03-22

Improving scanned PDFs for translation reference

It's quite common these days to receive scanned documents from faxes or other sources as PDFs. These can be easy or rather devilish to convert to editable text using a variety of tools, but in some cases, they are simply wanted for reference. How do you search a large, scanned PDF document for a particular bit of text?

Mostly you don't.

Unless, of course, you are clever and convert the PDF to one of the various "text-on-image" PDF formats. If you are scanning hardcopy documents, it is also possible with many scanning applications to convert the input directly to such a format.

I use ABBYY FineReader 11 to make my scanned reference PDFs searchable. This is a quick and easy process that can be performed two ways.

The first and quickest method is to right-click a PDF or image file in Windows Explorer and use the context menu.

This creates a temporary, searchable PDF which can be saved under whatever name you like. I do this for documents which serve purely as references, where I have no interest in extracting text for translation. The disadvantage with FineReader is that this method works with whatever defaults are in place for the last language used, so the recognition language may not match the document.

The second method involves importing the image document into the OCR program, then saving as a searchable document after OCR. This may be useful for documents that have more than one language, where you may apply different OCR settings (for languages) on various pages.

If automatic conversion is used (usually not recommended if you plan to extract text for translation), the process can be rather quick as well, though it is a bit more cumbersome than the context menu method. For example, a 114-page scanned German insurance policy from which I had to translate excerpts was imported from the original PDF, read (processed by optical character recognition) and saved as a PDF/A (here a searchable text-on-image PDF; PDF/A is the ISO-standardized format for long-term archiving) in slightly less than 4 minutes.

Here's a screenshot of the text search in the PDF/A document using Adobe Reader. Without this conversion process, it would be impossible to find any text in the document using search functions, because the entire content would consist of bitmapped images.

Even if you do very careful OCR to extract text for translation, defining zones and optimizing as I do, there are still significant advantages to making a searchable PDF as a reference. First of all, it is often very useful to see text in its proper layout context. Secondly, doing this also helps to identify and correct OCR errors during translation work. I recently translated a scientific article from a faxed and scanned source document with horrible resolution. It was definitely a borderline case for OCR, and when I imported it into a CAT tool for translation, I had to look up a number of places in the original document to see what the text really said. Copying the faulty passages from the OCR text and pasting them into the Search box of Adobe Reader made identifying the correct text a faster, easier process.

There are a number of tools available to convert PDF files to enable them for text search. This makes such resources "translator-friendlier" and may help us find the information we need to do a better job faster. Project managers and clients who are scanning documents for translation can just as easily prepare the PDF files in this format and help their service providers.
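
For those who prefer a scripted route, here is a minimal sketch of the same idea using a different tool from the one described above: the open-source command-line program ocrmypdf, called from Python. This is not the FineReader workflow, only an illustration under assumptions. It assumes ocrmypdf is installed along with the language data it needs, and the file names and the helper functions build_ocr_command and make_searchable are hypothetical names chosen for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(source, target, languages=("deu",)):
    """Build the ocrmypdf argument list for a searchable PDF/A.

    Several recognition languages are joined with '+', which is how
    ocrmypdf expects them (for example 'deu+eng').
    """
    return [
        "ocrmypdf",
        "--language", "+".join(languages),
        "--output-type", "pdfa",  # write the result as PDF/A
        str(source),
        str(target),
    ]


def make_searchable(source, target, languages=("deu",)):
    """Run OCR on a scanned PDF and save a searchable copy.

    Assumption: the ocrmypdf program is available on the PATH.
    """
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(source, target, languages), check=True)
    return Path(target)


if __name__ == "__main__":
    # Hypothetical file names; only the command is printed here.
    print(" ".join(build_ocr_command("policy_scan.pdf",
                                     "policy_searchable.pdf",
                                     ("deu", "eng"))))
```

Calling make_searchable("policy_scan.pdf", "policy_searchable.pdf") would then leave the scanned images as they are and add a text layer behind them, so the search functions of a PDF reader can find text in the file. Unlike the per-page language settings described above, the languages given here apply to the whole document.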