2012-03-22

Improving scanned PDFs for translation reference

It's quite common these days to receive scanned documents from faxes or other sources as PDFs. These can be easy or rather devilish to convert to editable text using a variety of tools, but in some cases, they are simply wanted for reference. How do you search a large, scanned PDF document for a particular bit of text?

Mostly you don't.

Unless, of course, you are clever and convert the PDF to one of the various "text-on-image" PDF formats. If you are scanning hardcopy documents, it is also possible with many scanning applications to convert the input directly to such a format.

I use ABBYY FineReader 11 to make my scanned reference PDFs searchable. This is a quick and easy process that can be performed in two ways.

The first and quickest method is to right-click a PDF or image file in Windows Explorer and use the context menu.

This creates a temporary, searchable PDF which can be saved under whatever name you like. I do this for documents which serve purely as references, where I have no interest in extracting text for translation. The disadvantage with FineReader is that this method works with whatever defaults are in place for the last language used.

The second method involves importing the image document into the OCR program, then saving as a searchable document after OCR. This may be useful for documents that have more than one language, where you may apply different OCR settings (for languages) on various pages.

If automatic conversion is used (usually not recommended if you plan to extract text for translation), the process can be rather quick as well, though it is a bit more cumbersome than the context menu method. For example, a 114-page scanned German insurance policy from which I had to translate excerpts was imported from the original PDF, read (processed by optical character recognition) and saved as a PDF/A (a searchable text-on-image PDF, which is the current ISO standard for long-term archiving) in slightly less than 4 minutes.
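For readers who prefer to script this kind of batch conversion, here is a minimal sketch in Python. It does not use FineReader at all: it assumes the open-source OCRmyPDF command-line tool is installed along with the Tesseract language packs you need, and the function names and file names are hypothetical. It only illustrates the same idea (OCR a scanned PDF, possibly with more than one language, and save it as a searchable PDF/A).

```python
import shutil
import subprocess
from typing import List, Sequence


def build_ocr_command(src: str, dst: str, languages: Sequence[str]) -> List[str]:
    """Build an OCRmyPDF command line that writes a searchable PDF/A.

    `languages` are Tesseract language codes such as "deu" or "eng".
    """
    if not languages:
        raise ValueError("at least one OCR language is required")
    return [
        "ocrmypdf",
        "-l", "+".join(languages),   # e.g. "deu+eng" for a bilingual document
        "--output-type", "pdfa",     # text-on-image PDF/A output
        src,
        dst,
    ]


def make_searchable(src: str, dst: str, languages: Sequence[str] = ("deu",)) -> None:
    """Run the OCR command; assumes the ocrmypdf executable is on the PATH."""
    cmd = build_ocr_command(src, dst, languages)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("ocrmypdf was not found on the PATH")
    subprocess.run(cmd, check=True)
```

A call such as `make_searchable("policy_scan.pdf", "policy_searchable.pdf", ["deu"])` would then produce the reference copy. Unlike the second FineReader method, this sketch applies the same language set to every page.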

Here's a screenshot of the text search in the PDF/A document using the Adobe Reader. Without this conversion process, it would be impossible to find any text in the document using search functions, because the entire content would be bitmapped images.

Even if you do very careful OCR to extract text for translation, defining zones and optimizing as I do, there are still significant advantages to making a searchable PDF as a reference. First of all, it is often very useful to see text in its proper layout context. Secondly, doing this also helps to identify and correct OCR errors during translation work. I recently translated a scientific article with horrible resolution in the faxed and scanned source document. It was definitely a borderline case for OCR, and when I imported it into a CAT tool for translation, I had to look up a number of places in the original document to see what the text really said. Copying the errors from the source of the OCR text and pasting them in the Search box of Adobe Reader made identifying the correct text a faster, easier process.
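The lookup described above (paste the suspicious OCR string into the search box, then read the real text off the page image) can be roughly sketched in code as well. This is only an illustration under assumptions: `page_texts` is a hypothetical list holding the hidden text layer of each page, obtained with whatever PDF library you like, and the function merely reports which pages to open.

```python
from typing import List, Sequence


def find_pages(page_texts: Sequence[str], snippet: str) -> List[int]:
    """Return 1-based page numbers whose text layer contains `snippet`.

    Whitespace and letter case are normalized, since OCR text often
    breaks lines in places where the pasted snippet does not.
    """
    needle = " ".join(snippet.split()).casefold()
    if not needle:
        return []
    hits = []
    for number, text in enumerate(page_texts, start=1):
        haystack = " ".join(text.split()).casefold()
        if needle in haystack:
            hits.append(number)
    return hits


if __name__ == "__main__":
    # Invented sample text with a typical OCR misreading ("Versichcrung").
    pages = [
        "Der Versicherungsnehmer zahlt die Prämie.",
        "Die Versichcrung endet\nmit Ablauf des Jahres.",
    ]
    print(find_pages(pages, "versichcrung endet mit"))
```

Because the searchable PDF carries the same OCR errors as the extracted text, searching for the error itself is what leads you to the right spot in the image.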

There are a number of tools available to convert PDF files to enable them for text search. This makes such resources "translator-friendlier" and may help us find the information we need to do a better job faster. Project managers and clients who are scanning documents for translation can just as easily prepare the PDF files in this format and help their service providers.

11 comments:

  1. Thanks for your interesting blog post.
    I didn't know what PDF/A was, thanks a lot.
    By "Adobe FineReader 11", you mean ABBYY FineReader 11?

    1. Thx for the hint. Of course it is ABBYY :-)

    2. Fred, the PDF Association web site has quite a lot of information on PDF/A and other PDF formats. They also do presentations and workshops at a number of trade events. I find it useful to keep track of such things, not only for translation projects, but because I often encounter questions relevant to document security and archiving.

    3. Thanks a lot for the link and info. Can protected PDFs be processed the same way by ABBYY FineReader? (of course there's the pdfunlock website that can help (I only use it for non-confidential documents)).

    4. FineReader cannot unlock PDF files. But as a workaround you might save the PDF as TIF or any other bitmap format and load it into your OCR software again...

  2. I charge 50% extra for PDFs. Guess what? The clients are usually pretty quick in finding the editable version, and in the rare cases where they don't, they agree to pay the surcharge. They do come back for more, but they mostly never send me PDFs again.

    Seriously, I want to spend my time translating. It is all getting to the point where translators hardly spend time translating anymore, and they are increasingly becoming DTP specialists. Thanks, but no thanks.

    By the way, the above post is also mine (for some reason, the text box got all messed up).

    Viktoria Gimbe

    1. @Viktoria: Hefty surcharges for PDFs to be translated make sense and I follow a similar practice myself. What I am discussing here, however, is a method of dealing with PDFs which may be strictly for reference or for which you may already have a decent text to translate but where you require the layout context for what you are translating. An example of the former case might be scanned patents related to a patent I am translating. I might want to look up certain terms in those patents (which I am NOT translating), but without converting them to searchable documents that is extremely time-consuming. Most of the patent documents available as image PDFs online are not searchable.

  3. Hi Kevin,

    This works on OmniPage 18 too (Scanned Document to Searchable PDF). Btw, I absolutely agree with Viktoria about what the work of a translator SHOULD be.

    HH

    1. @HH: You are right. There is even more OCR software that can do this job somehow. And that's the snag! Which one is best at preparing results that translators can use (e.g. in CAT tools)? We found that it is ABBYY FineReader.

    2. @zappmedia: ABBYY FineReader? For preparing texts for translation, perhaps. I've heard similar claims from others. But for this particular task - making a reference document searchable - I don't think any of the good professional tools will be noticeably better than the others. I'm sure that both FineReader and OmniPage will do better at this task than the software I got free with my HP printer/scanner/fax, which also can create a searchable image PDF.

  4. With regard to preparing OCR content for translators, I would like to put in a quick plug for Dave Turner's Code Zapper macros. For years now, these have saved translators countless hours of grief, and a number of agencies (as well as a great many translators) I know use them routinely to prepare files so that words won't be broken up with trash tags, etc.
