It's quite common these days to receive scanned documents from faxes or other sources as PDFs. These can be easy or rather devilish to convert to editable text using a variety of tools, but in some cases, they are simply wanted for reference. How do you search a large, scanned PDF document for a particular bit of text?
Mostly you don't.
Unless, of course, you are clever and convert the PDF to one of the various "text-on-image" PDF formats. If you are scanning hardcopy documents, it is also possible with many scanning applications to convert the input directly to such a format.
I use ABBYY FineReader 11 to make my scanned reference PDFs searchable. This is a quick and easy process that can be performed two ways.
The first and quickest method is to use the context menu by right-clicking on a PDF or image file in the Windows Explorer.
This creates a temporary, searchable PDF which can be saved under whatever name you like. I do this for documents which serve purely as references, where I have no interest in extracting text for translation. It has the disadvantage with FineReader of working with whatever defaults are in place for the last language used.
The second method involves importing the image document into the OCR program, then saving as a searchable document after OCR. This may be useful for documents that have more than one language, where you may apply different OCR settings (for languages) on various pages.
If automatic conversion is used (usually not recommended if you plan to extract text for translation), the process can be rather quick as well, though it is a bit more cumbersome than the context menu method. For example, a 114-page scanned German insurance policy from which I had to translate excerpts was imported from the original PDF, read (processed by optical character recognition) and saved as a PDF/A (a searchable text-on-image PDF, which is the current ISO standard for long-term archiving) in slightly less than 4 minutes.
Here's a screenshot of the text search in the PDF/A document using the Adobe Reader. Without this conversion process, it would be impossible to find any text in the document using search functions, because the entire content would be bitmapped images.
Even if you do very careful OCR to extract text for translation, defining zones and optimizing as I do, there are still significant advantages to making a searchable PDF as a reference. First of all, it is often very useful to see text in its proper layout context. Secondly, doing this also helps to identify and correct OCR errors during translation work. I recently translated a scientific article with horrible resolution in the faxed and scanned source document. It was definitely a borderline case for OCR, and when I imported it into a CAT tool for translation, I had to look up a number of places in the original document to see what the text really said. Copying the errors from the source of the OCR text and pasting them in the Search box of Adobe Reader made identifying the correct text a faster, easier process.
There are a number of tools available to convert PDF files to enable them for text search. This makes such resources "translator-friendlier" and may help us find the information we need to do a better job faster. Project managers and clients who are scanning documents for translation can just as easily prepare the PDF files in this format and help their service providers.
zappmedia network
Working with software in the translation business
2011-12-27
Best practice interoperability for outsourcers with memoQ
Given the growing popularity of Kilgray's memoQ as a staging platform for translation management in projects involving end customers and translators working with a variety of tools, it is increasingly important for translators using other tools to understand the best types of memoQ "bilingual" files to work with for their tools and the best procedures to apply. Here is a summary with links and advice relevant to various common tools and the reasons to adopt a particular approach where possible:
- Trados Workbench macros in Microsoft Word - While the bilingual DOC format of memoQ seems natural for this tool, it is often a very bad idea. Experience has shown that these are very prone to "break" when segmentation is changed or the content is copied into another file which does not contain the properties information needed for memoQ to recognize and re-import the bilingual DOC, updating the file to translate. Thus it is recommended to use the bilingual RTF tables, preferably with the mqInternal style set for the tags when the RTF file is generated in memoQ. The color difference makes it easier to check the tags when proofreading. The file should be cleaned before returning it to the outsourcer, so the target column contains only the translation.
- Trados TagEditor - The cleanest, most robust method involves using the source cells from a memoQ bilingual RTF file created with the mqInternal style specified for the tags. This source content is copied into a DOC or DOCX file, the dark red tag text hidden, and the prepared file is then translated in TagEditor. When the cleaned target file is saved, its content is pasted into the target column of the original memoQ bilingual RTF. If a comments column is provided in the RTF file, notes about terms to check or other matters can be added, and these will be available to the outsourcer after re-importing the bilingual file to memoQ. The procedure is described in detail in the lower part of the article here.
- SDL Trados Studio (2009 & 2011) - Because using the memoQ RTF tables enables certain formatting, such as bold, italic or underlined text, to be seen in SDL Trados Studio, this is recommended over the use of XLIFF files for exchange. This also avoids the current bug in SDL Trados Studio which makes it difficult to import XLIFF files if the sublanguages are not specified. A robust procedure offering tag protection is described here.
- WordFast Pro - The procedure to work with memoQ content and protect the tags is essentially the same as the recommendation for TagEditor, except that WordFast Pro can work directly with RTF files, so it is not necessary to move the content to a Microsoft Word file. The method is described here.
- Wordfast Classic - While the bilingual DOC format of memoQ is "inviting" for this tool, just as with the Trados TWB macros in Microsoft Word, experience has shown that translators are very prone to "break" the DOC files by changing segmentation or copying the content into another file which does not contain the properties information needed for memoQ to recognize and re-import the bilingual DOC, updating the file to translate. Thus, as with the Trados Workbench approach, it is recommended to use the bilingual RTF tables, preferably with the mqInternal style set for the tags when the RTF file is generated in memoQ. The color difference makes it easier to check the tags when proofreading. The file should be cleaned before returning it to the outsourcer, so the target column contains only the translation.
- OmegaT - Although OmegaT handles XLIFF nicely in general, there are possible problems with the current build of memoQ. Here there is a recommended procedure for working with the bilingual RTF tables by copying the source content into an ODT (Open Office) or DOCX file; after translation, the cells are copied into the target column of the bilingual RTF and any comments necessary are added (if the column for them is provided). The article also contains tips on the terminology data format for OmegaT to facilitate the export of terminology from memoQ.
2011-12-25
Translating content from memoQ using Trados TagEditor
The growing popularity of Kilgray's memoQ among translation agencies and corporate clients has sometimes posed challenges for users of other tools. One of the great advantages of memoQ is its ability to provide data which is compatible with many other tools, but it is still necessary to know the best way to do so to avoid trouble.
If you use an older version of Trados with TagEditor, one way to work with your client using memoQ is to request the content to translate as a bilingual XLIFF (*.xlf) file where the entire source text has been copied to the target segments. SDL Trados 2007 includes a default INI for XLIFF which will then allow you to read those "target" segments as the source in TagEditor. However, the default INI file for XLIFF in TagEditor requires optimization; among other things, it does not protect sensitive header information in XLIFF files from memoQ and SDL Trados Studio. (The German consultancy Loctimize has written some instructions on updating the INI; although these are focused on SDLXLIFF files, some of the information is relevant to XLIFF from memoQ and probably other sources.)
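Normally the client prepares the file in memoQ, but the idea of copying the source text to the target segments can be illustrated with a minimal Python sketch. It assumes plain XLIFF 1.2 with the standard namespace; the function name `copy_source_to_target` is hypothetical, and real memoQ files carry vendor extensions whose namespace prefixes this simple approach may rewrite, so treat it as an illustration rather than a production tool.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: the file is plain XLIFF 1.2 using the standard namespace.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def copy_source_to_target(xliff_text: str) -> str:
    """Fill every empty or missing <target> with a copy of its <source>."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.fromstring(xliff_text)
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        if source is None:
            continue
        target = unit.find(f"{{{XLIFF_NS}}}target")
        has_content = target is not None and (
            (target.text or "").strip() or len(target) > 0
        )
        if has_content:
            continue  # never overwrite an existing translation
        # Copy text and inline tags, but not source-only attributes.
        new_target = copy.deepcopy(source)
        new_target.tag = f"{{{XLIFF_NS}}}target"
        new_target.attrib.clear()
        if target is not None:
            new_target.attrib.update(target.attrib)
            position = list(unit).index(target)
            unit.remove(target)
        else:
            position = list(unit).index(source) + 1
        unit.insert(position, new_target)
    return ET.tostring(root, encoding="unicode")
```

Segments that already contain a translation are left alone, so running the sketch twice on the same file does no harm.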
Translation memory content, if available, should be provided to you in TMX format, which can be read into your TWB translation memory. memoQ can also export terminology content as CSV for opening in Excel or as MultiTerm XML to import into SDL Trados MultiTerm if you use that tool. Thus your client is also able to provide you with any translation memory or terminology resources which are available.
After you have completed your translation, clean the TTX file from TagEditor to create a target XLIFF file (or just use the File > Save Target As... menu option in TagEditor). This finished XLIFF is all you need to return to the client, not your "uncleaned" TTX. When the XLIFF file is re-imported to memoQ it will include your complete translation. In case there are problems with the tags, your client will also be able to determine this and make corrections using memoQ's QA tools, though you should of course perform a careful tag check using the functions in TagEditor before you deliver.
Another popular method of data exchange for clients working with memoQ is to use the "bilingual RTF tables" in memoQ. If the files are properly prepared with a special workflow involving hiding the tags and converting the RTF to Microsoft Word format (which is described here), this is currently the best method for translating content from memoQ with TagEditor. If the RTF content is imported unmodified into TagEditor, the memoQ tags will not be protected and must be checked by the client very carefully in your delivered file. (The bilingual RTF file from memoQ must also be saved as a Microsoft Word file, because TagEditor will not read RTF properly - after translation, the file needs to be saved as RTF again.) If the client uses this method, ensure that the entire content of the source text column is copied to the target column and that the text property of all the text in the file except the target column content to translate is set to "hidden". TagEditor will then ignore the hidden text and allow you to translate the rest. After you have finished the translation, create a target file and set all the text in it to visible again. If you do work with memoQ content in this format, it is convenient if your client includes a Comments column in the file, because when you proofread your work, you can note any uncertain terms or source text problems (or other matters) in that Comments column. When the bilingual RTF table is re-imported into the client's memoQ project, the commented content can be filtered quickly and any issues identified and addressed quickly.
Translating and delivering Trados formats with other tools
For many years, there have been frequent, unnecessary misunderstandings between outsourcers and translators regarding the tools necessary to translate jobs for which particular data formats are required. With the current exception of most server-based projects, it is very seldom true that translations must be done with the same tools used to prepare the data for translation or manage the translated data resources.
In other words, if you as a translator work with an agency or a direct client who uses a common tool such as a current or older version of SDL Trados, WordFast, memoQ or most other professional tools, it is possible to translate the data safely in the format your customer desires even if you use a different translation environment or in some cases none at all. This post focuses on satisfying the requirements for "Trados jobs", due to the widespread use of this tool in various versions over the past two decades among corporate clients and translation agencies.
There are many tools which claim to be "compatible" with Trados but which are in fact not fully compatible, or which are compatible only if the right techniques are used to prepare and exchange the data. This is not difficult, but it does require attention to detail and proper methods for the specific case involved.
The latest versions of SDL Trados (SDL Trados Studio 2009 and 2011) use an underlying data format which is a version of the XLIFF standard, for which SDL uses its own extension (SDLXLIFF) rather than the usual *.xlf extension. However, SDLXLIFF can be processed by any tool capable of working correctly with XLIFF, which includes the later versions of Atril's DVX and the current DVX2 as well as Kilgray's memoQ, the Open Source tool OmegaT (be careful - tags must be checked carefully and possibly repaired afterward!) and many others. If you are using a tool other than a version of Trados and your customer requires full compatibility with SDL Trados Studio 2009 or SDL Trados Studio 2011, request the files to translate in SDLXLIFF format. Then import these into your working environment using an XLIFF filter. Your deliverable file will be the translated SDLXLIFF upon export from your translation tool. Please note that you cannot generate a target file ("cleaned file") from environments other than SDL Trados Studio if you are working with SDLXLIFF files. Your customer with SDL Trados Studio must do that.
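Since tags may need checking after a round trip through another tool, a rough sanity check can be scripted. The following Python sketch compares the inline tags of each source and target in an XLIFF 1.2 file and lists the units where they differ. The function names are hypothetical, and it is an assumption here that skipping `mrk` segment markers is enough to compare SDLXLIFF-style targets with their sources; the QA functions of your translation tool remain the authoritative check.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: standard XLIFF 1.2 namespace, as also used by SDLXLIFF.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
# Assumption: <mrk> wrappers mark segments and are not formatting tags.
IGNORED_ELEMENTS = {"mrk"}


def inline_tags(element):
    """Count (element name, id) pairs for the inline tags inside an element."""
    tags = Counter()
    if element is None:
        return tags
    for child in element.iter():
        if child is element:
            continue
        name = child.tag.rsplit("}", 1)[-1]  # strip the namespace part
        if name in IGNORED_ELEMENTS:
            continue
        tags[(name, child.get("id"))] += 1
    return tags


def find_tag_mismatches(xliff_text):
    """Return the ids of trans-units whose target tags differ from the source."""
    root = ET.fromstring(xliff_text)
    problems = []
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        target = unit.find(f"{{{XLIFF_NS}}}target")
        if target is None:
            continue  # nothing translated yet, nothing to compare
        source = unit.find(f"{{{XLIFF_NS}}}source")
        if inline_tags(source) != inline_tags(target):
            problems.append(unit.get("id"))
    return problems
```

The check only counts tags; it does not verify their order or nesting, which a dedicated QA tool would also examine.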
Some other environments, such as Trados TagEditor (a tool in older versions of Trados) cannot successfully process an XLIFF file unless the source text is copied to the target segments. Thus, for example, if you plan to translate an SDLXLIFF file using TagEditor in SDL Trados 2007, you must ensure that the source has been copied to the target text, because the INI supplied for XLF files in the old version of Trados only reads the tags for the target segments. The TagEditor INI also requires updating to work with some of the new tag structures. Such a procedure is not necessary in a tool like Kilgray's memoQ, however, because it can access both the source and target tags of the SDLXLIFF (XLIFF) file.
If your client works with an older version of Trados and wants your translation data in an older Trados data format such as TTX or an "uncleaned" bilingual RTF or MS Word document, this is also possible to do safely, with 100% compatibility guaranteed, if the right procedures are applied. The best method to follow, even if you are working with a "Trados-compatible" tool such as WordFast, is to have the working files created and "presegmented" with the desired version of Trados. If you do not have that version, this is a task for your customer to prepare the files and ensure compatibility.
Presegmentation is a form of pretranslation, which might copy the entire source text to target segments or also insert fuzzy matches where they exist in the client's translation memory. This technique ensures that the segmentation rules followed are those set in the client's environment and that "maximum leverage" (best use) of the client's translation memory is achieved. A very detailed description of the methods necessary has been published here on the Translation Tribulations blog. The gist of it is that the "presegmentation" is to be done using Trados Workbench on the RTF, Microsoft Word, TTX or other files (which are then converted to TTX) with the unknown sentences being segmented and the selection in the translation memory options to copy the source text to the target on no match.
Understanding procedures like these is important to working together successfully and focusing on what is most important: achieving the best translation quality without technical compromises that cause lost time and money. Translators should be able to work in the environments they find most productive while still ensuring that the content delivered does not cause technical difficulties for their clients. As described here, it is entirely possible to translate and deliver files without technical difficulties for clients who "require Trados" even if you do not use Trados yourself.
2011-12-08
How to achieve 100% updating of the TM
1) Segmentation mismatching:
Every CAT tool offers settings for segmentation rules. These rules define the length and structure of the text to be identified and treated as a segment in the translation memory. These preset values persist even if the segmentation of individual sentences is changed during translation by splitting or merging segments. As a result, the TM gets updated with the split or merged segments, rather than with those originally counted. Thus a subsequent analysis of the source files against the updated TM may not recognise the originally counted segments as 100% matches.
2) Incomplete segmentation: A similar issue occurs if sentences are not segmented at all, for example because their contents were overwritten manually or already included in previous segments. Such sentences are not included in the TM update and are reported as No Matches in subsequent analyses.
Solution:
– Do not change the segmentation of the source text manually during translation.
– Do not split up or merge segments.
– Do not edit parts of the source text manually without segmentation.
– Never leave text unsegmented. If you already included its content in another segment, segment the superfluous text nonetheless and fill it with a plain space instead of a translation.
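The effect of merged segments can be illustrated with a small Python sketch that compares the originally counted source segments with the source side of a TMX export of the updated TM. The function names and the whitespace normalization are assumptions made for illustration; an analysis in the CAT tool itself applies its own matching rules and remains the real test.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def normalize(text):
    """Collapse runs of whitespace so trivial spacing differences are ignored."""
    return " ".join(text.split())


def missing_from_tm(segments, tmx_text, source_lang):
    """Return the counted segments that have no exact source match in the TMX."""
    root = ET.fromstring(tmx_text)
    tm_sources = set()
    for tuv in root.iter("tuv"):
        # Older TMX versions use a plain "lang" attribute instead of xml:lang.
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if not lang.lower().startswith(source_lang.lower()):
            continue
        seg = tuv.find("seg")
        if seg is not None:
            tm_sources.add(normalize("".join(seg.itertext())))
    return [s for s in segments if normalize(s) not in tm_sources]
```

If two sentences were merged during translation, the TM contains only the merged unit, so both original sentences show up in the returned list even though the text was fully translated.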
2011-08-22
How to ensure reliable cleanup of TTX documents
TagEditor files have to be treated cautiously, as they tend to be sensitive and error-prone. To recognize issues as they arise, it is generally recommended to run the command File > Save Target As... as often as possible during translation, at least twice per hour. This option generates a clean target file from the current state of the translation and lets you verify from the very beginning that the translation so far is technically correct and can be cleaned up without further ado.
In addition, you should try to avoid at least 3 of the most common reasons for cleanup failure:
1) Incorrect workstation or folder
A TTX file cannot be processed on different instances of TagEditor/Workbench. The target file must be generated on the same computer on which the translation was done. Furthermore, the source file that was used to create the bilingual TTX file must be located in the same folder as the translated TTX file.
2) Inadmissible segmentation or tag changes
TagEditor files react with extreme sensitivity to any changes to tags or segmentation. Common mistakes include generating empty segments, taking over wrong tags from fuzzy matches or eliminating characters between opening and closing tags.
To avoid such mistakes, set the tag verification of TagEditor to the strictest level before you start translating. (To do so, activate Tools > Options > Verification > Strict, and deactivate Don't check tags when Translating to Fuzzy.)
Upon completion of the translation, the verification itself should be run and any errors it reports should be corrected (Tools > Verification).
Tip:
Tag differences can be eliminated by copying the source tag into the translation. Most empty segments or empty opening/closing tags can be fixed by simply adding a space.
3) Special case INX
Trados is equipped with a special filter for INX files only starting from version 7.1. TagEditor, though, can open and process INX files already starting from version 7.0. However, due to the lack of the appropriate filter, these earlier versions cannot generate target files in INX format. There is no workaround - do not edit INX files with Trados versions older than 7.1!
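As a rough pre-check for the empty segments mentioned under point 2, a TTX file can also be scanned outside TagEditor. The Python sketch below assumes that translation units are stored as `Tu` elements containing two `Tuv` children, source first and target second; this layout and the function name are assumptions made for illustration, and the verification in TagEditor remains the decisive check.

```python
import xml.etree.ElementTree as ET


def empty_target_segments(ttx_text):
    """Return the 1-based positions of translation units with an empty target.

    Assumption: each <Tu> holds a source <Tuv> followed by a target <Tuv>.
    A target that contains only a space counts as filled, in line with the
    tip of fixing empty segments by adding a space.
    """
    root = ET.fromstring(ttx_text)
    empty = []
    for position, tu in enumerate(root.iter("Tu"), start=1):
        tuvs = tu.findall("Tuv")
        if len(tuvs) < 2:
            empty.append(position)  # no target element at all
            continue
        target_text = "".join(tuvs[1].itertext())
        if target_text == "":
            empty.append(position)
    return empty
```

TTX files are often saved as UTF-16, so a real file would need to be read with a matching encoding before its text is passed to the function.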
Labels:
Trados
2011-08-17
How to force Trados to clean up a segmented Word document
The complex formatting of some Word documents still challenges Trados. Working with the Workbench of the old Trados versions up to 7.5 allows users to manually force segmentation even for those text items that cannot be processed automatically by Trados routines. What looks like an advantage during translation turns into a problem when it comes to cleanup and target file generation, as Trados sometimes refuses to clean up documents containing such "forced" segments.
Solution: Enter your translation into the TM segment by segment, create a copy of the fully segmented document and open it in MS Word. To remove the segmentation from the document and generate the clean target version, run the macro tw4winClean.Main under Tools > Macro > Macros (MS Word 98-2003) or View > Macros > Show Macros (MS Word 2007 and later).
Attention: This procedure does not update the TM. To ensure that all segments were saved in the TM, perform a control analysis against the updated TM. Export the updated TM and include the TM export in your final delivery to the client.


