XML (eXtensible Markup Language) files

XML is one of the most versatile data formats. XML files can hold text, structured data, even program code. Its greatest advantage is probably that XML files are readable by humans and machines alike.

The XML format defines a kind of markup that can encode almost anything that can be written down. Text enclosed in named tags, optionally with attributes on those tags, can be XML.
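As a rough illustration, the fragment in the sketch below shows what such markup can look like. The element and attribute names (note, title, body, lang) are invented for this example and do not come from memoQ. Python's standard library is used only to check that the text is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names are invented for illustration.
SAMPLE = """<note lang="en">
  <title>Reminder</title>
  <body>Call the <b>translator</b> on Monday.</body>
</note>"""


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(SAMPLE))                   # properly nested tags
print(is_well_formed("<note><title></note>"))   # <title> is never closed
```

Well-formedness is only the minimum requirement: a file can pass this check and still need a carefully configured filter before the right parts of it become translatable.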

In many cases, XML files contain text, and text needs translation.

Many well-known document types are XML in disguise. XHTML, the XML-based variant of HTML, is XML. Word documents, at least DOCX, consist of XML files packed in a ZIP archive. XLIFF files are XML. TMX files, which hold translation memory contents, are XML. There are many more, and these are all different "flavors" of XML. A particular flavor of XML is called, perhaps surprisingly, a document type.

memoQ knows how to process XML. But if the document you get is in a well-known flavor of XML (for example, HTML, DOCX, or XLIFF), memoQ usually has a specialized filter for it. What's more, memoQ finds the specialized filter for you when you import such a document: when you import a Word document, memoQ never offers the XML filter, even though it uses the XML filter to read the bulk of the document.

Many authoring and database systems store - or export - contents in XML files. When you translate them in memoQ, use this XML filter to import them.

XML can be multilingual; use the Multilingual XML filter for such files: Some XML files contain the same text in several languages. To import those files, use the Multilingual XML filter.

XML configurations are complex, and it's worth saving them: It may take several hours to prepare memoQ to import everything properly. When you finish this work, make sure you click the Save button next to the Filter configuration drop-down box in the Document import settings window. Next time you need to import an XML file from the same source, you can just choose the saved configuration from the same Filter configuration drop-down box.

How to get here

  1. Start importing an XML file.
  2. In the Document import options window, select the XML files, and click Change filter and configuration.
  3. The Document import settings window appears. From the Filter drop-down box, choose XML filter.

    [Image: xml-encoding-tab]

What can you do?

To import an XML file properly, you need to set up a lot of things. To make it easier, this topic uses the following example:

[Image: xml-sample-document]

Like any XML document, it contains ordinary text that needs to be translated, interspersed with tags such as <doc> that primarily hold descriptive or structural information. Tags can have attributes, and attributes have values (for example, id="0527"). The following sections explain how memoQ can interpret these.
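To make the distinction between tags, attributes, and text concrete, here is a minimal sketch in Python. The XML fragment is a hypothetical stand-in assembled from details mentioned in this topic (the doc tag, id="0527", the updInfo attribute, an img tag with alt text); the actual layout of the sample document, the para element, and the src value are assumptions. The sketch only lists what a generic XML parser sees and does not reflect how memoQ works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample document; the layout is assumed.
SAMPLE = """<doc id="0527" updInfo="Aug-04-2006(NOT TO BE TRANSLATED)">
  <para>See the figure <img src="fig1.png" alt="Diagram for illustration purposes"/> below.</para>
</doc>"""


def describe(xml_text):
    """Return (tag, attributes, leading text) for every element, in document order.

    elem.text holds only the text before the element's first child;
    text that follows a child element is stored in that child's .tail.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        text = (elem.text or "").strip()
        rows.append((elem.tag, dict(elem.attrib), text))
    return rows


for tag, attrs, text in describe(SAMPLE):
    print(tag, attrs, repr(text))
```

In this stand-in, the id and updInfo values are attribute values that carry structural information, while the alt value is an attribute value that a translator would normally need to see. Telling memoQ which is which is the purpose of the filter settings.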

When you finish

To confirm the settings, and return to the Document import options window: Click OK.

To return to the Document import options window without changing the filter settings: Click Cancel.

In the Document import options window: Click OK again to start importing the documents.

Result with the sample document

When some parts have been translated, the translation document should look like this:

[Image: xml-sample-translation]

Things to note in this screenshot:

  • The text Aug-04-2006(NOT TO BE TRANSLATED) is missing from the document because the updInfo attribute was designated as non-translated.
  • The text Diagram for illustration purposes appears as a separate segment, and alt="@2" in the img tag in segment 3 indicates that the translatable attribute's value can be found two segments lower in the translation document.

    Note: Translatable attributes are collected and stored during the import process of the document, and inserted in the translation document at the position where the current block of content ends – that is, at the next structural tag.

  • The opening tag ref was inserted in the target cell of segment 2 without the required attribute target, so memoQ shows a warning.
  • The placeholder tag img is missing from the target cell of segment 3, so memoQ shows a warning.
  • The entity '&copyright;' has been converted to '©' in segment 4.
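The entity conversion in the last item can be approximated outside memoQ. The sketch below is an illustration under stated assumptions, not memoQ's mechanism: it assumes that the &copyright; entity is declared in the document's internal DTD subset, and the document content ("2006 Example Corp.") is invented. With such a declaration, Python's standard XML parser replaces the entity with its character; without one, parsing fails.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: &copyright; is declared in the internal DTD subset.
WITH_DECLARATION = """<!DOCTYPE doc [
  <!ENTITY copyright "&#169;">
]>
<doc>&copyright; 2006 Example Corp.</doc>"""

# The same content without the declaration: the entity is unknown to the parser.
WITHOUT_DECLARATION = "<doc>&copyright; 2006 Example Corp.</doc>"


def parse_text(xml_text):
    """Return the root element's text, or None if the document cannot be parsed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


expanded_text = parse_text(WITH_DECLARATION)
failed_text = parse_text(WITHOUT_DECLARATION)
print(expanded_text)
print(failed_text)
```

This is why a custom entity needs a known replacement before a document can be imported cleanly: a parser that does not know the entity cannot tell which character it stands for.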