XML (eXtensible Markup Language) files
XML is the world's most versatile data format. XML files can hold text, structured data, even program code. Its greatest advantage is probably that XML files are perfectly readable for humans and machines alike.
The XML format defines a sort of markup that can encode just about everything that the human mind conceives to write down. Anything that looks like this can be XML:
Anything that looks like this can be XML:
<thing>
<part-of-thing looks="good">
Here is the part of thing
</part-of-thing>
</thing>
A portion enclosed like <thing>...</thing> is called an element.
The part that does the encoding ('<thing>', '</thing>', '<part-of-thing>' etc.) is called a tag. The '<thing>' at the top is an opening tag, and the '</thing>' at the bottom is a closing tag. There can be elements that have nothing inside. Those could be written as '<emptiness></emptiness>', but XML simplifies this by writing '<emptiness/>'. That is called an empty tag.
Inside a tag that starts an element, there can be parts that describe what the element is like. They are called attributes. In the example, 'looks="good"' is an attribute. An attribute has a name ('looks') and a value ("good").
There is a problem: In XML, text is mixed with markup. In the markup, two characters are very special: < and >. The text in the elements can't contain these characters. Instead, XML writes < (less than) for '<', and > (greater than) for '>'.
Anything in the XML text that looks like &(symbol); represents a single character. These are called character entities or just entities. But then another character, the ampersand ('&') became special. So, if you want to write a real ampersand in the text, you must write an entity called &.
Now, when you want to describe a particular type of document or data in XML, you must decide what elements, tags, attributes, and entities to use. When you're down to a particular document format that is based on XML, it means that you have chosen a well-defined set of tags in a well-defined structure. HTML is one of those examples. You can't just use any tag in HTML - there are rules of what tags can go where. If you use a different tag, unknown to HTML rules, that will make your document invalid.
The rules that describe what tags and attributes you can use in a type of XML document is called a document type definition. There are separate files that describe a type of document. It's either a document-type-definition file (DTD) or an XML schema.
In many cases, XML files contain text, and text needs translation.
Many well-known document types are XML in disguise. HTML is XML. Word documents, at least DOCX, are XML. XLIFF files are XML. TMX files, holding translation memory contents, are XML. There are many more, and these are all different "flavors" of XML. A certain flavor of XML is - surprisingly - called a document type.
memoQ knows how to process XML. But if the document you get is in a well-known flavor of XML (for example, HTML, DOCX, or XLIFF), memoQ usually has a specialized filter for that. What's more, memoQ will find the specialized filter for you when you open them - when you import a Word document, memoQ will never offer to use the XML filter. Despite the fact that memoQ will use the XML filter to read the bulk of the document.
Many authoring and database systems store - or export - contents in XML files. When you translate them in memoQ, use this XML filter to import them.
XML can be multilingual, use the Multilingual XML filter for them: There are XML files that contain the same text in several languages. To import those files, use the Multilingual XML filter.
XML configurations are complex, and it's worth saving them: It may take several hours to prepare memoQ to import everything properly. When you finish this work, make sure you click the Save button next to the Filter configuration drop-down box in the Document import settings window. Next time you need to import an XML file from the same source, you can just choose the saved configuration from the same Filter configuration drop-down box.
How to get here
- Start importing an XML file.
- In the Document import options window, select the XML files, and click Change filter and configuration.
- The Document import settings window appears. From the Filter dropdown, choose XML filter.
What can you do?
To import an XML file properly, you need to set up a lot of things. To make it easier, this topic uses the following example:
As with any XML document, it contains some “normal” text that will need to be translated, interspersed with tags like <doc> that hold descriptive or structural information primarily. Tags can have attributes, which have values (id="0527"). The following sections explain how these can be interpreted in memoQ.
Check if memoQ uses the correct character encoding to import the XML files. Normally, the header of the XML file states the encoding. If it's missing, memoQ normally uses the UTF-8 encoding. Check the Preview pane if everything appears correctly. If not, choose a different encoding in the Default encoding list.
This encoding is for the preview and the reference files only: You still need to set up the actual import and export encoding on the General tab.
memoQ can use the document type definition (DTD) or the XML schema (.xsd) to determine what tags and attributes can be present in the XML document. Without a DTD file or an XML scheme, memoQ reads one or more reference files to discover the tags and attributes. Normally, memoQ automatically adds all documents that you're importing as reference files. To add more files, click Add file.
Always use the schema or the DTD if you have them: If you have the DTD or the XML schema, pick it up: Next to the DTD/schema box, click Browse, and find the DTD or XSD file.
memoQ can automatically pick a filter configuration if there is a DTD or an XML schema: If you're using a DTD or an XML schema, click the General tab, and type the name (just the name) of the DTD file in the DTD or namespace URL box. But if you have a schema, the XSD file will contain an address to the namespace. Copy that from the file to the DTD or namespace URL box. Make sure you save the filter configuration. Next time you import an XML file that uses the same DTD or schema, memoQ will automatically load the same configuration because it was associated with the DTD or the schema.
Normally, the header of the XML file states the encoding.
Normally, memoQ tries to find out the encoding of the files you are importing. If the encoding isn't written in the XML, memoQ uses the UTF-8 encoding.
If you need to change this, click the General tab, and use the settings under Content.
If memoQ couldn't detect the encoding correctly: Clear the Detect encoding if possible check box. From the Input encoding if not detected dropdown, choose the encoding you need. You can check the encoding on the Encoding and reference files tab, in the Preview box.
If the input encoding is not Unicode, and the target language uses a different writing to the source: From the Output encoding dropdown, choose the encoding memoQ should use when it exports the translation. Normally, memoQ uses the same encoding as the source document, and it's perfectly normal if the input encoding is a form of Unicode (like UTF-8 is).
Normally, memoQ exports XML files with a byte-order mark at the beginning of the file. Some content management systems require this. Other systems ignore it. So, don't clear the Write BOM to Unicode-encoded files on export check box.
Normalize whitespace by default: Normally, memoQ converts sequences of tab, space or newline characters into a single space character. In addition, it trims whitespace sequences at the beginning and end of elements. Use this when the XML document only uses whitespace characters to improve readability. If the whitespace characters are required for formatting or structure, clear the Normalize whitespace by default check box.
Note: In the sample document, the text inside the <par> element contains newlines and spaces which do not hold important information – they only make the document easier to read by a human. On the other hand, these whitespace characters might be difficult to handle during translation. So, in this case, leave whitespace normalization turned on.
memoQ will not preserve spaces inside tags: Even if you clear Normalize whitespace by default, memoQ will remove any extra spaces it finds inside tags. For example, if the source document contains <br />, memoQ will always export <br/>, without the space.
Observe xml:space attribute in file: XML documents can contain attributes that tell whether or not whitespace should be normalized in a specific element. Normally, memoQ follows these instructions in the document. If you need to treat whitespace the same way across the document, clear the Observe xml:space attribute in file check box.
Change xml:lang attribute value at export: Normally, memoQ overwrites xml:lang attribute values with the actual target language's ISO code. If you need to keep the original values for this attribute, clear the Change xml:lang attribute value at export check box.
Break segments at newlines if whitespace is preserved: Check this check box if you want memoQ to start a new segment at every newline character. Text imported from XML files may contain newline characters only if you choose to preserve whitespace - in other words, if the Normalize whitespace by default check box is cleared, or when the xml:space attribute says so. When whitespace characters are preserved, newline characters are supposed to have a meaning in the text, and most of the time each line should be translated as a separate segment.
It's recommended to check the Break segments at newlines if whitespace is preserved check box if you clear the Normalize whitespace by default check box at the same time.
In some XML files, you need to translate the comments, too. Normally, memoQ doesn't import them.
If you need to change this, click the General tab.
To import the XML comments for translation: Check the Translate XML comments check box.
Normally, memoQ breaks segments whenever it encounters comments. If you need to translate the comments, memoQ imports them in a separate segment at position where they are in the text. Comments are segmented according to the segmentation rules in the project.
To import the XML comments as inline tags: In the XML comment handling drop-down box, choose Import as <mq:cmt> tag. memoQ will transform comments into special inline tags (mq:cmt). If you need to translate the comments, they will be translatable attributes of the mq:cmt inline tag. On the Tags and attributes tab, you need to mark these attributes translatable. In this case, the text of the comments is not segmented.
Don't use legacy memoQ {tags}: In the XML comment handling dropdown, don't choose Import as memoQ {tag}.
64-character preview if comments aren't translated: If you don't import comments for translation, the translation editor displays the first 64 characters as preview in the filtered and long tag views. The full text of the comments is saved with the XML file's skeleton, so that they can be exported back in the translated file.
Normally, memoQ gives you a standard preview of XML documents. This shows all the tags and the structure of the document.
But if the XML document represents formatted text, it's possible to get a better formatted preview.
Normally, XML doesn't say anything about the actual formatting. This information needs to be added to it from the outside. For that, you need an XML stylesheet or an XML stylesheet template (XSLT).
XSLT transforms the XML files into something else that can be nicely displayed. For example, if an XML is converted into HTML through XSLT, the result can be viewed in a web browser.
memoQ accepts XSLT files that transform XML files into HTML. You need to have this file ready; memoQ doesn't know how to create such files.
To learn how to make XSLT files, visit this web page: https://www.w3schools.com/xml/xsl_intro.asp. Fair warning: You need to be interested in programming to benefit from it.
Once you have the XSLT file, click the button next to the XSLT file box.
XSLT must produce HTML: The XML style sheet must create HTML output. If the style sheet emits a different format (plain text, RTF, or another XML), you cannot use it here.
To change the XSLT preview: Click Remove XSLT assignment. Then, if necessary, specify another XSLT file.
There are other settings that appear on the General tab (see the screenshot below).
-
XML files may contain processing instructions. They look like tags, but they start with the '<?' characters. They are run by the content management system storing the file, or else they are processed by another program that displays the file. Normally, memoQ displays these as uninterpreted formatting tags.
This is not recommended. To display processing instructions as inline tags: Check the Import processing instructions as inline tags check box. memoQ will display them as mq:pi inline tags.
-
There may also be custom entities (see details later on this page). These represent characters that are not defined in the XML standard, but are used by the content management system or the program that displays the file.
Normally, memoQ displays and exports these as characters, not as entity codes (&(entityname);).
To restore the entity codes for these characters when the translation is exported: Check the Restore custom entities in export check box. When custom entities are restored, '©' in the target text will be exported as '©right;'.
-
XML files can have various problems, and memoQ may have several problems when it imports XML files. To get a log of all the warnings: Check the Log warnings during import check box. memoQ will list the problems in a text file. If there are warnings, memoQ will ask you to save the log file in the same folder where the original document is.
An example of a log of import problems:
Normally, memoQ imports all text and other contents from XML files. If you need to control what is imported as text, or you need to exclude some of the contents from the translation, you need the options on the Tags and attributes tab.
You can set options for each tag and attribute. But before that, memoQ needs to read all the tags and attributes from the files you're importing.
To get the tags and attributes from the reference files: Click Populate.
To start over and choose a different set of files for previewing: Click Clear list. Go back to the Encoding and reference files tab. Choose different reference files. Click the Tags and attributes tab again. Click Populate.
In the Tags and attributes tab, 'tags' mean elements - specific tags and the content between them. For example, when you set up translation for the Abstract 'tag', that will belong to the content from the <Abstract> tag to the </Abstract> tag.
While you are setting up the tags and their attributes, you can preview them at the bottom of the Document import settings window.
Under Occurrences, memoQ lists the places where the selected tag occurs in one of the reference documents. Tags are highlighted in red, attributes are highlighted in green.
- File: In this dropdown, choose a reference file to view.
- Instance: Click a number to view one occurrence of the selected tag in the selected reference file.
To choose if certain tags are translated, or if they break segment, use the settings in the Tags section:
- Handled tags: This list has all the tags that were read from the XML files, as well as the ones you added.
Settings for each tag are abbreviated: In the Info column of the Handled tags list, you see abbreviations of tag settings. These are the following: Str stands for a structural tag (breaks segment); In for an inline tag (doesn't break segment, imported as an inline tag); NT for non-translated; and Req for required. You can set whitespace handling by tag, too: Inh means the tag inherits whitespace settings from its parent (the element that contains it), Pres means to preserve whitespaces, and Norm means to normalize whitespaces. Context handling and commenting options: Ctxt means that the content of the tag is imported as a context ID. Com means that the content of the tag is imported as a comment. All of these types and options are explained below.
To set these options for a tag, click the tag in the Handled tags list. Then use the check boxes to the right to make your choices.
- Inline: Check this check box to make the selected tag an inline tag. Inline tags are markup that are inside segments. They are displayed as inline tags. Other tools may call inline tags internal tags. If you don't check this check box, memoQ will handle the tag as a structural tag. Structural tags mean elements that are blocks of content for translation. Structural tags never appear inside text for translation: they always start a new segment. In other tools, structural tags are called external tags.
In the example: Specify the ref and img tags as inline because they appear inside sentences. All other tags should remain structural tags.
- Not translated: Check this check box exclude the selected tag from translation. These portions of text will not be imported for translation.
If an element isn't translated, neither are its children: If you make an element non-translated, other elements inside it will not be imported either. In the example, do not make the Body or the Main elements non-translated because then nothing will be imported.
- Required: Check this check box to make the selected tag required. Required tags are special inline tags that must be copied to the translation. If a required tag is missing from the translation, memoQ will display an error in the segment, and you won't be able to export the document.
- Whitespace handling: Use this drop-down box to choose how white space is handled in that element. Inherit means that the element will handle whitespace the same way as the parent element. The root element receives the default setting from the General tab. Preserve means that all white space will be retained and imported into the translation document. Normalize means that sequences of white space characters will be replaced with a single space character.
- Tag content is context ID for siblings: If you check this check box, memoQ uses the contents of the selected tag as the context identifier of the next segment imported from the elements that are at the same level in the hierarchy.
Note: This context ID will not be applied to all segments imported from the same level of the hierarchy. Instead, memoQ uses the context ID only for the next suitable segment. Other siblings will remain without a context ID.
- Tag content is comment for siblings: If this check this check box, memoQ uses the contents of the selected tag as a comment for the segment(s) that are imported from the elements at the same level in the hierarchy.
Note: The comment or the context ID is is used in one element only, the one that is after the element used as context or comment.
- : Click this button to remove the selected tag from the Handled tags list.
-
: Click this button to add the tag entered into the text box to the Handled tags list.
If a tag is missing from the list: When memoQ imports an XML file, it may find tags that are not listed in the configuration. These tags will be imported as structural and translatable. They will inherit the white space settings from the parent element. Their content will not be imported as a comment or a context identifier.
Attributes of tags can also be used as context identifiers or comments. But attributes may also contain text to be translated.
To choose what happens to attributes:
- Select a tag in the Handled tags list. Under Tag attributes, memoQ lists the attributes that belong to the selected tag.
- To change the settings of the selected attribute, use the check boxes to the right.
Attribute settings appear as abbreviations: In the Info column of the Tag attributes list, you see the following attributes: Tr stands for translatable; Req for required; and F for filtered. NX and NY mean conditions to import or omit the tag that has the attribute, CxC and CxS stand for context identifier options, while CmC and CmS are commenting options.
The xml:lang attribute is different: You cannot control how memoQ processes the xml:lang attribute – it happens automatically, based on the Change xml:lang attribute value at export setting. memoQ does not import the xml:lang attribute of inline tags, unless someone manually adds it. You can add the xml:lang attribute to inline tags, but you cannot specify any options there.
The options are:
- Translatable: Check this check box to make the selected attribute translatable. This means that the value of the attribute is imported as normal text.
- Required: Check this check box to make the selected attribute one that must be present in any tag inserted to the translation. A required attribute is not necessarily translatable: memoQ uses it for quality-checking and ensuring the well-formedness of the translation.
- Filtered: Check this check box to hide the selected when you switch to the Show filtered inline tags view in the translation editor.
- Context: Click this button to make the value of the selected attribute context information for the children or the siblings of the selected tag. After clicking this button, the Context settings for attribute window appears with a list of options and their explanation:
In the example: The id attribute of the par element may be used as a context identifier.
- Comment: Click this button to make the value of the selected attribute a comment for the children or the siblings of the selected tag. After clicking this button, the Comment settings for attribute window appears with a list of options and their explanation:
- : Click this button to remove the selected attribute from the Tag attributes list.
-
: Click this button to add the attribute entered into the text box on the left to the Tag attributes list.
If an attribute is missing: If an attribute is not listed in the configuration, memoQ treats it as non-translatable, not required and not filtered. These attributes will not be used for non-translation conditions, nor in context or comment processing.
You can import or ignore a tag, depending on the value of an attribute. In other words, you can set conditions for importing the contents of a tag (and all its children).
For example, a tag may have an attribute called 'translate'. The value can be either translate="no" or translate="yes". You don't want to import the tags where it says translate="no".
To do this:
- Click the General tab. To get the list of tags from the reference documents if needed: Click the Populate button.
- Select a tag that you want to import conditionally.
- In the Tag attributes list, select the tag that you want to use as condition.
- On the right, click Non-translation to set up the condition.
Under Values, list the values that you want to check for. If the tag has one of the values that you list, memoQ will import or ignore the tag contents, depending on the condition you set up at the top.
- To list the values: Type a value in the text box at the bottom. Click the button. If you need to test for several values, repeat this.
If you want to check if the translate attribute is set to no, type "no" at the bottom, and click .
- To ignore the tag if the attribute has one of the values listed: Click the Do not import if radio button. In the example, you would click this, so that memoQ omits the tags where translate="no" is set.
- To import the tag if the attribute has one of the values listed: Click the Import only if radio button. In the example, you would add yes to the list (instead of no), and click this radio button. This causes memoQ to import the tags where translate="yes" is set.
If the attribute is missing: If you check the Disable rule if attribute is missing check box, memoQ will act as if you selected No condition. If you check the Also if attribute is missing check box, memoQ will act as if the attribute had an empty value.
XML documents contain entities. These are characters that can't be included in the text as they are: either because they are special characters in the XML syntax, or they don't fit in the encoding of the document.
In XML text, entities look like '&(entity-code);'.
There are some standard entity groups that the XML standard recognizes. You can choose one or more of them.
In addition, you can set up custom entities that are specific to the documents you are importing.
On this tab, if you tell memoQ to 'handle' an entity, that means that memoQ imports the entity as a normal character. That's how you will see it in the translation editor. But when memoQ exports the translation, these characters are exported as entity codes again.
To deal with these, use the settings on the Entities tab:
Here are the settings you can use:
- Entity groups: In this list, you can select standard groups of entities which should be converted during import. XML Predefined entities (&, <, >, ", and ') are always handled.
- Custom entities: In this list, you can specify any non-standard entities that are specific to your document type. Custom entities can be handled in the translation editor as inline tags, memoQ formatting tags, or “normal” Unicode characters. You choose which through three radio buttons under Entity behavior. To add a new entity to the list, type it in the Entity box. To change the settings of an existing entity, select the entity in the Custom entities list. In the sample document, there is one custom entity, ©right;, which should be converted to © for translation.
- Add/change: Click this button to add a custom entity to the Custom entities list. If you are modifying an existing custom entity, this saves your changes.
Note: In the first field under the Custom entities list, you can enter the entity appearing in your document between & and ;. Using the radio buttons, you can select whether this entity should be treated as a character or as a memoQ tag. If the entity should appear on the translation grid as a character, enter its Unicode code into the second field or enter the character into the third field.
- Remove: Click this button to remove the selected custom entity from the Custom entities list.
- Populate from files: Click this button to extract all custom entities that occur in any of the reference files. After clicking Populate from files, all custom entities appear in the Custom entities list.
When you finish
To confirm the settings, and return to the Document import options window: Click OK.
To return the Document import options window, and not change the filter settings: Click Cancel.
In the Document import options window: Click OK again to start importing the documents.
Result with the sample document
When some parts have been translated, the translation document should look like this:
Things to mark on this screenshot:
- The text Aug-04-2006(NOT TO BE TRANSLATED) is missing from the document because the updInfo attribute was designated as non-translated.
-
The text Diagram for illustration purposes appears as a separate segment, and alt="@2" in the img tag in segment 3 indicates that the translatable attribute's value can be found two segments lower in the translation document.
Note: Translatable attributes are collected and stored during the import process of the document, and inserted in the translation document at the position where the current block of content ends – that is, at the next structural tag.
- The opening tag ref was inserted in the target cell of segment 2 without the required attribute target, so memoQ shows a warning.
- The placeholder tag img is missing from the target cell of segment 3, so memoQ shows a warning.
- The entity '©right;' has been converted to '©' in segment 4.