memoQ Help - Regex text filter

Using the Regex text filter, you can instruct memoQ to process structured text files and extract translatable content from them. memoQ can also extract context and comments for the imported content. You control the Regex text filter mainly through regular expressions.

The Regex text filter processes structured text files in three steps:

1. It breaks up the files into paragraphs.

2. It extracts the paragraphs that contain translatable text.

3. From the extracted paragraphs, it extracts translatable text and, optionally, context and comments.

The options of the filter follow these three steps: first, you specify how paragraphs are separated; second, you specify what an imported paragraph should look like; third, you list the parts that actually need to be translated. This procedure requires writing regular expressions, which often takes some trial and error. Before you import the file, you can always click the Preview tab to see what will be imported.
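The three steps can be sketched outside memoQ as follows. This is only an illustration, not memoQ's implementation: it uses Python's re module as a stand-in for .NET regular expressions, and the function name extract, the sample text, and the rule are all hypothetical.

```python
import re

def extract(text, paragraph_rule, group=1):
    """Sketch of the three steps: split, select, extract."""
    # Step 1: break the file into paragraphs (here: one line = one paragraph).
    paragraphs = text.splitlines()
    results = []
    for paragraph in paragraphs:
        # Step 2: keep only paragraphs that the rule matches in full.
        match = re.fullmatch(paragraph_rule, paragraph)
        if match is None:
            continue
        # Step 3: take the translatable part from a content group.
        results.append(match.group(group))
    return results

# Hypothetical structured text: key = "value" lines plus a comment line.
sample = 'title = "Main menu"\n# internal note\nexit = "Quit"'
rule = r'\w+ = "(.*)"'

print(extract(sample, rule))  # ['Main menu', 'Quit']
```

The comment line is skipped because the rule does not match it, which mirrors how only matching paragraphs are imported.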

See also: You use regular expressions to describe patterns that paragraphs or their parts must match. memoQ uses the Microsoft .NET regular expression syntax. For a general description of .NET regular expressions, see the Microsoft documentation. For examples of using regular expressions in memoQ, see this help topic.

How to begin

In the Translations pane of Project home, choose Import > Import with options on the Documents ribbon tab. In the Open dialog, select All files from the Files of type drop-down list. Click Open to proceed: the Document import settings dialog appears. From the Filter drop-down list, choose Regex text filter.

Note: If you have received pre-defined regular expression settings from another user, or there are such filter configurations available on a memoQ server in your reach, you can select the filter configuration from the Filter configuration drop-down list. In this case, it may be unnecessary to change the settings in the dialog.

General settings: Codepage, paragraph separation, and reference files

In the General tab, you can set the import and export code page for the document. You can also specify how paragraphs are separated, and you can add reference files that memoQ uses to show the preview in the Preview tab.

In the Codepage and newline section, you can set the import and export codepage:

•Import codepage drop-down list: Select the encoding of the source file. The default setting is Unicode (UTF-8), but your actual file might be different. You might need to look at the file first in a plain-text editor. However, if the file starts with a so-called byte order mark (BOM), memoQ can use that to detect the encoding. Check the Override this if Unicode encoding can be detected from BOM check box (checked by default) if you want to allow memoQ to do this.

•Export codepage drop-down list: Select what encoding memoQ should use when it exports the translated document. By default, memoQ will use the same encoding, but you might need to choose a different one if, for example, the source encoding is not Unicode, and you are translating from French into Japanese.

•Newline type drop-down list: Select what sort of newline memoQ should look for in the original file. Normally, memoQ detects all sorts of newlines (Windows, Linux/Unix, and Mac are using different ones), but you might need to select a specific type, so that not all character sequences that look like a newline are imported as such.

Text editors may not be able to detect the encoding of the exported file: Always take note of the export codepage or the export encoding because text editors may open the exported file incorrectly at first. This happens because in many cases, the encoding of a plain-text file cannot be easily detected. In this case, you can manually set the encoding in the text editor, provided that you took note of it previously.
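To make the BOM check described above concrete, here is a minimal sketch of how an encoding can be detected from a byte order mark. It is an assumption about the general technique, not memoQ's actual code; the function name detect_encoding and the fallback parameter are hypothetical.

```python
import codecs

# Known byte order marks. UTF-32 comes first because its little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Return the encoding implied by a BOM, or the fallback if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: the encoding cannot be detected this way, so use the chosen codepage.
    return fallback

with_bom = codecs.BOM_UTF16_LE + "menu".encode("utf-16-le")
without_bom = "menu".encode("cp1252")

print(detect_encoding(with_bom))               # utf-16-le
print(detect_encoding(without_bom, "cp1252"))  # cp1252
```

A file without a BOM falls back to the selected codepage, which is why it is worth noting the export codepage.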

In the Paragraph separator section, you can control how memoQ separates paragraphs from each other:

•Newline radio button: Select this if one line in the file corresponds to a paragraph. In most cases, structured text files are like this.

•Empty line radio button: In some structured text file formats (such as LaTeX), paragraphs consist of multiple lines, and end in an empty line. Select this radio button if you are dealing with a file of this sort.

•Line with whitespace only radio button: Select this if paragraphs can consist of multiple lines, and end in an empty line, but that empty line may contain whitespace characters (spaces, tabs).

•Custom regex radio button: Select this if paragraphs are not separated by newline characters or empty lines. Instead, you can specify a regular expression that marks the end and the start of a paragraph. If you select this radio button, you also need to write regular expressions in the Paragraph end and the Paragraph start text boxes. Make sure the regular expression specifically matches the end and the start of the paragraph. The Paragraph end text box must not contain patterns that overlap the start of the next paragraph, and the Paragraph start text box must not contain patterns that overlap with the previous paragraph.
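The difference between the first three separator options can be illustrated with a short sketch. It uses Python's re module rather than .NET, and the function split_paragraphs and its mode names are hypothetical; it only approximates how each option might divide a file.

```python
import re

NEWLINE = r"(?:\r\n|\r|\n)"  # Windows, classic Mac, and Linux/Unix newlines

def split_paragraphs(text, mode):
    """Split text into paragraphs according to one of three separator modes."""
    if mode == "newline":
        # Every line is its own paragraph.
        pattern = NEWLINE
    elif mode == "empty_line":
        # Paragraphs end at a completely empty line.
        pattern = NEWLINE + r"{2,}"
    elif mode == "whitespace_line":
        # Paragraphs end at a line that is empty or holds only spaces and tabs.
        pattern = NEWLINE + r"(?:[ \t]*" + NEWLINE + r")+"
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Drop empty strings left over from splitting.
    return [part for part in re.split(pattern, text) if part]

text = "a\nb\n\nc\n \t\nd"

print(split_paragraphs(text, "empty_line"))       # ['a\nb', 'c\n \t\nd']
print(split_paragraphs(text, "whitespace_line"))  # ['a\nb', 'c', 'd']
```

With the empty-line mode, the line that contains only a space and a tab does not end a paragraph; with the whitespace-only mode, it does.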

In the Reference files section, you can add files that memoQ displays in the Preview tab. The files that you selected for import are added automatically.

•Click Add file to add a new file to the list.

•To remove a file from the list, click the name of the file, then click Remove selected.

Paragraph settings: what does an imported paragraph look like?

In the Paragraph tab, you can specify regular expression rules. Each rule should match an entire paragraph (that is, you need to write regular expressions that cover an entire paragraph). If a rule matches a paragraph, memoQ will import text from it. In the Paragraph tab, you can also specify what part of the paragraph is imported.

If you do not specify rules on this tab, memoQ will import all paragraphs for translation.

Use the Paragraph rules table to list the regular expressions that match entire paragraphs:

•To add a new regular expression, write it in the Rule text box, and click Add.

•To change an existing regular expression in the list, click the rule in the table, make changes in the Rule text box, and then click Change.

•To remove a regular expression from the table, click the rule, and click Delete.

•To move a rule up or down in the list, select it, and click Up or Down. This can be useful if two patterns match the same paragraph, but their content groups are different. In this case, the order in which the rules are processed is important.

Note: When writing the regular expressions, use parentheses () to create content groups in them. Enclosing a pattern in parentheses defines a content group, and later on, you can refer to each group by its number ($0, $1, and so on).
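The following sketch shows why rule order can matter when two rules match the same paragraph. It assumes that the first matching rule in the list is the one applied; the source does not state this explicitly, so treat it as an assumption. The rules, sample paragraphs, and the function first_matching_rule are hypothetical, and Python's re module stands in for .NET regular expressions.

```python
import re

# Two hypothetical rules. Both match a paragraph such as: msgid "Hello"
rules = [
    r'msgid "(.*)"',   # specific rule: group 1 is the text
    r'(\w+) "(.*)"',   # general rule: group 1 is the keyword, group 2 the text
]

def first_matching_rule(paragraph, rules):
    """Return (rule index, match) for the first rule that covers the whole paragraph."""
    for index, rule in enumerate(rules):
        match = re.fullmatch(rule, paragraph)
        if match is not None:
            return index, match
    return None

index, match = first_matching_rule('msgid "Hello"', rules)
print(index, match.groups())   # 0 ('Hello',)

index, match = first_matching_rule('msgstr "Bonjour"', rules)
print(index, match.groups())   # 1 ('msgstr', 'Bonjour')
```

Under this assumption, swapping the two rules would make the general rule apply to the first paragraph as well, so the translatable text would move from group 1 to group 2.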

When you select a rule in the upper table, you can use the settings in the Effect of selected rule section to determine what should happen to the paragraph. In this section, you can list content groups from the selected rule. A content group is a part of the paragraph pattern that is enclosed in parentheses () in the regular expression. You can refer to a content group by its number: $0, $1, $2 etc.

memoQ will import text for translation from the content groups you specify.

•To add a content group to the list, type its number in the Content group text box, and click Add. If you need context and comments for the content group, write that information in the Context and Comment boxes before clicking Add. This can be constant text, but you can also use content group references ($0, $1 etc.) there.

•To change the settings for a content group, click the content group in the list, make changes in the Content group, Context, and Comment text boxes, and then click Change.

•To remove a content group from the list, click it in the list, and click Delete.
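As an illustration of how a content group entry with context and comment might be resolved, the sketch below expands $n references against a matched paragraph. It is not memoQ's implementation: the helper names expand and import_entry, the rule, and the sample paragraph are hypothetical, and Python's re module is used in place of .NET regular expressions.

```python
import re

def expand(template, match):
    """Replace $0, $1, ... in the template with the text of that group."""
    def lookup(ref):
        # Groups that did not take part in the match expand to an empty string.
        return match.group(int(ref.group(1))) or ""
    return re.sub(r"\$(\d+)", lookup, template)

def import_entry(paragraph, rule, content_group, context="", comment=""):
    """Return text, context, and comment if the rule matches the whole paragraph."""
    match = re.fullmatch(rule, paragraph)
    if match is None:
        return None
    return {
        "text": expand(content_group, match),
        "context": expand(context, match),
        "comment": expand(comment, match),
    }

# Hypothetical paragraph format: key = "text" ; optional comment
rule = r'(\w+)\s*=\s*"(.*)"\s*(?:;\s*(.*))?'
paragraph = 'btn_ok = "OK" ; label on the confirm button'

entry = import_entry(paragraph, rule, "$2", context="$1", comment="$3")
print(entry)
# {'text': 'OK', 'context': 'btn_ok', 'comment': 'label on the confirm button'}
```

Here $2 supplies the translatable text, while $1 and $3 are reused as context and comment; constant text could be used in either field instead.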

Include/Exclude settings: Specifying what needs to be imported from inside a paragraph

Preview: Seeing what is imported and what is not