Edit segmentation rule set
When memoQ imports a document, it splits the document into segments. Normally, a segment is more or less the same as a sentence. But you can use completely different segments, too.
A segmentation rule set tells memoQ how to split a document into segments.
A segmentation rule set specifies two things:
- The punctuation marks that end a segment.
- Exceptions, when the same punctuation marks don't end a segment after all. These exceptions are mostly abbreviations. For example, memoQ mustn't - and won't - start a new segment after Mr. in 'Mr. Smith'. From the translation editor, you can add new abbreviations to the segmentation rule set that you use in your project.
In fact - deeper down - memoQ uses regular expressions to spot the places where a new segment should start.
In this window, you can edit a segmentation rule set.
Belongs to a project: You choose a segmentation rule set for a project. To make your choice, open a project. In Project home, choose Settings. Click the Segmentation rules icon. (This icon has a pair of scissors in it.) Check the check box of the segmentation rule set you want to use.
Language does matter: A segmentation rule set always belongs to a language. memoQ can use a segmentation rule set if its main language is the same as the source language of the project.
How to get here
Open the Resource Console. Choose Segmentation rules. In the list, click the segmentation rule set you want to edit. Under the list, click Edit.
From a project: Open a project. In Project home, choose Settings. In the Settings pane, click the Segmentation rules icon. (This icon has a pair of scissors in it.) In the list, click the segmentation rule set you want to edit. Under the list, click Edit.
From an online project: Open an online project for management. In the memoQ online project window, choose Settings. In the Settings pane, click the Segmentation rules icon. (This icon has a pair of scissors in it.) In the list, click the segmentation rule set you want to edit. Under the list, click Edit.
Can't edit a default segmentation rule set: There is a default segmentation rule set for every source language. To change the default settings, you must clone (copy) them first. To clone a segmentation rule set: Select it in the list. In Resource Console, click Clone. In a project, click Clone/use new. memoQ makes a copy of the segmentation rule set. If you're in a project, memoQ also starts using it. Then you can select the clone and click Edit.
What can you do?
Normally, a segmentation rule set opens in the simple view.
Some segmentation rule sets appear in the advanced view: If a segmentation rule set has custom lists, you can no longer edit it in the simple view. To learn more about the advanced view, see the next section.
- Choose what symbols end a segment: If necessary, make changes in the Segment end box.
- Specify brackets and quotes: If necessary, change the Left bracket and Right bracket boxes. If there's an entire segment between these, the brackets and quotes become part of the segment.
- If the dot is after an abbreviation, the segment doesn't end: List the abbreviations in the Abbreviation box.
Add abbreviations from the translation editor: If you are using a segmentation rule set that still has the simple structure, you can add abbreviations directly from the translation editor. To do that: In the Edit ribbon, click Add abbreviation.
- Proper names that start with a lowercase letter can be at the beginning of a new segment: List these in the Proper name box.
- Some abbreviations can be only before numbers: List them in the Abbreviation before number box. memoQ will start a new segment after these - but not if they are followed by a number.
Abbreviations are case sensitive: Normally, memoQ recognizes an abbreviation only if its uppercase and lowercase letters are the same in the text as in the list. To make matching case-insensitive, clear the Abbreviations are case sensitive check box.
If you want these lists alphabetically: To sort these lists of abbreviations and proper names alphabetically: Check the Order lists alphabetically check box. If you don't check this check box, the items appear in the order they were entered in the segmentation rule set.
Always check the preview: memoQ shows the effect of the segmentation rule set in the Preview box on the right. When there's a new segment, it's displayed in a new line. You can edit the text. To do that, click Edit sample text.
In the advanced view, you can write regular expressions that match the end of one segment and the start of the next one.
If the segmentation rule set appears in the simple view: At the bottom, click Advanced view.
Under Rules, there are regular expressions that match the end of a segment, and the beginning of a new one. For example, you can write a regular expression that matches a dot, a space, and a capital letter. These regular expressions have an extra feature: You need to insert an exclamation mark between hash marks (#!#) where the previous segment ends and the next one starts.
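As a rough illustration of how the #!# marker divides a rule, the following Python sketch splits a rule at the marker and treats the left part as the end of one segment and the right part as the start of the next. This is not memoQ's own implementation: the function names are hypothetical, and because Python's re module does not support \p{Lu}, the sample rule uses [A-Z] to stand for a capital letter.

```python
import re

MARKER = "#!#"  # shows where one segment ends and the next one starts


def break_positions(rule: str, text: str) -> list[int]:
    """Return the text offsets where the rule would start a new segment.

    Assumes the rule contains exactly one #!# marker; otherwise the
    unpacking below raises ValueError.
    """
    before, after = rule.split(MARKER)
    # The part before the marker is consumed; the part after it is only
    # looked at, so the break falls exactly at the end of the match.
    pattern = re.compile("(?:" + before + ")(?=" + after + ")")
    return [m.end() for m in pattern.finditer(text)]


def segment(rule: str, text: str) -> list[str]:
    """Cut the text at every break position and trim surrounding spaces."""
    segments = []
    start = 0
    for pos in break_positions(rule, text):
        segments.append(text[start:pos].strip())
        start = pos
    segments.append(text[start:].strip())
    return segments


# A dot, then the boundary, then whitespace and a capital letter.
rule = r"\.#!#\s+[A-Z]"
print(segment(rule, "It rains. We stay home. Fine."))
```

With this sample rule, the text is cut after each dot that is followed by whitespace and a capital letter, and the final dot does not trigger a cut because nothing follows it.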
Need to break segments at hash marks? In some documents, a # character shows the end of segments. memoQ does not accept # characters in regex-based segmentation rules. Type \x23 or \u0023 instead.
If you need assistance, open the Regex Assistant: Click the icon on the right, and create a regex, or choose one from the regex library. Then click the Insert regex button. memoQ inserts your regex into the text boxes as needed.
- To add a new segmentation rule: In the box below the list, type the regular expression. Don't forget the exclamation mark (#!#) that shows where the two segments are separated. Click Add.
- To change an existing segmentation rule: Click it in the list. The regular expression appears in the box below the list. Make changes there. Don't forget the exclamation mark (#!#) that shows where the two segments are separated. Click Change.
- To delete an existing segmentation rule: Click it in the list. Click Delete.
Under Exceptions, you can list exceptions. Each rule has its own list of exceptions. For example, you may start a new segment if there's a dot, followed by a space and a capital letter - but not if the dot is after a known abbreviation.
The exceptions are regular expressions similar to the rules (even the exclamation marks (#!#) are there). They must also match the neighborhood of the segment boundary. memoQ follows this logic: The text is cut if the rule matches but none of the exceptions do. Then again, memoQ doesn't start a new segment if the rule matches - but at least one of the exceptions also does.
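The logic above (cut if the rule matches, unless at least one exception matches at the same boundary) can be sketched in Python as follows. This is a minimal model under stated assumptions, not memoQ's actual engine: the function names are hypothetical, an exception is assumed to apply when its #!# marker lines up with the rule's boundary, and [A-Z] stands in for \p{Lu} because Python's re module does not support Unicode category escapes.

```python
import re

MARKER = "#!#"


def rule_breaks(rule: str, text: str) -> list[int]:
    """Offsets where the rule alone would start a new segment."""
    before, after = rule.split(MARKER)
    pattern = re.compile("(?:" + before + ")(?=" + after + ")")
    return [m.end() for m in pattern.finditer(text)]


def exception_applies(exception: str, text: str, pos: int) -> bool:
    """True if the exception matches around the boundary at offset pos."""
    before, after = exception.split(MARKER)
    # The left part must end exactly at the boundary...
    ends_here = re.search("(?:" + before + r")\Z", text[:pos]) is not None
    # ...and the right part must start exactly at the boundary.
    starts_here = re.compile(after).match(text, pos) is not None
    return ends_here and starts_here


def segment(rule: str, exceptions: list[str], text: str) -> list[str]:
    """Cut where the rule matches and none of the exceptions do."""
    cuts = [
        pos
        for pos in rule_breaks(rule, text)
        if not any(exception_applies(e, text, pos) for e in exceptions)
    ]
    segments, start = [], 0
    for pos in cuts:
        segments.append(text[start:pos].strip())
        start = pos
    segments.append(text[start:].strip())
    return segments


rule = r"\.#!#\s+[A-Z]"          # dot, boundary, whitespace, capital letter
exceptions = [r"\bMr\.#!#\s"]    # hypothetical exception for "Mr."
print(segment(rule, exceptions, "Mr. Smith arrived. He sat down."))
```

Without the exception, the same rule would also cut after "Mr.", which is the behavior the exception list is meant to prevent.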
Before you add exceptions: Under Rules, click the rule that you're adding exceptions to.
- To add a new exception: In the box below the Exceptions list, type the regular expression. Don't forget the exclamation mark (#!#) that shows where the two segments would be separated - if this weren't an exception. Click Add.
- To change an existing exception: Click it in the Exceptions list. The regular expression appears in the box below the list. Make changes there. Don't forget the exclamation mark (#!#) that shows where the two segments would be separated - if this weren't an exception. Click Change.
- To delete an existing exception: Click it in the Exceptions list. Click Delete.
Always check the preview: memoQ shows the effect of the segmentation rule set in the Preview box on the right. When there's a new segment, it's displayed in a new line. You can edit the text. To do that, click Edit sample text.
Custom lists are variable parts in a regular expression. Use them where you would need to list things.
For example, you may want to start a new segment after a period, a question mark, or an exclamation mark.
Without custom lists, you would need a separate regular expression for each symbol:
\.#!#[\s]+\p{Lu}
\?#!#[\s]+\p{Lu}
\!#!#[\s]+\p{Lu}
If you write a custom list of segment-ending symbols, you need just one regular expression:
#end##!#[\s]+\p{Lu}
To do this, use the Custom lists tab.
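One way to picture a custom list is as a placeholder that gets replaced by an alternation of its items before the rule is evaluated. The Python sketch below shows that idea only. It assumes that list items are literal text (so they are escaped) and that list names consist of word characters between # characters; the function name is hypothetical, and memoQ may process custom lists differently.

```python
import re


def expand_custom_lists(rule: str, custom_lists: dict[str, list[str]]) -> str:
    """Replace each #name# placeholder with an alternation of its items.

    The #!# boundary marker is left alone because "!" is not a word
    character, so it never matches the placeholder pattern.
    """

    def replace(match: re.Match) -> str:
        items = custom_lists[match.group(0)]
        # Longer items go first so that a short item cannot shadow a
        # longer one that starts the same way.
        ordered = sorted(items, key=len, reverse=True)
        return "(?:" + "|".join(re.escape(item) for item in ordered) + ")"

    return re.sub(r"#\w+#", replace, rule)


custom_lists = {"#end#": [".", "?", "!"]}
print(expand_custom_lists(r"#end##!#[\s]+\p{Lu}", custom_lists))
```

The expanded rule is only printed here, not compiled, because Python's re module does not accept \p{Lu}.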
If you need assistance, open the Regex Assistant: Click the icon on the right, and create a regex, or choose one from the regex library. Then click the Insert regex button. memoQ inserts your regex into the text boxes as needed.
For example, you would write custom lists to tell memoQ about abbreviations that end in a period but do not mark the end of a sentence, so memoQ should not cut the segment after them.
If you delete default lists, memoQ won't be able to add abbreviations from the translation editor: memoQ allows adding abbreviations directly from the translation editor. This doesn't work if you delete any of the default custom lists from a segmentation rule set. If you still want to add abbreviations from the translation editor, don't delete default custom lists. You can add new ones.
To add a custom list:
- In the box at the bottom left, type a name for the custom list. Make sure you enclose it in # characters.
- Click Add.
- To rename a custom list: Click it under Custom lists. The name appears in the box at the bottom left. Edit it as needed. Click Change.
- To delete a custom list: Click it under Custom lists. Click Delete. memoQ deletes the list items, too.
- Add items to the list: In the box at the bottom right, type an item (in the example above, it would be a punctuation mark). Click Add. Repeat this until you have all the items.
- To change a list item: Click it under List items. In the box at the bottom right, make changes. Click Change.
- To delete a list item: Click it under List items. Click Delete.
A custom list can't be empty: When you add a new custom list, the entire List items box is orange. This warns you that you must add at least one item. You can't save the segmentation rule set if one of the custom lists is empty.
To edit the items in a custom list: Click it under Custom lists. The items appear under List items. Add, edit, or delete items there.
For example, the default segmentation rule set for English contains these custom lists:
- #cap#: All capital letters of the English alphabet.
- #end#: Punctuation marks at the end of a segment. If you want to segment after a semicolon, add the semicolon to this list. In the box below List items, type a semicolon. Click Add.
- #abbr#: Common abbreviations in English. memoQ won't start a new segment after an abbreviation that's on this list.
- #abbr_num#: Common abbreviations in English that are used before numbers. When memoQ encounters an abbreviation from this list, it checks if there's a number after the abbreviation. If there is, memoQ won't start a new segment.
memoQ recognizes numbers without a custom list: You don't need to include numbers, dates etc. on a custom list because memoQ has an internal algorithm to recognize these.
- #properNames#: Common proper names that begin with a lowercase letter. If memoQ encounters an ending punctuation mark, a space, and then a proper name, it will start a new segment - despite the lowercase letter at the beginning of the new segment.
- #lpar#: Common opening quotes or brackets. Ending punctuation, followed by a space and then an opening quote or bracket marks the start of a new segment.
- #rpar#: Common closing quotes or brackets. A sequence of ending punctuation marks, followed by a closing quote or bracket marks the end of a segment.
memoQ can import segmentation rules from other translation tools through Segmentation Rule eXchange (SRX) files. memoQ can also export its own segmentation rules in SRX files, to be used in other tools.
SRX export may lose information because standard SRX can't hold all segmentation information from memoQ: memoQ's segmentation exceptions are more sophisticated than those allowed in SRX. You have two options to export SRX: Detailed export can be used in other copies of memoQ (you don't need that because you can export the resource from Resource Console). Optimized export can be used in other translation tools.
To import and export segmentation rules from and to SRX files: Use the two buttons at the top of the Segmentation tab.
- Import SRX: Click this to import segmentation rules from an SRX file into the current rule set.
- Export SRX: Click this to export the current segmentation rule set to an SRX file. The SRX export settings window opens.
Click the Optimized export radio button. This will export the segmentation rules for a different translation tool. You don't need the detailed export. To use the segmentation rule set in another copy of memoQ, export it from the Resource Console as an mqrsrc file.
Click OK. A Save window opens. Choose a folder and a name for the SRX file.
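To check what an exported file contains before you load it into another tool, you can read it with a short script. The sketch below parses a small hand-written SRX 2.0 sample with Python's standard library and lists its rules. The sample content is hypothetical and is not taken from an actual memoQ export, which may be structured differently; the function name is also an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by SRX 2.0 documents.
SRX_NS = {"srx": "http://www.lisa.org/srx20"}

# Hypothetical sample: one "no break" rule and one "break" rule.
SAMPLE_SRX = r"""
<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\bMr\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]</beforebreak>
          <afterbreak>\s+[A-Z]</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="en.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>
"""


def list_rules(srx_text: str) -> list[tuple[str, bool, str, str]]:
    """Return (language rule name, breaks?, before, after) for each rule."""
    root = ET.fromstring(srx_text.strip())
    rules = []
    for language in root.iterfind(".//srx:languagerule", SRX_NS):
        name = language.get("languagerulename")
        for rule in language.iterfind("srx:rule", SRX_NS):
            # In SRX, a rule breaks unless break="no" is given.
            breaks = rule.get("break", "yes") == "yes"
            before = rule.findtext("srx:beforebreak", default="", namespaces=SRX_NS)
            after = rule.findtext("srx:afterbreak", default="", namespaces=SRX_NS)
            rules.append((name, breaks, before, after))
    return rules


for name, breaks, before, after in list_rules(SAMPLE_SRX):
    print(name, "break" if breaks else "no break", before, after)
```

In SRX, the text before and after the boundary is stored in separate beforebreak and afterbreak elements, and exceptions are expressed as rules with break="no".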
When you finish
To save the changes, and return to Resource Console, to Project home, or to memoQ online project: Click OK.
Add the edited segmentation rule set to your project before you import documents: Create a project from a template that specifies this rule. Or, create an empty project (local or online). In Project home (or in memoQ online project), choose Settings. Click the Segmentation rules icon. In the list, check the check box of this rule. Then you can start importing documents.
To return to Resource Console, to Project home, or to memoQ online project, and not save changes: Click Cancel.
To add abbreviations to the segmentation rule set while you are translating a document: On the Edit ribbon, click Add abbreviation. You can't do this if the current segmentation rule set doesn't contain all of the default custom lists.
To find all possible abbreviations in a document or the open documents: In Project home, select the documents. Or, open a document for translation. On the Preparation ribbon, click Find abbreviations. You can't do this if the current segmentation rule set doesn't contain all of the default custom lists.