Extract candidates
memoQ can extract possible terms from documents, translation memories, and LiveDocs corpora.
You may need this when you prepare a project for translation, or when you need to build a term base as part of a project.
In this window, you can tell memoQ how to extract possible terms from your source documents, LiveDocs corpora, or translation memories.
memoQ processes the text and gives you a list of candidates, that is, possible terms. The list may contain a lot of noise: You may need to clean it up, filter it, and edit it, and confirm the "true" terms before you can add them to a term base. After the extraction runs, memoQ opens the candidate list editor, where you can do all this.
You can also make the list of candidates available for lookup as a term base.
You need a local project to run term extraction.
How to get here
- Create or open a project.
Add text before you run term extraction: the project needs to have the text to process. The text can be in project documents, translation memories, or LiveDocs corpora.
- In the project, import the documents, or add the translation memories and LiveDocs corpora you need.
memoQ can use existing term bases to help with term extraction: Before you run term extraction, add those term bases to the project, too.
- On the Preparation ribbon, click the Extract Terms icon. The Extract candidates window opens.
If this is not the first time you run term extraction in this project: The Extract terms window opens first. If you're sure you need to run a new round of term extraction, click Start new session. To learn more, see the Help page about the Extract terms window.
What can you do?
When you run term extraction, memoQ creates a term extraction session in your project. You need a session because memoQ needs to save the list of candidates, so that you can edit them and return to them if necessary. At first, the candidate list contains a lot of noise (irrelevant expressions or even non-words). Most target-language equivalents will be missing, too. You need to clean up and edit the candidate list before you can add the confirmed terms to a term base.
memoQ saves the list of candidates in the project. You can leave it or you can return to it whenever you need to.
At the top, in the Session name box, type a name for the session. Normally, memoQ uses the current date. But you can type any other name instead.
Automatic numbering: If you use the date as the session name and start another session on the same day, memoQ adds a number after the date: (1) for the second session, (2) for the third, and so on.
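The numbering rule above can be sketched in a few lines of Python. This is only an illustration, not memoQ's code: the function name next_session_name, the ISO date format, and the space before the number are assumptions made for the example.

```python
from datetime import date


def next_session_name(existing_names, base=None):
    """Return a session name that does not clash with existing ones.

    Hypothetical helper: the date format and the spacing before the
    number are assumptions, not memoQ's documented behavior.
    """
    base = base or date.today().isoformat()
    if base not in existing_names:
        return base  # first session of the day: plain date
    n = 1
    # Second session gets (1), third gets (2), and so on.
    while f"{base} ({n})" in existing_names:
        n += 1
    return f"{base} ({n})"
```

For example, if a session called 2024-05-01 already exists, next_session_name(["2024-05-01"], "2024-05-01") returns "2024-05-01 (1)".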
Select the materials that memoQ will process to get the candidates. You have the following choices:
- Translation documents: Normally, memoQ will run term extraction on the documents you imported into the project. If you just want to process translation memories or LiveDocs corpora, clear this check box. If there are no documents in the project, this check box is greyed out.
- Every document radio button: Click this to process all documents in your project. Normally, memoQ does this.
- Selected documents radio button: Click this to process the selected documents only. Before you use this, select the documents you want to work with. You can do it from Project home, under Translations.
- Translation memories: Check this check box to process the source-language text in translation memories. The translation memories must be in the project for that. If there are no translation memories in the project, this part is greyed out.
- All memories in project radio button: Click this to process all translation memories in the project. Normally, memoQ does this.
- Primary TM radio button: Click this to process the working translation memory only.
- Selected TMs radio button: Click this to process the selected translation memories. Before you use this, select the translation memories you want to work with. You can do it from Project home, under Translation memories.
- LiveDocs corpus documents check box: Check this to process the source-language text in the LiveDocs corpora in the current project. This check box is not checked at first. If there are no LiveDocs corpora in the project, this part is greyed out.
- All documents shown radio button: Click this to process all documents of all LiveDocs corpora in the project. Normally, memoQ does this.
- Selected documents radio button: Click this to process the selected documents in the selected LiveDocs corpus. Before you use this, select one or more documents in a LiveDocs corpus. You can do it from Project home, under LiveDocs.
Under Options, you can fine-tune the term extraction process.
Term extraction in memoQ is fully statistical: It's based on the length and the frequency of the candidates. To extract candidates, memoQ doesn't use any linguistic intelligence like stemming or parsing. The options here control the statistical procedure.
General:
- Maximum length (words) text box: The number of words in the longest term candidate. memoQ will not list expressions that are longer than this. Normally, it's 4.
- Minimum frequency box: memoQ lists only candidates that occur in the source text at least as many times as the number specified here. For example, if the minimum frequency is 3, the list will contain candidates that occur 3 or more times in the source text. Normally, it's 3.
- Expression delimiters box: This is a list of characters that mark the beginning or the end of a term candidate. memoQ won't extract expressions where one or more of these characters occur inside the expression.
- Length factor box: This is a number between 0.5 and 3. It controls how much memoQ favors longer expressions. Each term candidate (that is, extracted expression) receives a score during the extraction process. The larger the length factor, the larger the difference between the score of a longer and a shorter expression. Normally, it's 1.5.
- Ignore words with numbers check box: If this is checked, memoQ won't list an expression if there is a word in it that contains one or more digits. Normally, this isn't checked.
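To make the General options more concrete, here is a minimal sketch of a purely statistical extractor for multi-word candidates. It is not memoQ's implementation: the function name, the default delimiter characters, and above all the scoring formula (frequency multiplied by the word count raised to the length factor) are assumptions chosen only so that a larger length factor widens the score gap between longer and shorter expressions.

```python
import re
from collections import Counter


def extract_candidates(text, max_words=4, min_freq=3,
                       delimiters='.,;:!?()[]"',
                       ignore_numbers=False, length_factor=1.5):
    """Return (candidate, score) pairs, best score first.

    Illustrative only: the default delimiters and the scoring formula
    are assumptions, not memoQ's actual values.
    """
    # Split at delimiters so that no candidate contains a delimiter.
    if delimiters:
        chunks = re.split("[" + re.escape(delimiters) + "]", text)
    else:
        chunks = [text]

    counts = Counter()
    for chunk in chunks:
        words = chunk.lower().split()
        # Count every word sequence of 2..max_words words.
        for n in range(2, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if ignore_numbers and any(
                        ch.isdigit() for w in gram for ch in w):
                    continue  # skip expressions with digits in a word
                counts[" ".join(gram)] += 1

    # Keep frequent candidates; assumed score favors longer expressions.
    scored = {
        cand: freq * len(cand.split()) ** length_factor
        for cand, freq in counts.items()
        if freq >= min_freq
    }
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
```

With the text "term base entry. term base entry; term base entry, other words" and the default settings, the sketch returns three candidates, with "term base entry" ranked first. "other words" is dropped because it occurs only once.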
Single-word terms: memoQ uses a different approach to extract single-word term candidates. There are different settings for them.
- Minimum length (characters) box: memoQ does not list words that are shorter (in characters) than the number specified here. For example, if the minimum length is 3, memoQ extracts single-word candidates that are 3 characters long or longer. Normally, this is 3.
Minimum length isn't used for multi-word candidates.
- Minimum frequency box: memoQ lists only candidates that occur in the source text at least as many times as the number specified here. For example, if the minimum frequency is 3, the list will contain candidates that occur 3 or more times in the source text. Normally, this is 3.
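The two single-word thresholds can be illustrated as follows. The approach memoQ actually uses for single-word candidates is not described here, so this sketch only shows how a minimum length and a minimum frequency would filter a word list. The function name is hypothetical.

```python
from collections import Counter


def single_word_candidates(words, min_length=3, min_freq=3):
    """Filter single words by length (characters) and frequency.

    Hypothetical helper that only demonstrates the two thresholds.
    """
    counts = Counter(w.lower() for w in words)
    return sorted(
        w for w, freq in counts.items()
        if len(w) >= min_length and freq >= min_freq
    )
```

For the words of "the pump is on the pump is off the pump", the default settings keep "pump" and "the". Raising the minimum length to 4 keeps only "pump", which also shows why a stop word list is still useful for short, frequent words.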
Term base lookup: When memoQ extracts candidates, it looks for expressions in the source-language text only. But memoQ can use term bases to look up possible translations for the extracted candidates.
- Look up candidates check box: Normally, memoQ looks up translations for each candidate in the term bases in the project. If you don't want to do this, clear the check box.
- All term bases in project radio button: Click this to look up the candidates in all term bases in the project. Normally, memoQ does this.
- Term base with the highest rank only radio button: Click this to look up the candidates in the highest-ranked term base only.
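The difference between the two lookup choices can be sketched like this. The data model is an assumption made for the example: each term base is a (rank, entries) pair, rank 1 is treated as the highest rank, and entries maps a source term to a list of translations. The function name look_up is hypothetical as well.

```python
def look_up(candidate, term_bases, highest_rank_only=False):
    """Collect translations for a candidate from ranked term bases.

    Assumed model: term_bases is a list of (rank, entries) pairs,
    where rank 1 is the highest and entries is a dict of
    source term -> list of translations.
    """
    ordered = sorted(term_bases, key=lambda tb: tb[0])
    if highest_rank_only:
        ordered = ordered[:1]  # keep the top-ranked term base only

    translations = []
    for _rank, entries in ordered:
        for target in entries.get(candidate, []):
            if target not in translations:  # avoid duplicates
                translations.append(target)
    return translations
```

With highest_rank_only set to True, a candidate that appears only in a lower-ranked term base gets no translation, which is the trade-off between the two radio buttons.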
There may be words that must not occur at the beginning, at the end, or inside a term. If an expression begins with, ends in, or contains one of these words, it will not be listed as a term candidate.
These are called stop words.
- In the lower part of the Extract candidates window, you can list stop words. Each stop word has three options: you can exclude words from the beginning, the end, or from any position of an expression.
- In memoQ, you can create, save, and use stop word lists. To load an existing stop word list: Choose one from the Stop word list drop-down box.
- To save the current stop word list: Next to the Stop word list box, click Save as. The Create new stop word list window opens. Type a name and a description. Click OK.
Stop word lists are resources: You can use the Resource Console to save, load, or manage them.
The screenshot is just an example: memoQ may contain different stop word lists. On the other hand, there may be no default stop word list for your source language.
To add a new stop word to the list: In the Word box at the bottom, type the word (it can't be an expression!). Click Add.
Normally, memoQ adds the word to the list with all check boxes checked (Blocks inside, Blocks as first, and Blocks as last). After you add a word, you may want to clear one or more of these check boxes - if the word may still occur inside, at the beginning, or at the end of a term:
- Blocks inside: Clear this check box if the word may occur inside a term.
- Blocks as first: Clear this check box if the word may occur at the beginning of a term.
- Blocks as last: Clear this check box if the word may occur at the end of a term.
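The three check boxes can be read as three independent rules per stop word. The sketch below shows one way to apply them to an expression. The StopWord class and the is_blocked function are hypothetical names, and treating "inside" as any position that is neither first nor last is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StopWord:
    word: str
    blocks_inside: bool = True    # all three are on for a new word
    blocks_as_first: bool = True
    blocks_as_last: bool = True


def is_blocked(expression, stop_words):
    """Return True if a stop word rule rejects the expression.

    Assumption: "inside" means neither the first nor the last word.
    """
    words = expression.lower().split()
    rules = {sw.word.lower(): sw for sw in stop_words}
    last = len(words) - 1
    for i, word in enumerate(words):
        rule = rules.get(word)
        if rule is None:
            continue
        if i == 0 and rule.blocks_as_first:
            return True
        if i == last and rule.blocks_as_last:
            return True
        if 0 < i < last and rule.blocks_inside:
            return True
    return False
```

For example, with StopWord("of", blocks_inside=False), the expression "out of memory" is kept, while "of memory" and "memory of" are both rejected.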
To remove a stop word from the list: In the list, click the word. Click Delete selected.
You cannot edit stop words that are already on the list. To change a stop word, delete it, then add it again.
You don't need to run term extraction to edit a stop word list: You can also prepare stop words in the Edit stop word list window, which you open from the Resource Console.
When using a read-only stop word list, you cannot change anything in the Stop words region of this window.
When you finish
To start extracting candidates: Click OK.
When memoQ finishes extracting candidates: The candidate list editor opens on a new document tab.
To close the window without extracting candidates and return to Project home, the translation editor, or the Extract terms window: Click Cancel.