memoQ Help - Term extraction

Why we need term extraction

Consistent translation of field-specific terms is crucial to the quality of translation. However, a translator might not recognize all possible terms in the source text.

A translation task, especially one that involves multiple translators, greatly benefits from a pre-compiled term base. This term base is especially useful if it contains the very terms that occur in the source text, and reflects the same context as the source text.

Creating a term base from the source text manually takes a lot of time.

Automatic term extraction

Automatic term extraction can reduce the time needed to build a term base to one-half or even one-third of what manual extraction would require.

A term extractor reads through the entire source text very quickly, and creates a list of phrases that might be terms. To decide whether a phrase could be a term, the term extractor uses statistical and linguistic information. A term extractor might also be bilingual: in addition to listing possible source terms, it mines existing translation memories, bilingual corpora, and term bases to find target-language equivalents for the extracted source-language terms.

memoQ's built-in term extractor

memoQ's built-in term extractor module is language-independent, and does not use sophisticated linguistic tools. It uses a combination of statistical methods and detailed stop word lists. A phrase stands a good chance of being a term if it occurs in the text more than a specific number of times. However, because memoQ has no linguistic information about the words, irrelevant word sequences (for example, '*of the') might be included in the list. This is largely prevented by using stop words.
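The statistical approach described above can be illustrated with a minimal Python sketch. This is not memoQ's actual implementation: the function name extract_candidates, the parameters max_words and min_count, and the tiny stop word list are all assumptions made for illustration. The sketch counts word sequences, discards those that begin or end with a stop word, and keeps the ones that occur often enough.

```python
import re
from collections import Counter

# Hypothetical, very small stop word list; real lists are far more detailed.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def extract_candidates(text, max_words=3, min_count=2, stop_words=STOP_WORDS):
    """Return {phrase: count} for word sequences occurring at least min_count times.

    Illustrative only: the thresholds and filtering rules are assumptions,
    not memoQ's documented behavior.
    """
    counts = Counter()
    # Split at punctuation so that a phrase never spans a sentence boundary.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", chunk)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Drop sequences such as "of the" that start or end with a stop word.
                if gram[0] in stop_words or gram[-1] in stop_words:
                    continue
                counts[" ".join(gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


sample = (
    "The term base is useful. A term base of the project helps. "
    "The project uses a term base."
)
candidates = extract_candidates(sample)
print(sorted(candidates.items()))
```

With this sample, "term base" is kept because it occurs three times, while "of the" never reaches the list because both of its words are stop words. Note that single words such as "base" still appear, which is the kind of noise that has to be cleaned up by hand later.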

After scanning the source text for possible terms, memoQ's term extractor module queries the available term bases for target-language equivalents, and fills in the list wherever there is a match. Currently, memoQ does not scan available translation memories and LiveDocs corpora for target-language equivalents.

However, memoQ can extract source terms from translation memories and LiveDocs corpora.
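The term base lookup step described above can be sketched in the same spirit. The function fill_targets and the representation of a term base as a plain dictionary with lowercase source terms as keys are assumptions for illustration; memoQ's term bases are richer than this, and the rule that earlier term bases take priority is also only an assumption.

```python
def fill_targets(candidates, term_bases):
    """Pair each candidate source term with a target equivalent, if one exists.

    candidates: iterable of source-language phrases.
    term_bases: list of dicts mapping lowercase source terms to target terms
                (a hypothetical stand-in for real term bases).
    Returns a list of (source, target) tuples; target is None when no
    term base contains the source term.
    """
    rows = []
    for source in candidates:
        target = None
        for term_base in term_bases:  # assumption: earlier term bases win
            if source.lower() in term_base:
                target = term_base[source.lower()]
                break
        rows.append((source, target))
    return rows


# Hypothetical example data
rows = fill_targets(["term base", "project"], [{"term base": "Termdatenbank"}])
print(rows)
```

Candidates that come back with None as the target are the ones that still need a target-language equivalent filled in by hand.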

Working with term extraction

An automatic term extractor always gives you a lot of noise: usually, less than 50% of the returned list consists of relevant terms. You will need to set aside 2-4 hours to clean up the list, that is, to delete or discard irrelevant terms. However, the alternative would be reading through hundreds or thousands of pages very thoroughly, and that would take days, not hours.

In memoQ, you can clean up your list in the term extraction editor or candidate list editor (a separate document tab).

As a second step, you need to fill in the target-language equivalents for the legitimate terms that remain on the list. You can use existing term bases to help.

Finally, you can turn your list into an ordinary term base, ready to distribute among your translators – or ready to use in your own work.

Note: In memoQ, you can also get hits from the list of extracted candidates before it is converted into a term base. You can start using the extracted terms as soon as the list contains any candidates, and correct or expand them as you go.