Calculating unique and repeated word counts

How we calculate unique and repeated words



We generate word counts using our custom-built CAT (computer-aided translation) tool, which is essentially a language database. The memory in the tool is built over time as translators add source text and its corresponding human translations to the database.

 

If there's no stored memory, as in the case of a first-time client, the tool looks only for text repetitions, or 'repeated words', within the source document being translated. With each subsequent order by the same client, a translation memory (TM) is built up.

 

When we translate, we usually translate one sentence before moving to the next. Logically, this helps us understand and retain the context of the content. Since what constitutes a sentence can differ from language to language, we refer to this breakdown of text as 'segments'. Learn more about segments and repeated words

 

CAT tools compare segments against one another and identify whether each segment is unique or repeated. If there is even a slight difference between two segments, each of them is considered unique.


For example, here are three segments from a text:


My name is Mark (Unique)

My name is Mark. (Unique)

My name is Mark. (Repeated)

Although the words in all the segments are the same, there is a difference in punctuation. Since there is no full stop after the word ‘Mark’ in the first segment, it differs from the other two and is considered unique. The second and third segments, however, are identical, so we consider the first occurrence of the segment as unique and any subsequent use as a repetition.

You may wonder why the second segment is unique even though its words are the same as those in the first. This is because we don't look at individual words but rather at the segment as a whole. This helps us maintain the context.
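
The example above can be sketched in a few lines of Python. This is only an illustration of the comparison rule described here, not the actual code of our CAT tool; the function name `classify_segments` is hypothetical, and the sketch assumes segments are compared as exact strings, so any difference in punctuation makes them distinct.

```python
def classify_segments(segments):
    """Label each segment 'Unique' on first appearance, 'Repeated' afterwards."""
    seen = set()
    labels = []
    for segment in segments:
        # Assumption: exact string comparison, so "Mark" and "Mark." differ.
        if segment in seen:
            labels.append("Repeated")
        else:
            seen.add(segment)
            labels.append("Unique")
    return labels


example = ["My name is Mark", "My name is Mark.", "My name is Mark."]
print(classify_segments(example))  # ['Unique', 'Unique', 'Repeated']
```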

 

To calculate the repeated words for a segment, we multiply the number of times the segment is repeated (that is, every occurrence after the first) by the number of words within the segment. To get the total number of repeated words, we do the same calculation for each repeated segment and add the results.
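
As a rough illustration of this calculation, here is a minimal Python sketch. It is not the code of our CAT tool: the function names `count_words` and `word_counts` are hypothetical, words are assumed to be whitespace-separated, and the unique word count is assumed to be the number of words in the first occurrence of each distinct segment.

```python
from collections import Counter


def count_words(segment):
    # Assumption: a "word" is any whitespace-separated token.
    return len(segment.split())


def word_counts(segments):
    """Return (unique_words, repeated_words) for a list of segments."""
    occurrences = Counter(segments)  # exact-match count per distinct segment
    unique_words = 0
    repeated_words = 0
    for segment, times_seen in occurrences.items():
        words = count_words(segment)
        unique_words += words                        # first occurrence
        repeated_words += (times_seen - 1) * words   # repetitions x words
    return unique_words, repeated_words


example = ["My name is Mark", "My name is Mark.", "My name is Mark."]
unique_words, repeated_words = word_counts(example)
print(unique_words, repeated_words)  # 8 4
```

For the three example segments, the two distinct segments contribute 4 words each to the unique count, and the one repetition of 'My name is Mark.' contributes 1 × 4 = 4 repeated words.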
