# Automatic term extraction for domain-specific corpora

Using simple frequency-based methods, such as the Domain Specificity method and Domain-Specific TF-IDF, it is possible to automatically extract and score terms for a given domain-specific corpus. In this article, we will use Python and its ecosystem to illustrate such methods in action.

Let's start with the definition of domain-specific terms. If a term occurs relatively more frequently in a domain-specific text than in a non-domain text, the term is regarded as domain-specific. The goal of domain-specific term extraction is to automatically extract such terms from a given corpus.

## An introduction to text preprocessing

Usually, texts from a corpus can't be used without additional preparation steps required by the specific task we want to perform. Let's briefly go through such steps.

### Tokenization

Tokenization is the process of splitting a text into individual words, sequences of words (n-grams), symbols, or other meaningful elements called tokens. However, it is sometimes difficult to define what is meant by a "word", and the definition can even vary between problems. Tokenization is also a language-specific problem: in most cases, an approach that works for English won't work for Chinese, and vice versa.

There are many ways to tokenize a text. For instance, we can simply split the text on punctuation or whitespace characters (e.g. spaces, line breaks, etc.).
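As a rough illustration (not the article's actual pipeline, which relies on NLTK), a naive whitespace-and-punctuation tokenizer might look like this:

```python
import string

def naive_tokenize(text):
    # Split on whitespace, then strip punctuation from the edges of each piece.
    # Interior punctuation (e.g. the dot in "6.1") is kept.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]

print(naive_tokenize("Prices rose 6.1 pct, Reuters said."))
# ['Prices', 'rose', '6.1', 'pct', 'Reuters', 'said']
```

A real tokenizer has to handle many more cases (contractions, hyphenation, URLs), which is why we delegate this step to NLTK later on.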

### Part-of-speech tagging

Part-of-speech tagging (or POS tagging) is the process of assigning parts of speech to words. It is harder than just having a list of words with their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken.

Why is POS tagging important for us? Because we want to select candidates for our domain-specific terms based on POS patterns in n-grams, or simply by keeping only nouns.

There are several popular POS taggers for English you might be interested in checking out, for example NLTK's averaged perceptron tagger, the Stanford POS Tagger, and spaCy.

### Stemming and Lemmatization

Stemming is a process that chops off the ends of words in the hope of obtaining the base form correctly most of the time, and often includes the removal of derivational affixes. A more complex approach to determining the stem of a word is lemmatization. It tries to do things properly: it uses a vocabulary, performs morphological analysis, and applies different normalization rules for each part of speech. The canonical form returned by the lemmatization process is known as a lemma.

The most common algorithm for stemming English is the Porter stemming algorithm by Martin Porter (1980). You can find many implementations of this algorithm, as well as other approaches based on stochastic algorithms, n-grams, and so on.
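NLTK ships an implementation of the Porter stemmer that needs no extra data downloads, so it is easy to try out:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "running"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress
# ponies -> poni
# running -> run
```

Note that "poni" is not a real word: a stemmer only needs its output to be consistent, not linguistically valid, which is exactly what distinguishes it from a lemmatizer.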

### Word normalization

Word normalization is performed to aggregate all the different expressions for the same concept. In speech transcripts, many words are used in several different variations such as inflections, abbreviations, and alternative spellings (e.g., UK and US spellings). To get good results, it makes sense to identify all such variations of a word and aggregate them into a canonical form.
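For example, a tiny normalization table can map variant spellings to one canonical form (the entries below are illustrative, not part of the article's pipeline):

```python
# Illustrative variant -> canonical mapping; a real table would be much larger.
CANONICAL = {
    "colour": "color",
    "organisation": "organization",
    "pct": "percent",
}

def normalize(token):
    token = token.lower()
    return CANONICAL.get(token, token)

print([normalize(t) for t in ["Colour", "pct", "bank"]])
# ['color', 'percent', 'bank']
```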

### Other

There are many other preprocessing steps that are not mentioned in this article. If you are interested in more details, check the Preprocessing section of the Text Analysis with Topic Models for the Humanities and Social Sciences tutorial by Allen Riddell.

## Our pipeline

In this example we use the Reuters-21578 Text Categorization Collection, a collection of documents that appeared on the Reuters newswire in 1987. 10,788 news documents were assembled and indexed with 90 categories. You can use this collection directly from NLTK, which provides already tokenized documents and lets you filter by category.

We build the foreground corpus (fg, the domain-specific corpus) from documents indexed with the categories money-fx and money-supply, so our domain is expected to be finance-related. All other documents we use as our background corpus (bg, a general corpus without any domain specificity). There are 883 documents in the foreground corpus and 9905 documents in the background corpus.

Preparing good foreground and background corpora is not a trivial problem, and there are many requirements and rules for doing so, but this is beyond the scope of this article.

Our preprocessing will consist of tokenization (done implicitly by NLTK), part-of-speech tagging, and lemmatization of each word:

Please note that NLTK is not a well-optimized library and in many cases is not suitable for production use. For instance, if you call the nltk.pos_tag function directly, it loads the POS tagger from disk on every call (which will cost you about 15 seconds each time). To avoid that, we bypass the convenience function, keep a single tagger instance, and call it directly.

Now it's time to define which kinds of tokens we are interested in. Let's consider unigrams, bigrams, and trigrams that satisfy the following criteria:

• unigrams
  • a noun with a lemma of 3 or more characters (<NN*>)
• bigrams
  • a number followed by a noun (<CD><NN*>, e.g. 6.1 billion)
  • a noun followed by a noun (<NN*><NN*>, e.g. milk chocolate)
  • an adjective followed by a noun (<JJ><NN*>, e.g. beautiful flowers)
• trigrams
  • 3 consecutive nouns (<NN*><NN*><NN*>)
  • an adjective followed by 2 nouns (<JJ><NN*><NN*>, e.g. white motor yacht)
  • 2 adjectives followed by a noun (<JJ><JJ><NN*>, e.g. big old house)
  • a noun followed by a preposition or subordinating conjunction followed by a noun (<NN*><IN><NN*>, e.g. quality of service)
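With Penn Treebank tags, these patterns can be matched by a small helper (a sketch of mine, not the article's original snippet; `NN*` means "any tag beginning with NN"):

```python
PATTERNS = [
    ("NN*",),                # unigram: noun
    ("CD", "NN*"),           # number + noun
    ("NN*", "NN*"),          # noun + noun
    ("JJ", "NN*"),           # adjective + noun
    ("NN*", "NN*", "NN*"),   # noun + noun + noun
    ("JJ", "NN*", "NN*"),    # adjective + noun + noun
    ("JJ", "JJ", "NN*"),     # adjective + adjective + noun
    ("NN*", "IN", "NN*"),    # noun + preposition/conjunction + noun
]

def tag_matches(tag, pattern):
    return tag.startswith(pattern[:-1]) if pattern.endswith("*") else tag == pattern

def candidates(tagged):
    """Yield 1-3-grams of a (word, tag) list that match one of the patterns."""
    for n in (1, 2, 3):
        for i in range(len(tagged) - n + 1):
            gram = tagged[i:i + n]
            if n == 1 and len(gram[0][0]) < 3:
                continue  # unigrams must be at least 3 characters long
            if any(len(p) == n and
                   all(tag_matches(t, q) for (_, t), q in zip(gram, p))
                   for p in PATTERNS):
                yield tuple(w for w, _ in gram)

print(list(candidates([("quality", "NN"), ("of", "IN"), ("service", "NN")])))
# [('quality',), ('service',), ('quality', 'of', 'service')]
```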

After the preprocessing step we have all our tokens in the form (lemma, word, tag). We will perform estimations on lemmas only, so we build a dictionary to convert lemmas back to words. (Many word forms can share the same lemma, and for analysis we don't want to distinguish between them, but for output it might be interesting to see which forms actually appeared in the text.)

Our methods are based on frequency analysis of the tokens in the corpus, so we need to count token occurrences per document and convert the list of documents into a matrix of counts. In the code snippet below I use CountVectorizer from the popular Python library scikit-learn, but it's straightforward to do this in pure Python as well.

Well, we are ready to apply our methods, but let's first build a function that will print the top domain-specific terms from the ranked lists produced by these algorithms:
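A simple helper for this might look like the following (a sketch; `scores` maps lemma to score and `lemma_forms` maps lemma to a Counter of its surface forms — both names are mine):

```python
from collections import Counter

def print_top(scores, lemma_forms, top_n=25, n_forms=2):
    """Print the top-scored lemmas with their most frequent surface forms."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    for lemma, score in ranked:
        forms = [w for w, _ in lemma_forms.get(lemma, Counter()).most_common(n_forms)]
        print(f"{score:10.4f}  {lemma:15s} {', '.join(forms)}")
    return ranked

print_top({"bank": 3.5, "rate": 2.1},
          {"bank": Counter({"banks": 5, "bank": 2}),
           "rate": Counter({"rates": 3})})
```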

## Domain Specificity method

The Domain Specificity method was proposed by Park et al. (2008); it directly compares term frequencies in documents for a given domain with term frequencies in a general document collection. We can define the domain specificity of a token as the relative probability of the token occurring in a domain text versus in a general text.

domain\_specificity("token") = \frac{p_{fg}("token")}{p_{bg}("token")} = \frac{count_{fg}("token") / N_{fg}}{count_{bg}("token") / N_{bg}}

where p_{fg}("token") is the probability of the token in the domain-specific (foreground) corpus, and p_{bg}("token") is the probability of the token in the general (background) corpus. Using MLE, each probability is estimated as the number of occurrences of the token, count_{fg}("token") or count_{bg}("token"), divided by the total number of tokens, N_{fg} or N_{bg}, in the corresponding corpus. In our implementation, if count_{bg}("token") = 0, we set the denominator to 1.
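In pure Python this is only a few lines (a sketch; `fg_tokens` and `bg_tokens` are flat lists of lemmas from each corpus):

```python
from collections import Counter

def domain_specificity(fg_tokens, bg_tokens):
    """Score each foreground token by p_fg(token) / p_bg(token)."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    scores = {}
    for token, count in fg.items():
        p_fg = count / n_fg
        if bg[token]:
            scores[token] = p_fg / (bg[token] / n_bg)
        else:
            scores[token] = p_fg  # denominator is set to 1 when unseen in bg
    return scores

scores = domain_specificity(["money"] * 3 + ["the"] * 7,
                            ["money"] * 1 + ["the"] * 9)
print(round(scores["money"], 6))  # 3.0 -- "money" is 3x more likely in fg
```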

Let's run the Domain Specificity method on our corpus and print the 25 top-scored lemmas, outputting the 2 most frequent forms for each lemma:

## Domain-Specific TF-IDF method

The Domain-Specific TF-IDF method used in this article was proposed by Su Nam Kim et al. (2009). It is an unsupervised method based on TF-IDF. The basic underlying idea is that domain-specific terms occur in a particular domain with markedly higher frequency than they do in other domains, similar to the term frequency patterns captured by TF-IDF.

The term frequency (TF) of a token in the domain corpus is calculated as:

TF("token") = \frac{count_{fg}("token")}{\sum_{t \in fg} count_{fg}(t)}

where count_{fg}("token") is the number of occurrences of the token in the domain (foreground) corpus.

The inverse domain frequency (IDF) is calculated via:

IDF("token") = \log\left(\frac{|fgbg|}{1 + |\{d \in fgbg : "token" \in d\}|}\right)

where fgbg is the set of all documents (from foreground and background corpora).

The final TF-IDF value of a given token is the simple product of TF and IDF:

score("token") = TF("token") \cdot IDF("token")
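Putting the three formulas together in pure Python (a sketch of mine; each document is a list of lemmas):

```python
import math
from collections import Counter

def domain_tfidf(fg_docs, bg_docs):
    """Score foreground tokens by TF (within fg) times IDF (over fg + bg docs)."""
    fg_counts = Counter(token for doc in fg_docs for token in doc)
    n_fg_tokens = sum(fg_counts.values())

    all_docs = fg_docs + bg_docs
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))  # each document counts once per token

    return {token: (count / n_fg_tokens) *
                   math.log(len(all_docs) / (1 + doc_freq[token]))
            for token, count in fg_counts.items()}

scores = domain_tfidf([["money", "supply"], ["money", "rate"]],
                      [["weather"], ["sport", "news"]])
print(round(scores["money"], 4))  # 0.1438
```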

Let's run the Domain-Specific TF-IDF method on our corpus and print the 25 top-scored lemmas, outputting the 2 most frequent forms for each lemma:

## Conclusion

As you can see above, the results of the two methods differ, but both are pretty close to what we would expect of terms from the finance news domain. By the way, it seems that much of the financial news on Reuters in 1987 was dedicated to German finance (Bundesbank, mark, west german interest).

Of course, there are a number of noisy words that aren't obviously connected to the finance domain but co-occurred with it in texts from the corpus (e.g. total help, week). You might want to filter them out.

In general, I can say that the methods are very fast and provide good enough results for a first iteration. You can increase the quality of the results by using a better POS tagger and by filtering for other token patterns. As a next step I would also consider methods such as Latent Dirichlet Allocation (LDA), Word2Vec, and others.