Automatic term extraction for domain-specific corpora

Simple frequency-based methods, such as the Domain Specificity method and Domain-Specific TF-IDF, make it possible to automatically extract and score terms for a given domain-specific corpus. In this article, we will use Python and its ecosystem to illustrate these methods in action.

Let's start with the definition of a domain-specific term. If a term occurs relatively more frequently in domain-specific text than in non-domain text, the term is regarded as domain-specific. The goal of domain-specific term extraction is to automatically extract such terms from given corpora.

An introduction to text preprocessing

Texts from a corpus usually can't be used as they are. They need additional preparation steps, and which steps are required depends on the specific task we want to perform. Let's briefly go through these steps.

Tokenization

Tokenization is the process of splitting a text into individual words, sequences of words (n-grams), symbols, or other meaningful elements called tokens. However, it is sometimes difficult to define what is meant by a "word", and the definition can even vary between problems. Tokenization is also a language-specific problem: in most cases, an approach that works for English won't work for Chinese and vice versa.

There are many ways to tokenize a text. For instance, we can simply split it on punctuation or whitespace characters (e.g. space, line break, etc.), or use a ready-made tokenizer such as the one from NLTK:

from nltk import word_tokenize
 
tokens = word_tokenize("This is an example.")
print(tokens)
 
['This', 'is', 'an', 'example', '.']
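
As a rough illustration of the plain splitting approach mentioned above, here is a minimal sketch that uses only the standard library. The name simple_tokenize is ours and is not part of NLTK; the regular expression treats every run of word characters as one token and every other non-space character as a token of its own.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is an example."))
# ['This', 'is', 'an', 'example', '.']
```

Such a splitter is easy to write but crude: it breaks "6.1" into three tokens, which is one reason to prefer a real tokenizer for anything beyond a quick experiment.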

Part-of-speech tagging

Part-of-speech tagging (or POS tagging) is the process of assigning parts of speech to words. It is harder than just looking words up in a list with their parts of speech, because some words can represent more than one part of speech depending on context, and because some parts of speech are complex or unspoken.

Why is POS tagging important for us? Because we want to select candidates for our domain-specific terms based on POS patterns in n-grams, or just by keeping nouns only.

Several POS taggers are available for English. The example below uses the tagger behind NLTK's pos_tag function:

from nltk import pos_tag, word_tokenize
 
tokens = word_tokenize("This is an example.")
print(pos_tag(tokens))
 
[('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('.', '.')]
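
To show how such tags can be used to pick candidates, here is a small sketch of the "keep nouns only" strategy. The helper name keep_nouns is ours; the tagged list is copied from the output above so that the snippet runs without NLTK.

```python
def keep_nouns(tagged_tokens):
    """Return the words whose Penn Treebank tag starts with "NN"."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

# Tagged output copied from the example above.
tagged = [("This", "DT"), ("is", "VBZ"), ("an", "DT"), ("example", "NN"), (".", ".")]
print(keep_nouns(tagged))
# ['example']
```

Matching on the "NN" prefix also accepts plural and proper nouns (NNS, NNP, NNPS).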

Stemming and Lemmatization

Stemming is a crude process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and it often includes the removal of derivational affixes. A more complex approach to the problem of determining the stem of a word is lemmatization. It tries to do things properly: it uses a vocabulary, performs morphological analysis and applies different normalization rules for each part of speech. The canonical form returned by the lemmatization process is known as the lemma.

The most common algorithm for stemming English is the Porter stemming algorithm by Martin Porter (1980). You can find many implementations of this algorithm, as well as other approaches based on stochastic algorithms, n-grams, and so on. The example below shows lemmatization with NLTK's WordNet lemmatizer:

from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("corpora"))
# corpus
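
For contrast with lemmatization, here is a toy suffix-stripping stemmer. It is only a sketch of the "chop off the ends" idea and is not the Porter algorithm; the suffix list and the name toy_stem are our own assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, not Porter's rules

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ["markets", "traded", "banking", "corpora"]:
    print("%s -> %s" % (word, toy_stem(word)))
# markets -> market
# traded -> trad
# banking -> bank
# corpora -> corpora
```

The stem "trad" is not a word, and "corpora" is left untouched because no rule applies to it, whereas the lemmatizer above maps it to "corpus" by consulting a vocabulary.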

Word normalization

Word normalization is performed to aggregate all the different expressions of the same concept. In speech transcripts, many words are used in several variations, such as inflections, abbreviations, and alternative spellings (e.g., UK and US spellings). To get good results, it makes sense to identify all such variations of a word and map them to a canonical form.
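
A minimal sketch of this idea is a lookup table from variants to canonical forms. The table below is hypothetical and exists only for illustration; a real one would have to be built for the corpus at hand.

```python
# Hypothetical variant table (spelling variants and abbreviations).
CANONICAL = {
    "colour": "color",
    "organisation": "organization",
    "mln": "million",
    "pct": "percent",
}

def normalize(word):
    """Lowercase a word and map known variants to a canonical form."""
    word = word.lower()
    return CANONICAL.get(word, word)

print([normalize(w) for w in ["Colour", "color", "MLN", "bank"]])
# ['color', 'color', 'million', 'bank']
```

Words that are not in the table pass through unchanged apart from lowercasing.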

Other

There are many other preprocessing steps that are not covered in this article. If you are interested in more details, check the Preprocessing section of the Text Analysis with Topic Models for the Humanities and Social Sciences tutorial by Allen Riddell.

Our pipeline

In this example we use the Reuters-21578 Text Categorization Collection, a collection of documents that appeared on the Reuters newswire in 1987. 10,788 news documents were assembled and indexed with 90 categories. You can use this collection directly from NLTK, which provides the documents already tokenized and lets you filter them by category.

We build the foreground corpus ($fg$, the domain-specific corpus) from documents indexed with the categories money-fx and money-supply, so our domain is expected to be finance-related. All other documents form our background corpus ($bg$, a general corpus without any domain specificity). There are 883 documents in the foreground corpus and 9905 documents in the background corpus.

Preparing good foreground and background corpora is not a trivial problem, and there are many requirements and rules for doing so, but this is beyond the scope of this article.

from nltk.corpus import reuters
 
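fg_files = reuters.fileids(["money-fx", "money-supply"])
# a set makes the membership test fast, and the list comprehension
# keeps bg_files a real list so it can be concatenated below
fg_set = set(fg_files)
bg_files = [file_id for file_id in reuters.fileids() if file_id not in fg_set]
 
fgbg_files = fg_files + bg_files
fgbg_labels = [1] * len(fg_files) + [0] * len(bg_files)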

Our preprocessing consists of tokenization (done by NLTK implicitly), part-of-speech tagging and lemmatization of each word:

from nltk.tag.perceptron import PerceptronTagger
from nltk.stem import WordNetLemmatizer
 
tagger = PerceptronTagger()
lemmatizer = WordNetLemmatizer()
 
def join_ngram(tokens, glue="~"):
    """Join the lemmas, words and tags of consecutive tokens with glue."""
    return tuple(glue.join(part) for part in zip(*tokens))
 
def preprocess(file_id):
    """Get text from file_id and perform basic preprocessing steps."""
    bag_of_words = reuters.words(file_id)
    # tag with the preloaded tagger; this does the same work as the private
    # nltk.tag._pos_tag(bag_of_words, None, tagger) without reloading the model
    tagged = tagger.tag(bag_of_words)
 
    # build unigrams as (lemma, word, tag) triples, then bigrams and
    # trigrams whose parts are joined with "~"
    unigrams = [(lemmatizer.lemmatize(w.lower()), w, t) for w, t in tagged]
    bigrams = [join_ngram(gram) for gram in zip(unigrams, unigrams[1:])]
    trigrams = [join_ngram(gram) for gram in zip(unigrams, unigrams[1:], unigrams[2:])]
 
    # keep n-grams that fit the pattern
    return ([u for u in unigrams if is_good_unigram(u)] +
            [b for b in bigrams if is_good_bigram(b)] +
            [t for t in trigrams if is_good_trigram(t)])
 
fgbg = [preprocess(file_id) for file_id in fgbg_files]
 
print(fgbg[1])
[(u'money', u'MONEY', 'NNP'),
 (u'market', u'MARKET', 'NNP'),
 (u'deficit', u'DEFICIT', 'NNP'),
 ...
 (u'money~market', u'MONEY~MARKET', 'NNP~NNP'),
 (u'market~deficit', u'MARKET~DEFICIT', 'NNP~NNP'),
 (u'deficit~forecast', u'DEFICIT~FORECAST', 'NNP~NNP'),
 (u'forecast~at', u'FORECAST~AT', 'NNP~NNP'),
 (u'250~mln', u'250~MLN', 'CD~NNP'),
 ...
]
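
The bigrams and trigrams above are built by zipping the token list with shifted copies of itself. Here is a minimal sketch of that idea on plain strings; the helper name ngrams is ours and is introduced only for illustration.

```python
def ngrams(tokens, n, glue="~"):
    """Join every run of n consecutive tokens with glue."""
    shifted = [tokens[i:] for i in range(n)]  # tokens, tokens[1:], ...
    # zip stops at the shortest list, so only complete n-grams are produced
    return [glue.join(gram) for gram in zip(*shifted)]

words = ["money", "market", "deficit", "forecast"]
print(ngrams(words, 2))
# ['money~market', 'market~deficit', 'deficit~forecast']
print(ngrams(words, 3))
# ['money~market~deficit', 'market~deficit~forecast']
```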

Please note that NLTK is not a well-optimized library and in many cases isn't suitable for production use. For instance, if you call the nltk.pos_tag function directly, it loads the POS tagger from disk on every call (which costs about 15 seconds). To avoid that, we load the tagger once and bypass nltk.pos_tag in favor of a lower-level call that reuses it.

Now it's time to define what kind of tokens we are interested in. Let's consider unigrams, bigrams and trigrams that satisfy the following criteria (the functions below must be defined before preprocess is called):

unigrams
- a noun whose lemma has 3 or more characters (<NN*>)
bigrams
- a number followed by a noun (<CD><NN*>, e.g. 6.1 billion)
- a noun followed by a noun (<NN*><NN*>, e.g. milk chocolate)
- an adjective followed by a noun (<JJ><NN*>, e.g. beautiful flowers)
trigrams
- 3 consecutive nouns (<NN*><NN*><NN*>)
- an adjective followed by 2 nouns (<JJ><NN*><NN*>, e.g. white motor yacht)
- 2 adjectives followed by a noun (<JJ><JJ><NN*>, e.g. big old house)
- a noun followed by a preposition or subordinating conjunction, followed by a noun (<NN*><IN><NN*>, e.g. quality of service)

def is_good_unigram(unigram):
    """Check if the provided unigram satisfies the criteria."""
    lemma, word, tag = unigram
 
    if not tag.startswith("NN"):
        return False
    if len(lemma) < 3:
        return False
    return True
 
def is_good_bigram(bigram):
    """Check if the provided bigram satisfies the criteria."""
    tags = bigram[2].split("~")
 
    if tags[0] == "CD" and tags[1].startswith("NN"):
        return True
 
    if tags[0].startswith("NN") and tags[1].startswith("NN"):
        return True
 
    if tags[0].startswith("JJ") and tags[1].startswith("NN"):
        return True
 
    return False
 
def is_good_trigram(trigram):
    """Check if the provided trigram satisfies the criteria."""
    tags = trigram[2].split("~")
 
    if tags[0].startswith("NN") and tags[1].startswith("NN") and tags[2].startswith("NN"):
        return True
 
    if tags[0] == "JJ" and tags[1].startswith("NN") and tags[2].startswith("NN"):
        return True
 
    if tags[0] == "JJ" and tags[1] == "JJ" and tags[2].startswith("NN"):
        return True
 
    if tags[0].startswith("NN") and tags[1] == "IN" and tags[2].startswith("NN"):
        return True
 
    return False
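
The bigram and trigram checks can also be written as data rather than as chains of if statements. The following is a sketch of such a table-driven variant, not a drop-in replacement: the names PATTERNS and matches_pattern are ours, and every tag is matched by prefix, so JJR or JJS would also count as adjectives, which is slightly looser than the trigram function above.

```python
# Each pattern is a tuple of tag prefixes; "NN" also matches NNS, NNP, NNPS.
PATTERNS = {
    2: [("CD", "NN"), ("NN", "NN"), ("JJ", "NN")],
    3: [("NN", "NN", "NN"), ("JJ", "NN", "NN"),
        ("JJ", "JJ", "NN"), ("NN", "IN", "NN")],
}

def matches_pattern(tags):
    """Check whether a sequence of POS tags fits one of the patterns."""
    return any(
        all(tag.startswith(prefix) for tag, prefix in zip(tags, pattern))
        for pattern in PATTERNS.get(len(tags), [])
    )

print(matches_pattern(["JJ", "NN"]))         # True
print(matches_pattern(["NN", "IN", "NNS"]))  # True
print(matches_pattern(["VBZ", "DT"]))        # False
```

Adding a new pattern then means adding one tuple to the table. The unigram rule is left out here because it also depends on the length of the lemma.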

After the preprocessing step, all our tokens have the form (lemma, word, tag). We will perform estimations on lemmas only, so we build a dictionary to convert lemmas back to words. (Many word forms can share the same lemma. For the analysis we don't want to distinguish between them, but for the output it might be interesting to see which forms appeared in the text.)

import numpy as np
from collections import defaultdict, Counter
 
# lemma -> Counter of the word forms seen for that lemma
lemma_reverse = defaultdict(Counter)
 
for tokens in fgbg:
    for lemma, word, pos in tokens:
        lemma_reverse[lemma][word] += 1
 
 
# keep only the lemma of each token; real lists can be iterated more than once
fgbg_lemmas = [[lemma for lemma, word, pos in tokens] for tokens in fgbg]
fgbg_labels = np.asarray(fgbg_labels)

Our methods are based on frequency analysis of the tokens in the corpus, so we need to count token occurrences in each document and convert the list of documents into a matrix of counts. In the code snippet below I use CountVectorizer from the popular Python library scikit-learn, but it's straightforward to do that in pure Python as well.

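from sklearn.feature_extraction.text import CountVectorizer
 
def identity(tokens):
    """Documents are already lists of lemmas, so pass them through unchanged."""
    return tokens
 
# binary=True records only whether a lemma occurs in a document
vectorizer_binary = CountVectorizer(binary=True, preprocessor=identity, tokenizer=identity)
counts_binary = vectorizer_binary.fit_transform(fgbg_lemmas)
feature_names = vectorizer_binary.get_feature_names()
 
# binary=False records how many times a lemma occurs in a document
vectorizer_full = CountVectorizer(binary=False, preprocessor=identity, tokenizer=identity)
counts_full = vectorizer_full.fit_transform(fgbg_lemmas)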
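
For readers who prefer to avoid the dependency, the same two matrices can be sketched in pure Python. This version builds dense lists rather than the sparse matrices that CountVectorizer returns, so it is only meant for small inputs; the function name count_documents is ours.

```python
from collections import Counter

def count_documents(docs):
    """Return (vocabulary, full_counts, binary_counts) for tokenized docs."""
    vocabulary = sorted(set(token for doc in docs for token in doc))
    index = {token: i for i, token in enumerate(vocabulary)}
    full, binary = [], []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token, n in Counter(doc).items():
            row[index[token]] = n
        full.append(row)
        binary.append([1 if n else 0 for n in row])
    return vocabulary, full, binary

docs = [["bank", "rate", "bank"], ["rate", "oil"]]
vocabulary, full, binary = count_documents(docs)
print(vocabulary)  # ['bank', 'oil', 'rate']
print(full)        # [[2, 0, 1], [0, 1, 1]]
print(binary)      # [[1, 0, 1], [0, 1, 1]]
```

Each row is a document and each column is a token from the sorted vocabulary.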

We are now ready to apply our methods, but let's first build a function that prints the top domain-specific terms from the ranked list produced by each algorithm:

def print_top_ranked(ranking, scores, feature_names, size=10, glue="~"):
    """Print the top-ranked lemmas with their 2 most frequent forms."""
    print("-----------------------------------")
    print("Top up to %d lemmas by score" % size)
    print("-----------------------------------")
    for idx in ranking[:size]:
        lemma = feature_names[idx]
        forms = [w for w, cnt in lemma_reverse[lemma].most_common(2)]
        print("%-40s %-20s (%s)" % (lemma.replace(glue, " "), scores[idx], ",".join(forms or [lemma])))
    print("--------------------------------")

Domain Specificity method

The Domain Specificity method was proposed by Park et al. (2008). It directly compares term frequencies in documents from a given domain with term frequencies in a general document collection. We can define the domain specificity of a token as the relative probability of occurrence of the token in domain text versus general text.

$\text{domain\_specificity}(\text{token}) = \frac{p_{fg}(\text{token})}{p_{bg}(\text{token})} = \frac{\frac{count_{fg}(\text{token})}{N_{fg}}}{\frac{count_{bg}(\text{token})}{N_{bg}}}$

where $p_{fg}(\text{token})$ is the probability of the token in the domain-specific (foreground) corpus, and $p_{bg}(\text{token})$ is the probability of the token in the general (background) corpus. Using maximum likelihood estimation (MLE), each probability is the number of occurrences of the token, $count_{*}(\text{token})$, divided by the total number of tokens $N_{*}$ in the domain corpus and in the general corpus respectively. In our implementation, if $count_{bg}(\text{token}) = 0$, the denominator is set to $1$, so a token that never occurs in the background corpus is scored by its foreground probability alone.

def compute_domain_specificity(counts, labels):
    """Compute domain specificity score for each token in the corpus
    according to Park et al. (2008).
    """
    fg_counts = np.asarray(counts[labels == 1].sum(axis=0)).reshape(-1)
    fg_total = fg_counts.sum()
    bg_counts = np.asarray(counts[labels == 0].sum(axis=0)).reshape(-1)
    bg_total = bg_counts.sum()
 
    fg_probas = fg_counts / float(fg_total)
    bg_probas = bg_counts / float(bg_total)
    # test the counts themselves: tokens absent from the background get a denominator of 1
    bg_probas[bg_counts == 0] = 1
 
    return fg_probas / bg_probas
 
def rank_with_domain_specificity(counts, labels):
    """Get domain specificity scores for tokens and sort them
    in descending order.
    """
    scores_spf = compute_domain_specificity(counts, labels)
    ranking_spf = scores_spf.argsort()[::-1]
 
    return scores_spf, ranking_spf
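
To make the arithmetic concrete, here is a small worked sketch for a single token in pure Python. The counts are made up and were chosen only so that the divisions come out exactly; the function name domain_specificity is ours.

```python
def domain_specificity(fg_count, fg_total, bg_count, bg_total):
    """Ratio of foreground to background probability for one token."""
    p_fg = fg_count / float(fg_total)
    # Same convention as above: the denominator becomes 1 when the
    # token never occurs in the background corpus.
    p_bg = bg_count / float(bg_total) if bg_count > 0 else 1.0
    return p_fg / p_bg

# 8 of 64 foreground tokens versus 2 of 256 background tokens
print(domain_specificity(fg_count=8, fg_total=64, bg_count=2, bg_total=256))  # 16.0
# the same token, but absent from the background corpus
print(domain_specificity(fg_count=8, fg_total=64, bg_count=0, bg_total=256))  # 0.125
```

The second call shows a side effect of the convention: a token that appears only in the foreground corpus gets a much lower score than one that appears rarely in the background.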

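Let's run the Domain Specificity method on our corpus and print the 25 top-scored lemmas, with the 2 most frequent forms of each lemma. Note that we pass the binary counts here, so each count is the number of documents that contain the token: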

scores_spf, ranking_spf = rank_with_domain_specificity(counts_binary, fgbg_labels)
print_top_ranked(ranking_spf, scores_spf, feature_names, 25)
 
-----------------------------------
Top up to 25 lemmas by score
-----------------------------------
system today                             250.149901327        (system~today)
total help                               250.149901327        (total~help)
band                                     172.943141658        (band,bands)
amount of dollar                         157.501789724        (amount~of~dollars,amounts~of~dollars)
day system                               148.236978564        (DAY~SYSTEM,day~System)
money market shortage                    148.236978564        (MONEY~MARKET~SHORTAGE,money~market~shortage)
currency stability                       129.707356244        (currency~stability,CURRENCY~STABILITY)
money market dealer                      129.707356244        (Money~market~dealers,money~market~dealers)
afternoon session                        129.707356244        (afternoon~session)
market dealer                            129.707356244        (market~dealers,market~dealer)
accord on currency                       129.707356244        (accord~on~currency,accords~on~currency)
discount window                          120.442545083        (discount~window,DISCOUNT~WINDOW)
major nation                             120.442545083        (major~nations,MAJOR~NATIONS)
lower house                              111.177733923        (Lower~House,lower~house)
forecast revised                         111.177733923        (FORECAST~REVISED)
1 money                                  111.177733923        (1~money,1~MONEY)
security repurchase                      111.177733923        (securities~repurchase,security~repurchase)
bank discount                            111.177733923        (BANK~DISCOUNT,bank~discount)
federal reserve bank                     111.177733923        (Federal~Reserve~Bank)
bank bill                                104.229125553        (bank~bills,bank~bill)
free reserve                             101.912922763        (free~reserves,FREE~RESERVES)
money market intervention                101.912922763        (money~market~intervention,MONEY~MARKET~INTERVENTION)
senior dealer                            101.912922763        (senior~dealer,senior~dealers)
22 paris                                 101.912922763        (22~Paris)
west german interest                     101.912922763        (West~German~interest)
--------------------------------

Domain-Specific TF-IDF method

The Domain-Specific TF-IDF method used in this article was proposed by Su Nam Kim et al. (2009). It is an unsupervised method based on TF-IDF. The basic underlying idea is that domain-specific terms occur in a particular domain with markedly higher frequency than they do in other domains, similar to the term frequency patterns captured by TF-IDF.

The term frequency (TF) of a token in the domain corpus is calculated via:

$TF(\text{token}) = \frac{count_{fg}(\text{token})}{\sum_{t \in fg} count_{fg}(t)}$

where $count_{fg}(\text{token})$ is the number of occurrences of the token in the domain (foreground) corpus.

The inverse domain frequency (IDF) is calculated via:

$IDF(\text{token}) = \log \frac{|fgbg|}{1 + |\{d \in fgbg : \text{token} \in d\}|}$

where $fgbg$ is the set of all documents (from foreground and background corpora).

The final TF-IDF value of a given token is the simple product of TF and IDF:

$\text{score}(\text{token}) = TF(\text{token}) \cdot IDF(\text{token})$

def compute_tfidf_score(binary_counts, full_counts, labels):
    """Compute domain-specific TF-IDF score for each token in the corpus
    according to Su Nam Kim et al. (2009).
    """
    fg_counts = np.asarray(full_counts[labels == 1].sum(axis=0)).reshape(-1)
    fg_total = fg_counts.sum()
    tf = fg_counts / float(fg_total)
 
    fgbg_counts = np.asarray(binary_counts.sum(axis=0)).reshape(-1)
    idf = np.log(binary_counts.shape[0] / (1.0 + fgbg_counts))
 
    return tf * idf
 
 
def rank_with_tfidf(counts, counts_full, labels):
    """Get domain-specific TF-IDF scores for tokens and sort them
    in descending order.
    """
    scores_tfidf = compute_tfidf_score(counts, counts_full, labels)
    ranking_tfidf = scores_tfidf.argsort()[::-1]
 
    return scores_tfidf, ranking_tfidf
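
The same score can be sketched for a single token in pure Python, which makes it easy to check the formulas by hand. The toy documents and the function name domain_tfidf are our own assumptions and have nothing to do with the Reuters data.

```python
import math

def domain_tfidf(token, fg_docs, bg_docs):
    """Domain-specific TF-IDF of a token for tokenized fg and bg documents."""
    # TF: share of the token among all foreground token occurrences
    fg_total = sum(len(doc) for doc in fg_docs)
    tf = sum(doc.count(token) for doc in fg_docs) / float(fg_total)
    # IDF: log of (number of documents / (1 + documents containing the token))
    all_docs = fg_docs + bg_docs
    doc_freq = sum(1 for doc in all_docs if token in doc)
    idf = math.log(len(all_docs) / (1.0 + doc_freq))
    return tf * idf

fg_docs = [["bank", "rate", "bank"], ["bank", "dollar"]]
bg_docs = [["oil", "price"], ["oil", "rate"], ["wheat", "crop"], ["crop", "price"]]
for token in ["bank", "rate", "oil"]:
    print("%-5s %.4f" % (token, domain_tfidf(token, fg_docs, bg_docs)))
```

Here "bank" scores highest because it makes up 3 of the 5 foreground tokens, "rate" scores lower, and "oil" scores zero because it never occurs in the foreground documents.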

Let's run the Domain-Specific TF-IDF method on our corpus and print the 25 top-scored lemmas, with the 2 most frequent forms of each lemma:

scores_tfidf, ranking_tfidf = rank_with_tfidf(counts_binary, counts_full, fgbg_labels)
print_top_ranked(ranking_tfidf, scores_tfidf, feature_names, 25)
 
-----------------------------------
Top up to 25 lemmas by score
-----------------------------------
bank                                     0.0369874904299      (Bank,bank)
dollar                                   0.035401710765       (dollar,dollars)
rate                                     0.0265577608188      (rate,rates)
currency                                 0.022461129089       (currency,currencies)
money                                    0.0223101607208      (money,MONEY)
market                                   0.0208501955881      (market,markets)
yen                                      0.0189193790625      (yen,YEN)
fed                                      0.0188461776281      (Fed,FED)
dealer                                   0.01660549817        (dealers,Dealers)
central bank                             0.0158485127312      (central~bank,central~banks)
stg                                      0.0158294310816      (stg,STG)
exchange                                 0.0143544510312      (exchange,Exchange)
japan                                    0.0143076802268      (Japan,JAPAN)
pct                                      0.014077863384       (pct,PCT)
money market                             0.0139880189231      (money~market,MONEY~MARKET)
mln stg                                  0.0132203922713      (mln~stg,MLN~STG)
reserve                                  0.0127344930092      (reserves,Reserve)
exchange rate                            0.0118593483746      (exchange~rate,exchange~rates)
policy                                   0.0117568063381      (policy,policies)
bundesbank                               0.0115089456151      (Bundesbank,BUNDESBANK)
week                                     0.0108928669553      (week,weeks)
mark                                     0.0108900895939      (marks,mark)
paris                                    0.0104693464516      (Paris,PARIS)
treasury                                 0.0102797125621      (Treasury,treasury)
baker                                    0.0101801818088      (Baker,BAKER)
--------------------------------

Conclusion

As you can see, the two methods produce different results, but both are fairly close to what we would expect of terms from the financial news domain. By the way, it seems that a lot of the financial news on Reuters in 1987 was devoted to German finance (Bundesbank, mark, west german interest).

Of course, there are a number of noisy terms that aren't obviously connected to the finance domain but co-occurred in texts from the corpus (e.g. total help, week). You might want to filter them out.

In general, I can say that the methods are very fast and provide good enough results for a first iteration. You may improve the quality of the results by using a better POS tagger and by filtering for other token patterns. As a next step I would also consider methods such as Latent Dirichlet Allocation (LDA), Word2Vec, and others.

Y. Park, S. Patwardhan, K. Visweswariah and S. C. Gates. An Empirical Analysis of Word Error Rate and Keyword Error Rate. In Proceedings of ICSLP. (2008)
Su Nam Kim, Timothy Baldwin and Min-Yen Kan. An Unsupervised Approach to Domain-Specific Term Extraction. In Proceedings of the Australasian Language Technology Association Workshop (ALTW:B), pp. 94-98 (2009)