Automatic term extraction for domain-specific corpora
Using simple frequency-based methods, such as the Domain Specificity method and Domain-Specific TF-IDF, it is possible to automatically extract and score terms for a given domain-specific corpus. In this article, we will use Python and its ecosystem to illustrate these methods in action.
Let's start with the definition of domain-specific terms. If a term occurs relatively more frequently in domain-specific text than in non-domain text, the term is regarded as domain-specific. The goal of domain-specific term extraction is to automatically extract such terms from a given corpus.
An introduction to text preprocessing
Usually, texts from a corpus can't be used as-is; they need additional preparation steps dictated by the specific task we want to perform. Let's briefly go through these steps.
Tokenization
Tokenization is the process of splitting a text into individual words, sequences of words (n-grams), symbols, or other meaningful elements called tokens. However, it is sometimes difficult to define what is meant by a "word", and the answer can vary between problems. Tokenization is also a language-specific problem: in most cases an approach that works for English won't work for Chinese, and vice versa.
There are many ways to tokenize a text. The simplest is to split the text on punctuation or whitespace characters (e.g. spaces and line breaks); NLTK's word_tokenize, shown below, is a more robust option:
```python
from nltk import word_tokenize

text = word_tokenize("This is an example.")
print text
# ['This', 'is', 'an', 'example', '.']
```
Part-of-speech tagging
Part-of-speech tagging (or POS tagging) is the process of assigning a part of speech to each word. It is harder than just looking words up in a list with their parts of speech, because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken.
Why is POS tagging important for us? Because we want to select candidates for our domain-specific terms based on POS patterns in n-grams, or simply by keeping only nouns.
There are several popular POS taggers for the English language that you might be interested in checking out; here is NLTK's default pos_tag in action:
```python
from nltk import pos_tag, word_tokenize

text = word_tokenize("This is an example.")
print pos_tag(text)
# [('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('.', '.')]
```
Stemming and Lemmatization
Stemming is a crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes. A more complex approach to determining the stem of a word is lemmatization. It tries to do things properly: it uses a vocabulary, performs morphological analysis, and applies different normalization rules for each part of speech. The canonical form returned by the lemmatization process is known as the lemma.
The most common algorithm for stemming English is the Porter Stemming Algorithm by Martin Porter (1980). You can find many implementations of this algorithm, as well as other approaches based on stochastic algorithms, n-grams, and so on.
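For illustration, here is a minimal sketch with NLTK's PorterStemmer; the example words and the expected outputs in the comments are my own and may vary slightly between NLTK versions. The WordNet lemmatizer, shown in the next snippet, solves the same task via vocabulary lookup and morphological analysis.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print stemmer.stem("caresses")    # caress  (SSES -> SS)
print stemmer.stem("ponies")      # poni    (IES -> I)
print stemmer.stem("relational")  # relate  (ATIONAL -> ATE)
```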
```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print lemmatizer.lemmatize("corpora")
# corpus
```
Word normalization
Word normalization is performed to aggregate all the different expressions of the same concept. In speech transcripts, many words are used in several different variations, such as inflections, abbreviations, and alternative spellings (e.g., UK and US spellings). To get good results it makes sense to identify all such variations of a word and aggregate them into a canonical form.
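As a minimal illustration (the mapping below is a made-up example and nowhere near exhaustive), normalization can start as a simple lookup table applied after lowercasing:

```python
# a hypothetical, hand-crafted normalization table (illustrative only)
NORMALIZATION_MAP = {
    "colour": "color",   # UK -> US spelling
    "u.k.": "uk",        # abbreviation variant
    "sterling": "stg",   # domain-specific alias
}

def normalize(word):
    """Return the canonical form of a word if we know one."""
    word = word.lower()
    return NORMALIZATION_MAP.get(word, word)

print normalize("Colour")  # color
```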
Other
There are many other preprocessing steps that are not mentioned in this article. If you are interested in more details, check the Preprocessing section of the Text Analysis with Topic Models for the Humanities and Social Sciences tutorial by Allen Riddell.
Our pipeline
In this example we use the Reuters-21578 Text Categorization Collection, a collection of documents that appeared on the Reuters newswire in 1987. It contains 10,788 news documents indexed with 90 categories. You can use this collection directly from NLTK, which provides already-tokenized documents and the ability to filter by category.
The foreground corpus (fg, the domain-specific corpus) is built from documents indexed with the categories money-fx and money-supply, so our domain is expected to be finance-related. All other documents form our background corpus (bg, a general corpus without any domain specificity). There are 883 documents in the foreground corpus and 9,905 documents in the background corpus.
Preparing good foreground and background corpora is not a trivial problem, and there are many requirements and rules on how to do so, but this is beyond the scope of this article.
```python
from nltk.corpus import reuters

fg_files = reuters.fileids(["money-fx", "money-supply"])
bg_files = filter(lambda file_id: file_id not in fg_files, reuters.fileids())

fgbg_files = fg_files + bg_files
fgbg_labels = [1] * len(fg_files) + [0] * len(bg_files)
```
Our preprocessing will consist of tokenization (done implicitly by NLTK), part-of-speech tagging, and lemmatization of each word:
```python
import itertools

import nltk.tag
from nltk.tag.perceptron import PerceptronTagger
from nltk.stem import WordNetLemmatizer

tagger = PerceptronTagger()
lemmatizer = WordNetLemmatizer()


def preprocess(file_id):
    """Get text from file_id and perform basic preprocessing steps."""
    bag_of_words = reuters.words(file_id)
    tagged = nltk.tag._pos_tag(bag_of_words, None, tagger)

    # calculate unigrams, bigrams and trigrams,
    # and unify their form by joining words with "~"
    unigrams = map(lambda (w, t): (lemmatizer.lemmatize(w.lower()), w, t), tagged)
    bigrams = map(lambda token: ("~".join(tt[0] for tt in token),
                                 "~".join(tt[1] for tt in token),
                                 "~".join(tt[2] for tt in token)),
                  zip(unigrams, unigrams[1:]))
    trigrams = map(lambda token: ("~".join(tt[0] for tt in token),
                                  "~".join(tt[1] for tt in token),
                                  "~".join(tt[2] for tt in token)),
                   zip(unigrams, unigrams[1:], unigrams[2:]))

    # keep only n-grams that fit the POS patterns
    filtered = (filter(is_good_unigram, unigrams) +
                filter(is_good_bigram, bigrams) +
                filter(is_good_trigram, trigrams))
    return filtered


fgbg = map(preprocess, fgbg_files)
print fgbg[1]
# [(u'money', u'MONEY', 'NNP'), (u'market', u'MARKET', 'NNP'),
#  (u'deficit', u'DEFICIT', 'NNP'), ...
#  (u'money~market', u'MONEY~MARKET', 'NNP~NNP'),
#  (u'market~deficit', u'MARKET~DEFICIT', 'NNP~NNP'),
#  (u'deficit~forecast', u'DEFICIT~FORECAST', 'NNP~NNP'),
#  (u'forecast~at', u'FORECAST~AT', 'NNP~NNP'),
#  (u'250~mln', u'250~MLN', 'CD~NNP'), ...]
```
Please note that NLTK is not a well-optimized library and in many cases is not a good fit for production use. For instance, if you use the nltk.pos_tag function directly, it will load the POS tagger from disk on every call (which costs about 15 seconds per call). To avoid that, we bypass the public function and call the private _pos_tag directly with an already-loaded tagger.
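An alternative sketch that stays on the public API (not what the pipeline above uses): keep a single PerceptronTagger instance and call its tag() method, which is essentially what _pos_tag does when no tagset mapping is requested. The fast_pos_tag helper name here is mine.

```python
from nltk.tag.perceptron import PerceptronTagger

# load the model from disk once and reuse the instance
tagger = PerceptronTagger()

def fast_pos_tag(tokens):
    """Tag an already tokenized sentence without reloading the model."""
    return tagger.tag(tokens)

print fast_pos_tag(["This", "is", "an", "example", "."])
```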
Now it's time to define what kind of tokens we are interested in. Let's consider unigrams, bigrams and trigrams that satisfy the following criteria:
- unigrams
  - a noun whose lemma is 3 characters or longer (<NN*>)
- bigrams
  - a number followed by a noun (<CD><NN*>, e.g. 6.1 billion)
  - a noun followed by a noun (<NN*><NN*>, e.g. milk chocolate)
  - an adjective followed by a noun (<JJ><NN*>, e.g. beautiful flowers)
- trigrams
  - 3 consecutive nouns (<NN*><NN*><NN*>)
  - an adjective followed by 2 nouns (<JJ><NN*><NN*>, e.g. white motor yacht)
  - 2 adjectives followed by a noun (<JJ><JJ><NN*>, e.g. big old house)
  - a noun followed by a preposition or subordinating conjunction followed by a noun (<NN*><IN><NN*>, e.g. quality of service)
```python
def is_good_unigram(unigram):
    """Check if the provided unigram satisfies the criteria."""
    lemma, word, tag = unigram
    if not tag.startswith("NN"):
        return False
    if len(lemma) < 3:
        return False
    return True


def is_good_bigram(bigram):
    """Check if the provided bigram satisfies the criteria."""
    tags = bigram[2].split("~")
    if tags[0] == "CD" and tags[1].startswith("NN"):
        return True
    if tags[0].startswith("NN") and tags[1].startswith("NN"):
        return True
    if tags[0].startswith("JJ") and tags[1].startswith("NN"):
        return True
    return False


def is_good_trigram(trigram):
    """Check if the provided trigram satisfies the criteria."""
    tags = trigram[2].split("~")
    if tags[0].startswith("NN") and tags[1].startswith("NN") and tags[2].startswith("NN"):
        return True
    if tags[0] == "JJ" and tags[1].startswith("NN") and tags[2].startswith("NN"):
        return True
    if tags[0] == "JJ" and tags[1] == "JJ" and tags[2].startswith("NN"):
        return True
    if tags[0].startswith("NN") and tags[1] == "IN" and tags[2].startswith("NN"):
        return True
    return False
```
After the preprocessing step we have all our tokens in the form (lemma, word, tag). We will perform estimations on lemmas only, so we build a dictionary to convert back from lemmas to words. (Many word forms can share the same lemma; for the analysis we don't want to distinguish between them, but for the output it is interesting to see which forms actually appeared in the text.)
```python
import numpy as np
from collections import defaultdict, Counter
from operator import itemgetter

lemma_reverse = defaultdict(Counter)
for tokens in fgbg:
    for lemma, word, pos in tokens:
        lemma_reverse[lemma][word] += 1

fgbg_lemmas = map(lambda tokens: map(itemgetter(0), tokens), fgbg)
fgbg_labels = np.asarray(fgbg_labels)
```
Our methods are based on frequency analysis of the tokens in the corpus, so we need to count how often each token appears in each document and convert the list of documents into a matrix of counts. In the code snippet below I use CountVectorizer from the popular Python library scikit-learn, but it's straightforward to do the same in pure Python (see the sketch after the snippet).
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_binary = CountVectorizer(binary=True, preprocessor=lambda t: t, tokenizer=lambda t: t)
counts_binary = vectorizer_binary.fit_transform(fgbg_lemmas)
feature_names = vectorizer_binary.get_feature_names()

vectorizer_full = CountVectorizer(binary=False, preprocessor=lambda t: t, tokenizer=lambda t: t)
counts_full = vectorizer_full.fit_transform(fgbg_lemmas)
```
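For reference, here is a rough pure-Python equivalent of the vectorization step; count_tokens is a hypothetical helper that returns a dense list-of-lists instead of a sparse matrix, so it is only a sketch suitable for small corpora:

```python
from collections import Counter

def count_tokens(docs, binary=False):
    """Build a vocabulary and a dense document-term count matrix."""
    vocabulary = sorted({token for doc in docs for token in doc})
    index = {token: j for j, token in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [0] * len(vocabulary)
        for token, cnt in counts.items():
            row[index[token]] = 1 if binary else cnt
        rows.append(row)
    return vocabulary, rows
```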
We are now ready to apply our methods, but first let's build a helper function that prints the top domain-specific terms from a ranked list produced by such an algorithm:
```python
def print_top_ranked(ranking, scores, feature_names, size=10, glue="~"):
    """Print top ranked lemmas and their 2 most frequent forms."""
    print "-----------------------------------"
    print "Top up to %d lemmas by score" % size
    print "-----------------------------------"
    for idx in ranking[:size]:
        lemma = feature_names[idx]
        forms = [w for w, cnt in lemma_reverse[lemma].most_common(2)]
        print "%-40s %-20s (%s)" % (lemma.replace(glue, " "), scores[idx], ",".join(forms or [lemma]))
    print "--------------------------------"
```
Domain Specificity method
The Domain Specificity method was proposed by Park et al. (2008). It directly compares term frequencies in documents from a given domain with term frequencies in a general document collection. We define the domain specificity of a token as the relative probability of its occurrence in domain-specific text versus general text:
$$\mathrm{domain\_specificity}(token) = \frac{p_{fg}(token)}{p_{bg}(token)} = \frac{count_{fg}(token) / N_{fg}}{count_{bg}(token) / N_{bg}}$$
where $p_{fg}(token)$ is the probability of the token in the domain-specific (foreground) corpus, and $p_{bg}(token)$ is the probability of the token in the general (background) corpus. With MLE estimation, each probability is the number of occurrences of the token, $count_{*}(token)$, divided by the total number of tokens $N_{*}$ in the domain corpus and the general corpus respectively. In our implementation, if $count_{bg}(token) = 0$ we set the denominator to 1.
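For example (with made-up numbers): if a token occurs 5 times in a 1,000-token foreground corpus and 2 times in a 10,000-token background corpus, its domain specificity is $(5/1000)/(2/10000) = 25$.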
```python
def compute_domain_specificity(counts, labels):
    """Compute domain specificity score for each token in the corpus
    according to Park et al. (2008).
    """
    fg_counts = np.asarray(counts[labels == 1].sum(axis=0)).reshape(-1)
    fg_total = fg_counts.sum()
    bg_counts = np.asarray(counts[labels == 0].sum(axis=0)).reshape(-1)
    bg_total = bg_counts.sum()

    fg_probas = fg_counts / float(fg_total)
    bg_probas = bg_counts / float(bg_total)
    bg_probas[np.isclose(bg_probas, 0)] = 1
    return fg_probas / bg_probas


def rank_with_domain_specificity(counts, labels):
    """Get domain specificity scores for tokens and sort them
    in descending order.
    """
    scores_spf = compute_domain_specificity(counts, labels)
    ranking_spf = scores_spf.argsort()[::-1]
    return scores_spf, ranking_spf
```
Let's run the Domain Specificity method on our corpus and print the 25 top-scored lemmas, and for each lemma output its 2 most frequent forms:
```python
scores_spf, ranking_spf = rank_with_domain_specificity(counts_binary, fgbg_labels)
print_top_ranked(ranking_spf, scores_spf, feature_names, 25)
```

```
-----------------------------------
Top up to 25 lemmas by score
-----------------------------------
system today                 250.149901327   (system~today)
total help                   250.149901327   (total~help)
band                         172.943141658   (band,bands)
amount of dollar             157.501789724   (amount~of~dollars,amounts~of~dollars)
day system                   148.236978564   (DAY~SYSTEM,day~System)
money market shortage        148.236978564   (MONEY~MARKET~SHORTAGE,money~market~shortage)
currency stability           129.707356244   (currency~stability,CURRENCY~STABILITY)
money market dealer          129.707356244   (Money~market~dealers,money~market~dealers)
afternoon session            129.707356244   (afternoon~session)
market dealer                129.707356244   (market~dealers,market~dealer)
accord on currency           129.707356244   (accord~on~currency,accords~on~currency)
discount window              120.442545083   (discount~window,DISCOUNT~WINDOW)
major nation                 120.442545083   (major~nations,MAJOR~NATIONS)
lower house                  111.177733923   (Lower~House,lower~house)
forecast revised             111.177733923   (FORECAST~REVISED)
1 money                      111.177733923   (1~money,1~MONEY)
security repurchase          111.177733923   (securities~repurchase,security~repurchase)
bank discount                111.177733923   (BANK~DISCOUNT,bank~discount)
federal reserve bank         111.177733923   (Federal~Reserve~Bank)
bank bill                    104.229125553   (bank~bills,bank~bill)
free reserve                 101.912922763   (free~reserves,FREE~RESERVES)
money market intervention    101.912922763   (money~market~intervention,MONEY~MARKET~INTERVENTION)
senior dealer                101.912922763   (senior~dealer,senior~dealers)
22 paris                     101.912922763   (22~Paris)
west german interest         101.912922763   (West~German~interest)
--------------------------------
```
Domain-Specific TF-IDF method
The Domain-Specific TF-IDF method used in this article was proposed by Su Nam Kim et al. (2009). It is an unsupervised method based on TF-IDF. The underlying idea is that domain-specific terms occur in a particular domain with a markedly higher frequency than they do in other domains, which is exactly the kind of pattern TF-IDF captures.
The term frequency (TF) of a token in the domain corpus is calculated as:
$$TF(token) = \frac{count_{fg}(token)}{\sum_{t \in fg} count_{fg}(t)}$$
where $count_{fg}(token)$ is the number of occurrences of the token in the domain (foreground) corpus.
The inverse domain frequency (IDF) is calculated via:
$$IDF(token) = \log\left(\frac{|fgbg|}{1 + |\{d \in fgbg : token \in d\}|}\right)$$
where $fgbg$ is the set of all documents (from the foreground and background corpora).
The final TF-IDF value of a given token is the simple product of TF and IDF:
$$score(token) = TF(token) \cdot IDF(token)$$
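For illustration (with made-up numbers): if a token accounts for 1% of all foreground tokens ($TF = 0.01$) and appears in 100 of our 10,788 documents, then $IDF = \log(10788 / 101) \approx 4.67$ and the score is about $0.047$ (using the natural logarithm, as np.log does in the implementation below).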
```python
def compute_tfidf_score(binary_counts, full_counts, labels):
    """Compute domain-specific TF-IDF score for each token in the corpus
    according to Su Nam Kim et al. (2009).
    """
    fg_counts = np.asarray(full_counts[labels == 1].sum(axis=0)).reshape(-1)
    fg_total = fg_counts.sum()
    tf = fg_counts / float(fg_total)

    fgbg_counts = np.asarray(binary_counts.sum(axis=0)).reshape(-1)
    idf = np.log(binary_counts.shape[0] / (1.0 + fgbg_counts))
    return tf * idf


def rank_with_tfidf(counts, counts_full, labels):
    """Get domain-specific TF-IDF scores for tokens and sort them
    in descending order.
    """
    scores_tfidf = compute_tfidf_score(counts, counts_full, labels)
    ranking_tfidf = scores_tfidf.argsort()[::-1]
    return scores_tfidf, ranking_tfidf
```
Let's run the Domain-Specific TF-IDF method on our corpus and print the 25 top-scored lemmas, and for each lemma output its 2 most frequent forms:
```python
scores_tfidf, ranking_tfidf = rank_with_tfidf(counts_binary, counts_full, fgbg_labels)
print_top_ranked(ranking_tfidf, scores_tfidf, feature_names, 25)
```

```
-----------------------------------
Top up to 25 lemmas by score
-----------------------------------
bank             0.0369874904299   (Bank,bank)
dollar           0.035401710765    (dollar,dollars)
rate             0.0265577608188   (rate,rates)
currency         0.022461129089    (currency,currencies)
money            0.0223101607208   (money,MONEY)
market           0.0208501955881   (market,markets)
yen              0.0189193790625   (yen,YEN)
fed              0.0188461776281   (Fed,FED)
dealer           0.01660549817     (dealers,Dealers)
central bank     0.0158485127312   (central~bank,central~banks)
stg              0.0158294310816   (stg,STG)
exchange         0.0143544510312   (exchange,Exchange)
japan            0.0143076802268   (Japan,JAPAN)
pct              0.014077863384    (pct,PCT)
money market     0.0139880189231   (money~market,MONEY~MARKET)
mln stg          0.0132203922713   (mln~stg,MLN~STG)
reserve          0.0127344930092   (reserves,Reserve)
exchange rate    0.0118593483746   (exchange~rate,exchange~rates)
policy           0.0117568063381   (policy,policies)
bundesbank       0.0115089456151   (Bundesbank,BUNDESBANK)
week             0.0108928669553   (week,weeks)
mark             0.0108900895939   (marks,mark)
paris            0.0104693464516   (Paris,PARIS)
treasury         0.0102797125621   (Treasury,treasury)
baker            0.0101801818088   (Baker,BAKER)
--------------------------------
```
Conclusion
As you can see above, the results from the two methods differ, but both are reasonably close to what we would expect from the financial news domain. Incidentally, it seems that a lot of the financial news on Reuters in 1987 was dedicated to German finance (Bundesbank, mark, west german interest).
Of course, there are a number of noisy terms that aren't obviously connected to the finance domain but simply co-occur in texts from the corpus (e.g. total help, week). You might want to filter them out.
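A simple way to do that is to drop known-noisy lemmas from the ranking before printing. A minimal sketch (the stoplist below is a hypothetical, hand-curated example):

```python
# hypothetical hand-curated stoplist of noisy lemmas (n-grams joined with "~" as in our pipeline)
noisy_lemmas = {"total~help", "system~today", "week"}

ranking_spf_clean = [idx for idx in ranking_spf
                     if feature_names[idx] not in noisy_lemmas]
print_top_ranked(ranking_spf_clean, scores_spf, feature_names, 25)
```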
In general, the methods are very fast and provide good enough results for a first iteration. You can increase the quality of the results by using a better POS tagger and by filtering for other POS patterns in the tokens. As a next step, I would also consider methods such as Latent Dirichlet Allocation (LDA), Word2Vec, and others.
Read More
- Y. Park, S. Patwardhan, K. Visweswariah and S. C. Gates. An Empirical Analysis of Word Error Rate and Keyword Error Rate. In Proceedings of ICSLP. (2008)
- Su Nam Kim, Timothy Baldwin and Min-Yen Kan. An Unsupervised Approach to Domain-Specific Term Extraction. In Proceedings of the Australasian Language Technology Association Workshop (ALTW:B), pp. 94-98 (2009)