Realtime Twitter Sentiment Analysis with Storm and Elasticsearch

Twitter analysis has become one of the most popular fields of research because Twitter is used by millions of people across the globe. Many interesting scientific and research problems have been solved by analyzing how and what people tweet. In this article we want to show how easy it can be to build a simple processing flow that extracts interesting statistics from Twitter data.

Problem

We want to build a flow that makes it easy to monitor tweet sentiment in realtime. To start, we need to select a technology stack that can do that. One of the most popular frameworks for processing realtime streams is Apache Storm. The current stable version is 0.9.1-incubating. For storing data we choose Elasticsearch.

Twitter Stream

Twitter Inc. provides access to a limited number of tweets as a stream. The stream can be filtered by language, keywords, or location. For our research it is important to have English-only tweets, since we are limited by the linguistic analysis tool.

Sentiment

For sentiment analysis we use Stanford CoreNLP. Stanford CoreNLP provides a set of natural language analysis tools written in Java. It supports 5 levels of sentiment: from 0 (Very Negative) to 4 (Very Positive).

We estimate the mood with a simple formula:

`"Mood" = \frac{-2 * count(VeryNeg) - count(Neg) + count(Pos) + 2 * count(VeryPos)}{count(VeryNeg) + count(Neg) + count(Pos) + count(VeryPos)}`

where all counts are calculated for the desired period.
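The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the flow: the function name `mood` and the `WEIGHTS` table are hypothetical, the input is assumed to be a list of CoreNLP sentiment levels (0 to 4) collected for the desired period, and returning 0.0 when the period contains no non-neutral tweets is our own choice, since the formula is undefined in that case.

```python
from collections import Counter

# Weight per sentiment level, following the formula above.
# Level 2 (neutral) is absent: it affects neither numerator nor denominator.
WEIGHTS = {0: -2, 1: -1, 3: 1, 4: 2}


def mood(levels):
    """Compute the mood for a sequence of sentiment levels (0..4)."""
    counts = Counter(levels)
    # Denominator: number of non-neutral tweets in the period.
    total = sum(counts[level] for level in WEIGHTS)
    if total == 0:
        # Assumption: treat a period without non-neutral tweets as mood 0.0.
        return 0.0
    # Numerator: weighted sum of the per-level counts.
    score = sum(weight * counts[level] for level, weight in WEIGHTS.items())
    return score / total


if __name__ == "__main__":
    # One very negative, one negative, one neutral, one positive, two very positive.
    print(mood([0, 1, 2, 3, 4, 4]))
```

Because neutral tweets are left out of the denominator, the result always lies between -2 and 2, regardless of how many neutral tweets arrive in the period.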

As soon as the `"Mood"` value has been calculated, we can assign a sentiment class:

pos - if Mood > 0.05
neu - if -0.05 <= Mood <= 0.05
neg - if Mood < -0.05
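As a minimal sketch, the class assignment could look like the following. The function name `sentiment_class` and the `threshold` parameter are hypothetical names introduced for illustration. The neutral band is taken to be the closed interval from -0.05 to 0.05, which is what the pos and neg rules imply.

```python
def sentiment_class(mood, threshold=0.05):
    """Map a mood value to one of the classes 'pos', 'neu', or 'neg'."""
    if mood > threshold:
        return "pos"
    if mood < -threshold:
        return "neg"
    # Everything in [-threshold, threshold] counts as neutral.
    return "neu"


if __name__ == "__main__":
    for value in (0.4, 0.05, 0.0, -0.05, -0.2):
        print(value, sentiment_class(value))
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band without touching the comparison logic.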

Versions

Storm 0.9.1
Elasticsearch 1.1
Twitter4J
Stanford CoreNLP 3.5.2

References

Apache Storm is a free and open source distributed realtime computation system.
Manning, Christopher D., Surdeanu, Mihai, Bauer, John, Finkel, Jenny, Bethard, Steven J., and McClosky, David. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.