Twitter analysis for Strata+Hadoop World (BCN, 2014) with Apache Spark and D3
Using the official hashtag #StrataHadoop, I've made a basic analysis of Twitter activity during the Strata+Hadoop World conference that was held on 19-21 November 2014 in Barcelona, Spain.
Dataset of tweets
We use the Python library TwitterSearch to obtain a dataset of all tweets that contain the hashtag #StrataHadoop (case insensitive) and were created between 15.10.2014 and 24.11.2014. The result is stratahadoop.json, which you can download here.
It's known that the Twitter Search API is limited and there is no guarantee that we can find all tweets, but we assume that the sample we received is representative enough for our simple analysis.
Analysis using Apache Spark and D3.js
Apache Spark™ is a fast and general engine for large-scale data processing. According to the project, it is up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark supports Java, Scala and Python out of the box. We will use the Spark Python API (PySpark) to mine the dataset of tweets we collected above.
It isn't really necessary to use a framework as powerful as Apache Spark for such a small dataset, but our goal is also to illustrate how easy and comfortable data mining with Apache Spark can be.
Basic dataset characteristics
Let's load the dataset with SparkContext and check the number of tweets:
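A minimal sketch of this step in plain Python, assuming the file stores one JSON-encoded tweet per line (the real layout of stratahadoop.json may differ). The sample records and the helper name `parse_tweets` are made up for illustration; the comment at the end shows roughly how the same step could look in PySpark.

```python
import json

def parse_tweets(lines):
    """Parse an iterable of JSON strings into tweet dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Two made-up records standing in for lines of stratahadoop.json.
sample_lines = [
    '{"id": 1, "text": "Keynotes starting #StrataHadoop"}',
    '',
    '{"id": 2, "text": "Great talk! #stratahadoop"}',
]

tweets = parse_tweets(sample_lines)
tweet_count = len(tweets)
print(tweet_count)

# With PySpark the same step would look roughly like this
# (assuming an existing SparkContext named `sc`):
#   tweets = sc.textFile("stratahadoop.json").map(json.loads)
#   tweets.count()
```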
So, we have 8485 tweets in our dataset. It's interesting to separate out the tweets that were created during the conference dates (created_during_event), because they are expected to be different and more closely related to the conference talks and discussions.
Let's calculate the number of such tweets:
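One way such a predicate could look is sketched below. It assumes each tweet carries the `created_at` string in the Twitter REST API format and that the conference days are compared in Barcelona local time (UTC+1 in November); both the time zone handling and the sample tweets are our assumptions, not necessarily what the original notebook did.

```python
from datetime import datetime, date, timedelta, timezone

# Assumption: conference days are compared in Barcelona local time (UTC+1 in November).
LOCAL_TZ = timezone(timedelta(hours=1))
EVENT_START = date(2014, 11, 19)
EVENT_END = date(2014, 11, 21)

def parse_created_at(tweet):
    """Parse a timestamp such as 'Wed Nov 19 09:15:00 +0000 2014'."""
    return datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

def created_during_event(tweet):
    """True if the tweet was posted on one of the three conference days."""
    local_day = parse_created_at(tweet).astimezone(LOCAL_TZ).date()
    return EVENT_START <= local_day <= EVENT_END

# Made-up sample tweets around the boundaries of the event.
sample_tweets = [
    {"created_at": "Tue Nov 18 22:30:00 +0000 2014"},  # 23:30 local, day before
    {"created_at": "Tue Nov 18 23:30:00 +0000 2014"},  # 00:30 local, first day
    {"created_at": "Thu Nov 20 10:00:00 +0000 2014"},  # second day
    {"created_at": "Fri Nov 21 23:30:00 +0000 2014"},  # 00:30 local, day after
]

during_event = [t for t in sample_tweets if created_during_event(t)]
print(len(during_event), "of", len(sample_tweets))

# In PySpark the predicate could be passed to RDD.filter, e.g.
#   tweets.filter(created_during_event).count()
```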
As a result, there are 5686 tweets (67%) that were created during the conference dates. We will investigate the distribution of tweets by hour in a later section.
To better picture the percentage, we can do a simple visualization with D3:
We can go even deeper and try to filter the tweets that were created inside or near the conference venue. Unfortunately, the Twitter Search API doesn't support query and geocode search at the same time, so we do simple filtering with Spark (close_to_venue) using an approximate bounding box (http://boundingbox.klokantech.com/).
Not every tweet contains information about its geo location, so we first need to filter out the tweets that lack it.
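The two filters could be sketched as follows. We assume the location is stored in the tweet's `coordinates` field as a GeoJSON point in `[longitude, latitude]` order, which is how the Twitter REST API reports it. The bounding box values below are placeholders for illustration, not the box used in the original analysis, and `has_geo` is a hypothetical helper name.

```python
# Placeholder bounding box (west, south, east, north) in degrees.
# Illustrative values only; replace with the box of the actual venue.
VENUE_BOX = (2.215, 41.405, 2.225, 41.415)

def has_geo(tweet):
    """True if the tweet carries point coordinates."""
    return tweet.get("coordinates") is not None

def close_to_venue(tweet, box=VENUE_BOX):
    """True if the tweet's point lies inside the bounding box."""
    lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

# Made-up sample tweets: inside the box, outside the box, no geo data.
sample_tweets = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [2.220, 41.410]}},
    {"id": 2, "coordinates": {"type": "Point", "coordinates": [2.170, 41.390]}},
    {"id": 3, "coordinates": None},
]

geo_tweets = [t for t in sample_tweets if has_geo(t)]
near_venue = [t for t in geo_tweets if close_to_venue(t)]
print(len(geo_tweets), len(near_venue))

# In PySpark the two predicates could be chained, e.g.
#   tweets.filter(has_geo).filter(close_to_venue).count()
```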
Well, not many people embed geo information in their tweets: about 3.26% for the whole period and, slightly better, about 4.76% during the conference. But, as you can see, among those with geo data enabled, 173 (63.8%) were actually at or near the event.
From now on we will work mostly with the tweets that were created during the event, so it makes sense to cache that RDD at this point. Of course, it makes almost no difference for such a small dataset, but it helps a lot when you deal with huge datasets.
Hashtags and authors
The big-data analogue of the well-known "Hello world" example is the task of counting words in a text. We will adapt that example and calculate the most frequent hashtags and tweet authors for our dataset.
Let's start with hashtags and calculate the top 25 most popular hashtags in tweets during the event:
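Here is a small sketch of the counting logic in plain Python. It assumes hashtags are read from the tweet's `entities.hashtags` list and lowercased before counting; the sample tweets and helper names are made up for illustration. The closing comment shows a rough word-count style PySpark equivalent.

```python
from collections import Counter

def hashtags(tweet):
    """Return the lowercased hashtags of a tweet (empty list if none)."""
    return [h["text"].lower() for h in tweet.get("entities", {}).get("hashtags", [])]

def top_hashtags(tweets, n=25):
    """Count every hashtag occurrence and return the n most common ones."""
    counts = Counter(tag for tweet in tweets for tag in hashtags(tweet))
    return counts.most_common(n)

# Made-up sample tweets; the second one repeats the hashtag in a different case.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "StrataHadoop"}, {"text": "Hadoop"}]}},
    {"entities": {"hashtags": [{"text": "stratahadoop"}, {"text": "STRATAHADOOP"}]}},
    {"entities": {"hashtags": [{"text": "StrataHadoop"}, {"text": "spark"}]}},
]

top = top_hashtags(sample_tweets, n=25)
print(top)

# A rough PySpark equivalent, in the spirit of the classic word count:
#   from operator import add
#   (tweets.flatMap(hashtags)
#          .map(lambda tag: (tag, 1))
#          .reduceByKey(add)
#          .takeOrdered(25, key=lambda kv: -kv[1]))
```

Note that a hashtag repeated inside one tweet is counted once per occurrence, so the total for a hashtag can exceed the number of tweets.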
As expected, the most popular hashtag is #stratahadoop, which was used to build our dataset ;) We can also see that it appears 5689 times, so somebody was so excited that they used the hashtag several times in the same tweet (for example, with different letter case). Other popular hashtags, such as #hadoop, clearly highlight the main topic of the conference.
An analysis of popular hashtags and top authors can characterize the topics of a conference quite well and could easily be used as part of a recommendation system to suggest similar events.
With D3 and d3-cloud it's fairly easy to visualise the hashtags as a word cloud, with each word sized according to its count.
Now, let's do the same analysis for authors:
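A sketch of the author count, assuming the author is identified by `user.screen_name`. The sample tweets and the helper name `top_authors` are hypothetical.

```python
from collections import Counter

def top_authors(tweets, n=25):
    """Return the n authors with the most tweets as (screen_name, count) pairs."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    return counts.most_common(n)

# Made-up sample tweets.
sample_tweets = [
    {"user": {"screen_name": "strataconf"}},
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "strataconf"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "strataconf"}},
]

authors = top_authors(sample_tweets)
print(authors)

# In PySpark, counting distinct values could also be done with countByValue:
#   tweets.map(lambda t: t["user"]["screen_name"]).countByValue()
```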
The official account of the conference, @strataconf, tweeted a lot, but probably it was mostly retweets? Let's check that too:
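One possible way to check this is sketched below. It assumes a retweet is recognizable by the presence of the `retweeted_status` field, which holds the original tweet together with its author. The function `retweet_stats` and the sample data are made up for illustration.

```python
def retweet_stats(tweets, account):
    """Split an account's activity into original tweets, retweets made by it,
    and retweets of its tweets made by other users."""
    account = account.lower()
    stats = {"original_by": 0, "retweets_by": 0, "retweets_of": 0}
    for t in tweets:
        author = t["user"]["screen_name"].lower()
        original = t.get("retweeted_status")  # present only on retweets
        if author == account:
            stats["retweets_by" if original else "original_by"] += 1
        elif original and original["user"]["screen_name"].lower() == account:
            stats["retweets_of"] += 1
    return stats

# Made-up sample tweets.
sample_tweets = [
    {"user": {"screen_name": "strataconf"}},
    {"user": {"screen_name": "strataconf"},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
    {"user": {"screen_name": "bob"},
     "retweeted_status": {"user": {"screen_name": "strataconf"}}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "carol"},
     "retweeted_status": {"user": {"screen_name": "StrataConf"}}},
]

stats = retweet_stats(sample_tweets, "strataconf")
print(stats)
```

Separating the three counters makes it possible to tell whether an account mostly retweets others or is mostly retweeted by others.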
Actually, we were right. The official Twitter account of Strata+Hadoop, @strataconf, was retweeted a lot, but it also created the largest number of tweets during the event according to our dataset.
Tweets distribution by hour
Finally, let's look at how tweets were distributed over time during the conference and calculate a histogram. As a bucket we will use a 1-hour interval:
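A sketch of the bucketing, assuming the `created_at` string uses the Twitter REST API format and that the buckets are kept in UTC (shift them by one hour for Barcelona local time in November). The sample tweets are made up.

```python
from collections import Counter
from datetime import datetime

def hour_bucket(tweet):
    """Truncate the tweet's creation time to the start of its hour (UTC)."""
    ts = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return ts.replace(minute=0, second=0, microsecond=0)

def hourly_histogram(tweets):
    """Return (hour, count) pairs sorted by time."""
    counts = Counter(hour_bucket(t) for t in tweets)
    return sorted(counts.items())

# Made-up sample tweets.
sample_tweets = [
    {"created_at": "Thu Nov 20 09:05:12 +0000 2014"},
    {"created_at": "Thu Nov 20 09:48:30 +0000 2014"},
    {"created_at": "Thu Nov 20 10:02:00 +0000 2014"},
]

histogram = hourly_histogram(sample_tweets)
for hour, count in histogram:
    print(hour.isoformat(), count)

# In PySpark the same idea could be expressed as
#   tweets.map(hour_bucket).countByValue()
```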
With D3 it's now easy to visualize such a histogram as a bar plot:
From the diagrams we can clearly see 3 clusters that correspond to the days of the conference. The first day of Strata+Hadoop World Barcelona was a training day, which explains why the activity on Twitter is more homogeneous and not so high. The other days have strong outliers around 9am-10am which, according to the conference schedule, correspond to the keynote presentations given by well-known speakers.
Finally, let's display the profile images of authors who posted or retweeted about the conference with the hashtag #StrataHadoop. Since the number of such users is large, we will consider only authors who tweeted or retweeted at least 10 times and then randomly display 10 of them:
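The selection could be sketched as follows, assuming the image address is taken from `user.profile_image_url`. The function name, the `seed` parameter and the example.com URLs are made up for illustration; rendering the images is left out.

```python
import random
from collections import Counter

def active_author_images(tweets, min_tweets=10, sample_size=10, seed=None):
    """Pick a random sample of authors with at least min_tweets tweets and
    return (screen_name, profile_image_url) pairs."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    images = {t["user"]["screen_name"]: t["user"]["profile_image_url"] for t in tweets}
    active = sorted(name for name, c in counts.items() if c >= min_tweets)
    rng = random.Random(seed)  # pass a seed for a reproducible selection
    chosen = rng.sample(active, min(sample_size, len(active)))
    return [(name, images[name]) for name in chosen]

# Made-up data: 12 authors with 10 tweets each and 3 authors with 2 tweets each.
sample_tweets = []
for i in range(15):
    n_tweets = 10 if i < 12 else 2
    user = {
        "screen_name": f"user{i}",
        "profile_image_url": f"https://example.com/user{i}.png",  # hypothetical URL
    }
    sample_tweets.extend({"user": user} for _ in range(n_tweets))

picked = active_author_images(sample_tweets, seed=42)
for name, url in picked:
    print(name, url)
```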
Apache Spark and D3 give us a very comfortable way to take a first look at the data. They also have very nice features that we don't show in this article and that make them a really good solution for data analysis. Of course, there is no universal solution and every researcher should consider other tools and technologies. For instance, because our data can easily be represented as a time series and is not so big, we can also use Elasticsearch and Kibana, which provide an easy way to query and visualize such data (I will show how to use them in the next article).
The IPython Notebook StrataHadoop.ipynb with PySpark examples can be downloaded from our GitHub repository.
- TwitterSearch is a library to easily iterate tweets found by the Twitter API
- Apache Spark is a fast and general engine for large-scale data processing.
- Elasticsearch is a powerful open source search and analytics engine that makes data easy to explore
- Kibana is an interactive tool to visualize logs and time-stamped data