A 30-second ad in this year's Super Bowl cost approximately $4.5 million. Clearly, companies place a great deal of value on the attention of the fans. This is not surprising, as this year's Super Bowl drew more than 114 million viewers. Thanks to the popularity of social media and the huge quantity of data generated by applications such as Twitter, we now have new opportunities to infer where viewers' attention is focused at different times throughout large-scale media events, and what factors influence it. In this project, I mine Twitter data for new insights into how people's attention and interests vary over the course of such an event, and which factors are most influential.
I downloaded tweets using the Twitter Streaming API with the help of the Python package Tweepy, and stored them in an SQLite database. I used a filter so that I only capture tweets containing the word "superbowl". I also restricted the database to tweets that occurred during the game itself. After doing this, I am left with about 1.5 million tweets.
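The streaming step itself requires Tweepy and API credentials, so here is a minimal sketch of just the filter-and-store logic, using hypothetical tweet dicts in place of live statuses; the table schema and field names are assumptions, not the actual ones used:

```python
import sqlite3

def store_tweet(conn, tweet):
    """Insert a tweet only if it mentions the game keyword."""
    if "superbowl" not in tweet["text"].lower():
        return False
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, created_at, text) VALUES (?, ?, ?)",
        (tweet["id"], tweet["created_at"], tweet["text"]),
    )
    return True

# In-memory database standing in for the real on-disk one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT, text TEXT)")

sample = [
    {"id": 1, "created_at": "2015-02-01 23:31:00", "text": "SuperBowl kickoff!"},
    {"id": 2, "created_at": "2015-02-01 23:32:00", "text": "unrelated tweet"},
]
for t in sample:
    store_tweet(conn, t)

print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # → 1
```

In the real pipeline the keyword filter is also applied server-side by the Streaming API, so this check only guards against stray statuses.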
Now, I plot histograms of the occurrence of various brand names in tweets over time. Each commercial airing can easily be picked out by eye. I have chosen to show tweets containing "microsoft", "budweiser", "doritos", and "fiat" as examples. Using the radio buttons at the top of the plot, you can choose which data to display, and whether you want to see stacked or grouped histograms. This plot highlights how strongly the Super Bowl audience engages with the ads, and the fine temporal detail that we get from such a large dataset.
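The underlying binning can be sketched as follows, with invented (minute, text) pairs standing in for the real database rows:

```python
from collections import Counter

BRANDS = ["microsoft", "budweiser", "doritos", "fiat"]

def brand_counts(tweets, bin_minutes=1):
    """Return {brand: Counter(time_bin -> mentions)} for histogram plotting."""
    counts = {b: Counter() for b in BRANDS}
    for minute, text in tweets:
        low = text.lower()
        for b in BRANDS:
            if b in low:
                counts[b][minute // bin_minutes] += 1
    return counts

# Hypothetical rows: minutes after kickoff, tweet text
tweets = [(12, "That Doritos ad!"), (12, "doritos again"), (45, "new Fiat spot")]
hist = brand_counts(tweets)
print(hist["doritos"][12], hist["fiat"][45])  # → 2 1
```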
We look for bursts of activity on Twitter related to gameplay events like touchdowns and field goals. By doing so, we can infer when touchdowns and field goals occur and which team is scoring, giving us a timeline for the event and a final score.
This is done by first considering all tweets containing the word "touchdown", and either "boston", "new england", "patriots", "seattle", or "seahawks". We place each tweet in a 2D parameter space, with one dimension being the time at which the tweet occurs, and the other being the "team" dimension. A tweet is at position -d in the "team" dimension if it contains "seattle" or "seahawks", and +d if it contains "boston", "new england", or "patriots". We then use the DBSCAN unsupervised clustering algorithm to find clusters of characteristic radius 6 minutes, and we choose d to be large enough so that tweets from different teams are widely spaced.
This process is repeated for tweets containing the phrase "field goal".
An advantage of DBSCAN over other clustering algorithms such as K-means is that we do not need to specify the number of clusters, so we don't need to have any prior knowledge about the total number of touchdowns or field goals. Histograms of tweet counts for both "touchdown" tweets and "field goal" tweets are displayed below, with clusters differentiated by color. Yellow vertical lines denote the actual times at which events occur.
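A minimal sketch of this clustering, assuming scikit-learn and a handful of invented tweet times; the values of d, eps, and min_samples here are illustrative, not the tuned ones:

```python
import numpy as np
from sklearn.cluster import DBSCAN

d = 100.0   # team-axis separation, large compared to the 6-minute radius
eps = 6.0   # characteristic cluster radius, in minutes

# Hypothetical "touchdown" tweets: times in minutes after kickoff, and
# team = +1 for Patriots keywords, -1 for Seahawks keywords
times = np.array([10.1, 10.5, 11.0, 62.0, 62.4, 63.1])
teams = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
X = np.column_stack([times, teams * d])

labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)

# One event per cluster; estimate its time from the 5th-percentile tweet
# time, biasing toward the early tail, which is closest to the actual event
for lab in sorted(set(labels) - {-1}):
    cluster_times = times[labels == lab]
    print(lab, round(float(np.percentile(cluster_times, 5)), 2))
```

Because the team axis is scaled by d, tweets about different teams can never fall within eps of one another, so each cluster belongs to exactly one team.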
By counting clusters, we infer the correct final score of Patriots: 28, Seahawks: 24.
We can compare the measured times at which touchdowns and field goals occur with the actual times. The measured time is computed by taking the 5th-percentile time value for each cluster. This biases us toward the early tail of the distribution, which is presumably closest to the actual event. Interestingly, we find that the signal in the Twitter data lags behind the actual events by ~1.5-3.25 minutes. Presumably this is the timescale for someone to react to an event and submit a tweet. The results are tabulated below. Times are given in minutes after kickoff.
| actual time | calculated time | team | time delay |
| --- | --- | --- | --- |
We can look at how each state responds to touchdowns. While only a small fraction of tweets carry geolocation data, we can use the location listed in the user profile to infer the state in which a tweet originates. Each frame of the following animation shows the number of tweets containing the word "touchdown", normalized by the population of the state. We can see not only the flashes of activity when each touchdown is scored, but also the geographic variation in response according to which team is scoring.
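A sketch of the state-inference step; the lookup table and population figures below are truncated, approximate examples, not the full mapping used:

```python
# Map free-text profile-location tokens to states (truncated example)
STATES = {"wa": "Washington", "washington": "Washington", "seattle": "Washington",
          "ma": "Massachusetts", "massachusetts": "Massachusetts",
          "boston": "Massachusetts"}
POPULATION = {"Washington": 7.0e6, "Massachusetts": 6.7e6}  # approximate

def infer_state(profile_location):
    """Return a state name if any known token appears, else None."""
    for tok in profile_location.lower().replace(",", " ").split():
        if tok in STATES:
            return STATES[tok]
    return None

counts = {}
for loc in ["Seattle, WA", "Boston MA", "somewhere else"]:
    state = infer_state(loc)
    if state:
        counts[state] = counts.get(state, 0) + 1

# Normalize by state population for the per-capita animation frames
per_capita = {s: n / POPULATION[s] for s, n in counts.items()}
print(counts)  # → {'Washington': 1, 'Massachusetts': 1}
```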
We find out what general categories of topics people are tweeting about during the Super Bowl, using natural language processing and clustering. We begin by finding the ten most tweeted words, ignoring stopwords like "and", "or", "the", etc. We then group all the tweets into ten-second increments. Within each time increment, we compute the tf-idf for each of the ten words. We then use the K-means clustering algorithm with k=3 to find the three categories of tweets that people tend to focus on in any given ten-second increment. The top word(s) from each cluster are:
We can also plot separate histograms of the data points for each cluster, binning in time. By doing so we can see patterns in where the clusters occur in time. The first quarter of the game is mostly dominated by tweets about the Seahawks and Patriots, indicating that viewer attention is mostly focused on the game itself early on. However, attention shifts to the commercials by the end of the first quarter and stays there until halftime. At this point, it shifts again to Katy Perry and the halftime show, and mostly stays there for the rest of the game, except for a pocket of interest in the teams when Seattle scores at 150 minutes, and at the end of the game. Advertisers may conclude from this that commercials airing in the second quarter are likely to receive the most attention.
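The tf-idf and K-means steps above can be sketched with scikit-learn; the ten-second "documents" and the small vocabulary below are toy stand-ins for the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One "document" per ten-second bin: all tweet text in that bin concatenated
bins = [
    "patriots touchdown patriots",
    "seahawks touchdown seahawks",
    "doritos commercial doritos ad",
    "budweiser commercial ad",
    "halftime katy perry halftime",
    "katy perry show",
]
# Stand-in for the ten most-tweeted non-stopwords
vocab = ["patriots", "seahawks", "touchdown", "commercial",
         "ad", "katy", "perry", "halftime", "doritos"]

X = TfidfVectorizer(vocabulary=vocab).fit_transform(bins)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Top word per cluster = highest-weight vocabulary term in each centroid
for c in km.cluster_centers_:
    print(vocab[c.argmax()])
```

With real data the three clusters separate into game, commercial, and halftime-show topics, as discussed above.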
Let's see if we can find "trending" hashtags from the data itself. We are most interested in topics that see a sharp rise in tweet counts. We find the 50 most popular hashtags, then count the frequency of each hashtag in ten-minute intervals. We then consider topics whose count is very close to zero for at least one time interval: a hashtag that is popular overall yet nearly absent for part of the game must have experienced a sharp burst of interest. This is very useful because it can alert us to bursts of interest that we would not otherwise have been aware of.
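A stdlib sketch of this heuristic, with hypothetical (interval, hashtag) pairs; expressing "very close to zero" as a fraction of each hashtag's peak count is an assumption here:

```python
from collections import Counter

def bursty_hashtags(tagged, n_top=50, zero_frac=0.05):
    """Among the n_top most popular hashtags, return those whose count is
    near zero (<= zero_frac of their peak) in at least one interval."""
    totals = Counter(tag for _, tag in tagged)
    top = [t for t, _ in totals.most_common(n_top)]
    intervals = sorted({i for i, _ in tagged})
    per_interval = {t: Counter() for t in top}
    for i, tag in tagged:
        if tag in per_interval:
            per_interval[tag][i] += 1
    bursty = []
    for t in top:
        peak = max(per_interval[t][i] for i in intervals)
        if any(per_interval[t][i] <= zero_frac * peak for i in intervals):
            bursty.append(t)
    return bursty

# Hypothetical data: a steady hashtag and one absent until interval 2
tagged = ([(0, "#superbowl"), (1, "#superbowl"), (2, "#superbowl")] * 5
          + [(2, "#katyperry")] * 10)
print(bursty_hashtags(tagged))  # → ['#katyperry']
```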
We can try to measure people's opinion of the ads aired during the game by performing a sentiment analysis on the content of their tweets. From this we can derive a measure of how much people liked a given ad, and compare the results to the USA Today survey results published here.
For each brand name of interest, we collect all tweets containing the brand name, and use the sentiment analysis from the TextBlob natural language processing library. This is, essentially, a bag-of-words model. We consider only brands mentioned in at least 2000 tweets. We plot the USA Today rating vs our sentiment rating, and find a significant correlation. A simple linear fit yields an R^2 of 0.24, but this can be increased to 0.5 by including the quarter in which the ad aired as a covariate.
The advantage of measuring sentiment via the Twitter stream instead of through surveys is that not only can you avoid the cost of administering a survey, but you can also get results in real time.