Data Munging
The data was not organized in the form we needed in order to consume it. We
passed lists of films, organized by date, through a REST service built to query
IMDb. The result is a matrix of features that we used in our analysis. We used
a combination of logic written in Python and Java to clean and organize the
results.
The code for extraction is available here: Data Munging
Our Data Sets
title | year | released | genre | rated | runtime | language | director | writer | metascore | rating | votes | budget | gross |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Danny Collins | 2015 | Apr | Comedy | R | 106 | English | Dan Fogelman | Dan Fogelman | 58 | 71 | 18347 | 10000000 | 5348317 |
Project Almanac | 2015 | Jan | Sci-Fi | PG-13 | 106 | English | Dean Israelite | Jason Pagan | 47 | 64 | 52591 | 12000000 | 22331028 |
Fifty Shades of Grey | 2015 | Feb | Drama | R | 125 | English | Sam Taylor-Johnson | Kelly Marcel | 46 | 41 | 209678 | 40000000 | 166167230 |
Blackhat | 2015 | Jan | Action | R | 133 | English | Michael Mann | Morgan Davis Foehl | 51 | 54 | 35637 | 70000000 | 7097125 |
Stonewall | 2015 | Sep | Drama | R | 129 | English | Roland Emmerich | Jon Robin Baitz | 30 | 41 | 1176 | 13500000 | 186354 |
How to Be Single | 2016 | Feb | Comedy | R | 110 | English | Christian Ditter | Abby Kohn | 51 | 63 | 5292 | 38000000 | 18750000 |
Jupiter Ascending | 2015 | Feb | Action | PG-13 | 127 | English | Andy Wachowski, Lana Wachowski | Andy Wachowski | 40 | 54 | 129577 | 176000000 | 47387723 |
The Gallows | 2015 | Jul | Horror | R | 81 | English | Travis Cluff, Chris Lofing | Travis Cluff | 30 | 43 | 11177 | 100000 | 22757819 |
Shaun the Sheep Movie | 2015 | Aug | Animation | PG | 85 | N/A | Mark Burton, Richard Starzak | Mark Burton | 81 | 74 | 20038 | 25000000 | 19321230 |
Final Girl | 2015 | Aug | Action | R | 90 | English | Tyler Shields | Adam Prince | N/A | 46 | 5533 | 8000000 | 5500000 |
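
The rows above mix numeric strings, thousands separators, and `N/A` placeholders, so some normalization is needed before analysis. The following is a minimal sketch of that kind of cleanup, not the project's actual code: the helper names `parse_int` and `clean_row` and the choice of `None` for missing values are assumptions made for illustration.

```python
from typing import Optional

# Columns from the table above that hold whole numbers.
NUMERIC_COLUMNS = ("year", "runtime", "metascore", "rating", "votes", "budget", "gross")


def parse_int(value: str) -> Optional[int]:
    """Convert a raw field such as '209,678' or 'N/A' to an int, or None if missing."""
    text = value.strip().replace(",", "")
    if text == "" or text.upper() == "N/A":
        return None
    return int(text)


def clean_row(row: dict) -> dict:
    """Return a copy of `row` with whitespace stripped and numeric columns parsed.

    Hypothetical helper: missing values ('N/A') become None in every column.
    """
    cleaned = {}
    for key, value in row.items():
        value = value.strip()
        if key in NUMERIC_COLUMNS:
            cleaned[key] = parse_int(value)
        else:
            cleaned[key] = None if value.upper() == "N/A" else value
    return cleaned
```

For example, `clean_row({"title": "Final Girl", "metascore": "N/A", "votes": "5533"})` keeps the title as text, maps the missing metascore to `None`, and parses the vote count as an integer.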
Collecting Data for Sentiment Analysis from Twitter
To make the analysis more robust, we decided to add a column for each film
that uses sentiment analysis to infer whether social media had any impact on
the film's success. Our list of films was fed through a loop that made
multiple calls to Twitter's REST API. Due to limitations on Twitter data
access, 100 tweets were analyzed for each movie.
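The collection loop described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: `search` is a hypothetical callable standing in for whatever client actually issues the Twitter REST request, and the error handling (skip a film on failure) is a guess.

```python
from typing import Callable, Dict, Iterable, List

TWEETS_PER_FILM = 100  # per-film cap stated in the text


def collect_tweets(
    titles: Iterable[str],
    search: Callable[[str, int], List[str]],
) -> Dict[str, List[str]]:
    """Call `search(title, limit)` once per film and keep at most 100 tweets each.

    `search` is a hypothetical stand-in for the real REST call.
    """
    results: Dict[str, List[str]] = {}
    for title in titles:
        try:
            tweets = search(title, TWEETS_PER_FILM)
        except Exception:
            # Assumption: a failed call should not abort the whole run,
            # so the film is recorded with no tweets.
            tweets = []
        results[title] = list(tweets)[:TWEETS_PER_FILM]
    return results
```

Passing the search function in as a parameter keeps the loop independent of any particular Twitter client and makes it easy to exercise with canned data.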
Positive vs Negative Sentiment
Example Data: Straight Outta Compton (2015)
The tweets were fed through a method in TextBlob that returned a sentiment
score between -1 and 1, based on an implementation of Naive Bayes.
To keep the scoring fair, only unique tweets for each movie were counted,
and sentences with neutral sentiment (usually advertisements or links) were
excluded from the average. The resulting average was then scaled and added as
a column in our matrix of features.
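The aggregation step above can be sketched as follows. This is a hedged illustration rather than the original implementation: the function name `average_sentiment` is hypothetical, "neutral" is assumed to mean a score of exactly 0, and the scale factor of 100 is an assumption because the text does not state how the average was scaled.

```python
from typing import Callable, Iterable, Optional


def average_sentiment(
    tweets: Iterable[str],
    polarity: Callable[[str], float],
    scale: float = 100.0,  # assumed scale factor; not given in the text
) -> Optional[float]:
    """Average polarity over unique, non-neutral tweets, then scale the result."""
    unique = set(t.strip() for t in tweets)     # count each tweet text once
    scores = [polarity(t) for t in unique]
    scores = [s for s in scores if s != 0.0]    # assumed: neutral means exactly 0
    if not scores:
        return None                             # nothing left to average
    return scale * sum(scores) / len(scores)
```

With TextBlob installed, one way to supply the scorer is `lambda t: TextBlob(t).sentiment.polarity`, which returns a value between -1 and 1; any other function with the same shape would work as well.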
Because tweets are unstructured sentences that contain special characters,
misspelled words, and slang, sorting out meaningful sentences was a challenge.
In addition, the TextBlob library failed to provide accurate analysis of
sentences that contain both positive and negative words.
The movie Straight Outta Compton received an average sentiment score of +30.