The data was not organized in the form we needed to consume it.  We passed
lists of films, organized by date, through a REST service built to query IMDb.
The result is a matrix of features that we used in our analysis.  We used
a combination of logic written in Python and Java to clean and organize the
results.

The code for extraction is available here: Data Munging
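
As a rough illustration of how a list of films could be turned into one REST lookup per film, here is a minimal sketch. It is not the authors' extraction code: the endpoint URL and the parameter names "title" and "year" are hypothetical placeholders, since the text does not name the service's interface. The sketch only builds the request URLs and makes no network calls.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the source does not give the REST service's URL.
BASE_URL = "https://example.invalid/lookup"


def build_query_urls(films, base_url=BASE_URL):
    """Return one lookup URL per (title, year) pair.

    The query parameter names "title" and "year" are illustrative
    assumptions, not the real service's API.
    """
    return [
        base_url + "?" + urlencode({"title": title, "year": year})
        for title, year in films
    ]


# Two films taken from the data set, keyed by release year.
films_2015 = [("Danny Collins", 2015), ("Project Almanac", 2015)]
urls = build_query_urls(films_2015)
```

Each URL would then be requested in turn and the response stored as one row of the feature matrix.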

Our Data Sets


title | year | released | genre | rated | runtime | language | director | writer | metascore | rating | votes | budget | gross
Danny Collins | 2015 | Apr | Comedy | R | 106 | English | Dan Fogelman | Dan Fogelman | 58 | 71 | 18347 | 10000000 | 5348317
Project Almanac | 2015 | Jan | Sci-Fi | PG-13 | 106 | English | Dean Israelite | Jason Pagan | 47 | 64 | 52591 | 12000000 | 22331028
Fifty Shades of Grey | 2015 | Feb | Drama | R | 125 | English | Sam Taylor-Johnson | Kelly Marcel | 46 | 41 | 209,678 | 40000000 | 166167230
Blackhat | 2015 | Jan | Action | R | 133 | English | Michael Mann | Morgan Davis Foehl | 51 | 54 | 35637 | 70000000 | 7097125
Stonewall | 2015 | Sep | Drama | R | 129 | English | Roland Emmerich | Jon Robin Baitz | 30 | 41 | 1176 | 13500000 | 186354
How to Be Single | 2016 | Feb | Comedy | R | 110 | English | Christian Ditter | Abby Kohn | 51 | 63 | 5292 | 38000000 | 18750000
Jupiter Ascending | 2015 | Feb | Action | PG-13 | 127 | English | Andy Wachowski, Lana Wachowski | Andy Wachowski | 40 | 54 | 129577 | 176000000 | 47387723
The Gallows | 2015 | Jul | Horror | R | 81 | English | Travis Cluff, Chris Lofing | Travis Cluff | 30 | 43 | 11177 | 100000 | 22757819
Shaun the Sheep Movie | 2015 | Aug | Animation | PG | 85 | N/A | Mark Burton, Richard Starzak | Mark Burton | 81 | 74 | 20,038 | 25000000 | 19321230
Final Girl | 2015 | Aug | Action | R | 90 | English | Tyler Shields | Adam Prince | N/A | 46 | 5533 | 8000000 | 5500000
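
Some values in the rows above need cleaning before analysis: a few vote counts carry thousands separators, and missing values appear as N/A. The following is a minimal sketch of that kind of cleanup, not the authors' Python/Java code. The function names and the choice to represent missing values as None are assumptions made for illustration.

```python
NUMERIC_FIELDS = ("year", "runtime", "metascore", "rating",
                  "votes", "budget", "gross")


def to_int(value):
    """Convert a raw numeric field to int; return None for "N/A" or empty."""
    text = str(value).strip().replace(",", "")
    if text == "" or text.upper() == "N/A":
        return None
    return int(text)


def clean_record(record, numeric_fields=NUMERIC_FIELDS):
    """Return a copy of record with its numeric fields converted to int."""
    cleaned = dict(record)
    for field in numeric_fields:
        if field in cleaned:
            cleaned[field] = to_int(cleaned[field])
    return cleaned


# Raw values copied from two rows of the data set.
fifty_shades = clean_record({"title": "Fifty Shades of Grey",
                             "metascore": "46", "votes": "209,678"})
final_girl = clean_record({"title": "Final Girl",
                           "metascore": "N/A", "votes": "5533"})
```

Keeping missing values as None, rather than 0, avoids silently pulling averages down when a film has no Metascore.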


Collecting Data for Sentiment Analysis from Twitter


 To make the analysis more robust, we decided to add a column for each film
 that uses sentiment analysis to infer whether social media had any impact on
 the film's success.  Our list of films was fed through a loop that made
 multiple REST calls to Twitter's REST API. Due to limitations on Twitter data
 access, 100 tweets were analyzed per movie.
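
The collection loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `fetch_tweets(query, count)` is a hypothetical callable standing in for the actual Twitter REST call, which the text does not show, and a stand-in fetcher is used so the sketch runs without network access.

```python
def collect_tweets(films, fetch_tweets, per_film=100):
    """Return {title: [tweet text, ...]} with at most per_film tweets each.

    fetch_tweets(query, count) is a hypothetical stand-in for the real
    Twitter REST request.
    """
    collected = {}
    for title in films:
        tweets = fetch_tweets(title, per_film)
        # Enforce the cap even if the fetcher returns more than requested.
        collected[title] = list(tweets)[:per_film]
    return collected


def fake_fetch(query, count):
    """Stand-in fetcher that fabricates placeholder tweet texts."""
    return [f"{query} tweet {i}" for i in range(count)]


result = collect_tweets(["Straight Outta Compton"], fake_fetch, per_film=3)
```

Passing the fetcher in as an argument keeps the loop independent of any particular Twitter client library.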

Positive vs Negative Sentiment


Example Data: Straight Outta Compton (2015)



 The tweets were fed through a method in TextBlob that returned a sentiment
 analysis, based on an implementation of Naive Bayes, as a score between -1
 and 1.
 To keep the scoring fair, only unique tweets were counted for each movie,
 and sentences with neutral sentiment (usually advertisements or links) were
 excluded from the average. The remaining scores were then averaged, scaled,
 and added as a column in our matrix of features.
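
The aggregation step above (deduplicate, drop neutral tweets, average, scale) can be sketched as below. This is a minimal sketch, not the authors' implementation. To stay self-contained it takes precomputed polarity scores; in practice each score would come from TextBlob, for example its `sentiment.polarity` attribute. The scale factor of 100 is an assumption, since the text does not state how scores were scaled, and the sample tweets are invented for illustration.

```python
def average_sentiment(scored_tweets, scale=100):
    """Average polarity over unique, non-neutral tweets, then scale.

    scored_tweets: iterable of (text, polarity), polarity in [-1, 1].
    Returns None when no tweet carries a non-neutral score.
    The default scale of 100 is an assumption, not taken from the source.
    """
    seen = set()
    polarities = []
    for text, polarity in scored_tweets:
        if text in seen:
            continue  # count each unique tweet only once
        seen.add(text)
        if polarity == 0:
            continue  # neutral tweets do not count toward the average
        polarities.append(polarity)
    if not polarities:
        return None
    return scale * sum(polarities) / len(polarities)


# Invented sample data: one duplicate and one neutral tweet.
sample = [
    ("Loved it", 0.5),
    ("Loved it", 0.5),
    ("Tickets here: link", 0.0),
    ("Too long", -0.25),
]
score = average_sentiment(sample)
```

Here only "Loved it" (once) and "Too long" contribute, so the average is taken over two tweets rather than four.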




 Social media tweets are unstructured sentences that contain special
 characters, misspelled words, and slang, so sorting out meaningful sentences
 was a challenge. The TextBlob library also failed to provide accurate analysis
 of sentences that contain both positive and negative words.




 The movie Straight Outta Compton received an average sentiment score of +30.