Predicting successful movies on IMDB

To help better assess which movies to bring into the Netflix streaming catalog, I have pulled data for the best and worst movies in the IMDB database and created two models from that data to predict ratings for other movies. A total of 350 records were downloaded: the top 250, and the bottom 100.

I collected data from both the IMDB website and the OMDB API, including columns for:

Ratings from:
- IMDB
- Metacritic
- Rotten Tomatoes
Number of IMDB user ratings
Number of user reviews on IMDB
Number of critic reviews on IMDB
Gross revenue
Actors
Awards
Country
DVD release data
Director
Genre
Languages
Plot synopsis
Producer
Release date
Running time
Writer
Year

Many of the columns (Plot and Actors, for example) were parsed for key word entries. I kept up to 50 terms from each key word search, and created new features recording the presence of those key words and the column it was found in (e.g. “director_sidney lumet” and “country_france”).

Feature Engineerings

v = CountVectorizer(
            binary= True,
            token_pattern= u'(?u)\w+.?\w?.? \w+',
            max_features=50
    )
data = v.fit_transform(X.Actors).todense()
names = v.get_feature_names()

temp_df = pd.DataFrame(data,columns=names)

for i in temp_df.columns:
    X.insert(len(X.columns)-1,"actor_"+i,temp_df[i].values)
     


v = CountVectorizer(
        binary= True,
        token_pattern= '(?u)\w+.?\w+',
        max_features=50
    )
data = v.fit_transform(X.Country).todense()
names = v.get_feature_names()

temp_df = pd.DataFrame(data,columns=names)

for i in temp_df.columns:
    X.insert(len(X.columns)-1,"country_"+i,temp_df[i].values)
    
v = CountVectorizer(
            binary= True,
            max_features=50
    )
data = v.fit_transform(X.Plot).todense()
names = v.get_feature_names()

temp_df = pd.DataFrame(data,columns=names)

for i in temp_df.columns:
    X.insert(len(X.columns)-1,"plot_"+i,temp_df[i].values)   

Regression model

Performance

I first ran a random forest regressor on the data to create a predictive model for the IMDB rating based on the other data collected. The performance of the regressor can be seen in the figure below. The good and bad movie clusters are clearly visible, and largely predicted accurately. There are two points that were not fit well, however, indicating either the failure of the model, or the fickleness of the reviewers.

I ran a grid search over several parameters for a random forest model. The best model used 100 trees with 5 levels of depth and automatic feature selection. The R-squared performance metric for this is 0.9622 +/- 0.0244 across a cross validation of 3 KFolds, and 0.9496 on the holdout set (20% of the data).

plt.subplots(figsize=(12,8))
plt.scatter(y_test,randforest_pred)
plt.plot((min(y_test),max(y_test)),(min(y_test),max(y_test)), c='k')
plt.xlabel("Actual Values",fontsize=16)
plt.ylabel("Predicted Values",fontsize=16)
plt.title("Regression performance",fontsize=20)
plt.show()

png

Important features

Also shown in the results of the regression model are the relative importances of the features. The top five most important features in the model are shown in the table below. The number of ratings on IMDB is the strangest predictor in this set, followed by the ratings on other sites, the year of release, and the running time of the movie.

pd.DataFrame(zip(X_train.columns,randforest_gs.best_estimator_.steps[1][1].feature_importances_)\
                           ,columns=['feat','import']).sort_values('import',ascending=False).head()

	feat	import
3	imdbVotes	0.848560
4	Rotten Tomatoes	0.070706
0	Metascore	0.068943
2	Year	0.002933
1	Runtime	0.001881

Classification model

Of course the goal isn’t to rate the movies we consider, but rather to predict whether they are popular or not. To answer this question, I made a classification model with a random forest classifier. The fitting procedure is nearly identical. The same features were used, but I replaced the IMDB rating with a flag (1 for movies from the top movies list and 0 for the worst movies) as the target.

Performance

Accuracy becomes the metric for classification problems, and several items from my grid search achieved the highest average score, but the gs_best_predictor selected had higher variance, so the other models should also be explored.

Regardless, I was able to achieve an accuracy of 0.97 on the test set, and the ROC and PR curve areas are both very high. The confusion matrix shows that only 2 of the 70 holdout movies were classified incorrectly.

rf_class_gs.score(X_test_class,y_test_class)

0.97142857142857142

print classification_report(y_test_class,rf_class_pred)
print ""
print pd.DataFrame(confusion_matrix(y_test_class,rf_class_pred,labels=[1,0]),\
                   columns=['pred_1','pred_0'],index=['is_1','is_0'])


# use the predicted probablitites to calculate AUC of ROC and PRCurve
Y_score = rf_class_gs.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test_class, Y_score)
ROC_AUC = auc(FPR, TPR)

PREC, REC, _ = precision_recall_curve(y_test_class, Y_score)
PR_AUC = auc(REC, PREC)

# Plot of a ROC curve for class 1 (has_cancer)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive or Recall Rate', fontsize=18)
plt.ylabel('True Positive or Precision Rate', fontsize=18)
plt.title('Random Forest for Loan Approval', fontsize=18)
plt.legend(loc="lower right")
plt.show()

             precision    recall  f1-score   support

          0       1.00      0.90      0.95        20
          1       0.96      1.00      0.98        50

avg / total       0.97      0.97      0.97        70


      pred_1  pred_0
is_1      50       0
is_0       2      18

png

Feature importances

Several of the same features are important to the classifier as were important to the regressor. The number of IMDB votes in particular is still the most important feature. The number of reviews by critics and users are next, followed by the running time and then scores on other review sites. An interesting difference between the regressor and classifier is the reduced importance of outside ratings in making classification decisions.

print ""
print pd.DataFrame(zip(rf_class_gs.cv_results_['mean_test_score'],\
                 rf_class_gs.cv_results_['std_test_score']\
                ), columns=["score","score_err"]).sort_values("score",ascending=False).head()
print ""
rf_imps = pd.DataFrame(zip(X_train_class.columns,rf_class_gs.best_estimator_.steps[1][1].feature_importances_)\
                           ,columns=['feat','import'])

print rf_imps.sort_values('import',ascending=False).head(10)

       score  score_err
0.989247   0.008710
0.989247   0.008803
0.989247   0.008803
0.989247   0.008803
0.985663   0.013536

                    feat    import
            imdbVotes  0.178194
  NumberCriticReviews  0.156770
    NumberUserReviews  0.124143
              Runtime  0.092799
      Rotten Tomatoes  0.061476
        genre_drama  0.051764
            Metascore  0.040188
                Gross  0.039270
       genre_horror  0.028926
       genre_comedy  0.019035

Next steps

First, there are several models that have very similar performance that I would like to explore further. I expect these other models to confirm the story seen in the two models presented here, or that there is another compelling message hidden in the data.

I am very interested in pulling more information from IMDB about the reviews submitted by users. In particular, I think that the proportion of users finding a review useful will become a strong information stream by which to judge users and reviews submitted for each movie. It may even be that several key user accounts can be identified as strategic markers on movie popularity… allowing for new movies to be classified before a critical mass of users have viewed and rated the movies on IMDB.

Written on April 22, 2017