This project explores how a Disney movie's rating impacts its profitability.
The dataset provides information about 579 movies which were produced by the Disney company between 1937 and 2016. It gives the title of the movie, its release date, genre and rating. It also includes the total gross made by the movie and the inflation adjusted gross.
The end goal of this project is to make a recommendation to the Disney company on which rating helps the company earn more money, that is, whether there is any correlation between a movie's rating and its gross.
The data is read from the file titled 'disney_movies_total_gross.csv'.
#Load the necessary libraries and read the data from the file titled 'disney_movies_total_gross.csv'.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
gross = pd.read_csv('disney_movies_total_gross.csv', parse_dates=['release_date'])
gross.head(10)
| | movie_title | release_date | genre | mpaa_rating | total_gross | inflation_adjusted_gross |
|---|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 184925485 | 5228953251 |
| 1 | Pinocchio | 1940-02-09 | Adventure | G | 84300000 | 2188229052 |
| 2 | Fantasia | 1940-11-13 | Musical | G | 83320000 | 2187090808 |
| 3 | Song of the South | 1946-11-12 | Adventure | G | 65000000 | 1078510579 |
| 4 | Cinderella | 1950-02-15 | Drama | G | 85000000 | 920608730 |
| 5 | 20,000 Leagues Under the Sea | 1954-12-23 | Adventure | NaN | 28200000 | 528279994 |
| 6 | Lady and the Tramp | 1955-06-22 | Drama | G | 93600000 | 1236035515 |
| 7 | Sleeping Beauty | 1959-01-29 | Drama | NaN | 9464608 | 21505832 |
| 8 | 101 Dalmatians | 1961-01-25 | Comedy | G | 153000000 | 1362870985 |
| 9 | The Absent Minded Professor | 1961-03-16 | Comedy | NaN | 25381407 | 310094574 |
Since the movies' release dates encompass such a large span of time, the inflation adjusted gross column will be used to make comparisons among the movies. It will be referred to as gross, and the total gross column will be removed from the dataset to avoid any confusion.
#Remove 'total_gross' then rename 'inflation_adjusted_gross' as just 'gross'
gross.drop(['total_gross'], axis=1, inplace=True)
gross.rename(columns={'inflation_adjusted_gross':'gross'}, inplace=True)
gross.head()
| | movie_title | release_date | genre | mpaa_rating | gross |
|---|---|---|---|---|---|
| 0 | Snow White and the Seven Dwarfs | 1937-12-21 | Musical | G | 5228953251 |
| 1 | Pinocchio | 1940-02-09 | Adventure | G | 2188229052 |
| 2 | Fantasia | 1940-11-13 | Musical | G | 2187090808 |
| 3 | Song of the South | 1946-11-12 | Adventure | G | 1078510579 |
| 4 | Cinderella | 1950-02-15 | Drama | G | 920608730 |
The following code provides the number of movies released per year.
From the 1930s to the mid-1980s, fewer movies were released per year. Society's norms also change over the years, so audience tendencies from that period may not be reflective of our current society. Since these movies may be irrelevant, we may have to omit those years from the final analysis; we will decide after investigating the data in more detail.
We also see that Disney was more productive during the 1990s and 2000s, so movies produced in those years may play a larger role in our analysis.
#To explore the data by year, extraction of year from the release date is necessary.
gross['year'] = gross['release_date'].dt.year
gross['year'].value_counts().sort_index()
1937     1
1940     2
1946     1
1950     1
1954     1
1955     1
1959     1
1961     3
1962     1
1963     1
1967     1
1968     1
1970     2
1971     1
1975     1
1977     4
1979     1
1980     2
1981     4
1982     3
1983     4
1984     2
1985     6
1986     7
1987    10
1988    12
1989    11
1990    15
1991    16
1992    22
1993    27
1994    30
1995    32
1996    28
1997    23
1998    22
1999    21
2000    19
2001    14
2002    22
2003    19
2004    19
2005    17
2006    19
2007    14
2008    13
2009    16
2010    14
2011    14
2012    10
2013    11
2014    12
2015    11
2016    14
Name: year, dtype: int64
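The per-year counts above can be condensed into per-decade totals, which makes the productivity shift easier to see. A minimal sketch, using a small synthetic stand-in for the dataset's year column:

```python
import pandas as pd

# Hypothetical release years standing in for the dataset's 'year' column.
years = pd.Series([1985, 1987, 1991, 1992, 1993, 2001, 2005, 2011])

# Bucket each year into its decade, then count releases per decade.
decade_counts = (years // 10 * 10).value_counts().sort_index()
print(decade_counts)
```

Applied to the real `gross['year']` column, the same expression would show the jump in output during the 1990s and 2000s at a glance.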
Next, we will shift our focus to cleaning the data.
Exploration of the missing values in the rating column reveals that 56 movies have missing ratings, while 3 movies are labeled 'Not Rated'.
#Exploration of missing values: Find total number of missing values in each column.
#Also find out if rating column has any other labels for missing values. In this case 'Not Rated' was used for missing values.
print(gross.isna().sum())
gross['mpaa_rating'].value_counts()
movie_title      0
release_date     0
genre           17
mpaa_rating     56
gross            0
year             0
dtype: int64
PG           187
PG-13        145
R            102
G             86
Not Rated      3
Name: mpaa_rating, dtype: int64
Movies with missing ratings need to be removed from the dataset, as they contribute nothing to our decision making. The dataset now has 520 movies.
#Removing movies with missing ratings
print(gross.shape)
gross = gross[gross['mpaa_rating']!='Not Rated']
gross = gross.dropna(subset=['mpaa_rating'])
gross.shape
(579, 6)
(520, 6)
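An equivalent single-pass cleanup treats 'Not Rated' as a missing value and drops it together with the NaN entries. A sketch on toy data (not the notebook's DataFrame):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rating column, mixing real ratings, 'Not Rated', and NaN.
df = pd.DataFrame({'mpaa_rating': ['G', 'Not Rated', None, 'PG'],
                   'gross': [10, 20, 30, 40]})

# Map the 'Not Rated' placeholder to NaN, then drop every row with a missing rating.
cleaned = (df.replace({'mpaa_rating': {'Not Rated': np.nan}})
             .dropna(subset=['mpaa_rating']))
print(cleaned['mpaa_rating'].tolist())  # ['G', 'PG']
```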
Before doing any kind of calculations on the data, plotting the average gross per decade for each rating will help us understand whether there is irrelevant information that will skew our analysis.
#Group the dataset by rating. Then calculate the average gross by year for each rating.
group = gross.groupby(['mpaa_rating','year'])['gross'].mean()
rating_yearly = group.reset_index()
rating_yearly.head()
| | mpaa_rating | year | gross |
|---|---|---|---|
| 0 | G | 1937 | 5.228953e+09 |
| 1 | G | 1940 | 2.187660e+09 |
| 2 | G | 1946 | 1.078511e+09 |
| 3 | G | 1950 | 9.206087e+08 |
| 4 | G | 1955 | 1.236036e+09 |
The following plot gives us an insight into how the average gross has fared depending on the rating during the 80 years of Disney making movies.
# Plot the ratings by gross over about 80 years
sns.relplot(kind='line', x='year', y='gross', data=rating_yearly, hue='mpaa_rating')
As mentioned before, only movies made in 1985 or later seem to shed light on how movies are rated and how much gross they make these days. So it would be a good decision to focus only on movies made from 1985 onward.
Just by inspecting the graph, we see that 'G' rated movies used to be making more money than 'PG' or 'PG-13' rated, although it seems to be the other way around lately.
'R' rated movies don't bring in much revenue, which is unsurprising, as the Disney company is known for its family-oriented movies.
We will apply Linear Regression to see if our takeaways hold any truth. First, we will get rid of movies made before 1985.
#Include only movies made in 1985 or later
gross = gross[gross['year']>=1985]
gross['year'].value_counts()
1995    32
1996    28
1994    28
1993    27
1997    23
1992    22
1998    22
2002    21
1999    21
2006    19
2003    19
2000    19
2004    18
2005    17
2009    16
1991    14
1990    14
2016    14
2001    14
2007    14
2010    14
2011    14
2008    13
2014    12
2015    11
2013    11
2012    10
1989     9
1988     7
1987     4
1986     3
Name: year, dtype: int64
#To use ratings in Linear Regression, we need to create binary indicator (dummy) variables for the rating categories.
rating_dummies = pd.get_dummies(gross['mpaa_rating'], drop_first=True)
rating_dummies.head()
| | PG | PG-13 | R |
|---|---|---|---|
| 45 | 0 | 0 | 1 |
| 47 | 0 | 0 | 1 |
| 51 | 0 | 0 | 1 |
| 52 | 0 | 0 | 1 |
| 57 | 0 | 0 | 1 |
#Run the Linear Regression on the dataset's rating and gross data.
regr = LinearRegression()
regr.fit(rating_dummies, gross['gross'])
#Get the coefficients corresponding to the ratings
G = regr.intercept_
PG = regr.coef_[0]
PG_13 = regr.coef_[1]
R = regr.coef_[2]
print('G: {}'.format(round(G,2)))
print('PG: {}'.format(round(PG,2)))
print('PG_13: {}'.format(round(PG_13,2)))
print('R: {} '.format(round(R,2)))
G: 135923947.95
PG: -34680988.7
PG_13: -32975363.95
R: -80618142.4
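One way to sanity-check what these numbers mean: with `drop_first=True`, 'G' is the baseline category, so the intercept is the average gross of 'G' rated movies and each coefficient is the difference between that rating's average and the 'G' average. A minimal sketch on made-up grosses (not the dataset's values):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: two 'G' movies and two 'R' movies with hypothetical grosses.
toy = pd.DataFrame({'mpaa_rating': ['G', 'G', 'R', 'R'],
                    'gross': [100.0, 140.0, 40.0, 60.0]})

X = pd.get_dummies(toy['mpaa_rating'], drop_first=True)  # 'G' becomes the baseline
model = LinearRegression().fit(X, toy['gross'])

print(model.intercept_)  # mean of the 'G' grosses: 120.0
print(model.coef_[0])    # mean('R') - mean('G'): 50.0 - 120.0 = -70.0
```

Reading the real output the same way: 'G' movies average about 136 million, while 'R' movies average about 81 million less than that.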
From the result we see that the Linear Regression supports our impression. The intercept corresponds to 'G' rated movies (the baseline category), and every other rating's coefficient is negative, meaning each of the other ratings earns less on average than 'G'.
Also, as we suspected from the graph, 'R' rated movies have the largest negative coefficient, pointing to a strong negative relationship between an 'R' rating and gross.
'PG' and 'PG-13' rated movies also show negative coefficients relative to 'G', though we need to take a closer look.
In any case, we cannot be entirely sure yet, as the dataset is fairly small and we ran the Regression only once. To make sure, we will use a technique called bootstrapping, where we resample the data and run Linear Regression on it many times to see how much the coefficients fluctuate across the Regressions.
If some coefficients end up being zero, then we can conclude that that particular rating does not affect the revenue.
We will resample the data randomly and run Linear Regression on it 500 times. We will save those 500 coefficients of each rating in 4 NumPy arrays, then look into those numbers both statistically and visually to see if there are any zeroes in them.
The following code creates indices for resampling and 4 empty arrays to save the replicates of coefficients for each rating.
#Create indices for resampling
inds = np.arange(len(gross['mpaa_rating']))
size = 500
#create four empty numpy arrays to save those coefficients produced by the Linear Regression
g_reps = np.empty(shape=size)
pg_reps = np.empty(shape=size)
pg_13_reps = np.empty(shape=size)
r_reps = np.empty(shape=size)
Finally, the following code resamples the indices with replacement 500 times, runs the Linear Regression on each resampled dataset, and saves the resulting coefficients for each rating in the arrays.
#Randomly resample the data 500 times. For each resample, run the Linear Regression and get the coefficients corresponding to the ratings.
for i in range(size):
    bs_inds = np.random.choice(inds, size=len(inds))
    rating = gross.iloc[bs_inds, :]['mpaa_rating']
    in_ad_gross = gross.iloc[bs_inds, :]['gross']
    rating_dummies = pd.get_dummies(rating, drop_first=True)
    regr = LinearRegression()
    regr.fit(rating_dummies, in_ad_gross)
    g_reps[i] = regr.intercept_
    pg_reps[i] = regr.coef_[0]
    pg_13_reps[i] = regr.coef_[1]
    r_reps[i] = regr.coef_[2]
We will take a look at those rating coefficients statistically. For each set, we will compute the 95% confidence interval and see if it includes any zero values.
#Compute 95% intervals to see if they include any zeroes.
confidence_interval_g = np.percentile(g_reps, [2.5, 97.5])
confidence_interval_pg = np.percentile(pg_reps, [2.5, 97.5])
confidence_interval_pg_13 = np.percentile(pg_13_reps, [2.5, 97.5])
confidence_interval_r_reps = np.percentile(r_reps, [2.5, 97.5])
print('confidence_interval for G rated movies: {}'.format(confidence_interval_g))
print('confidence_interval for PG rated movies: {}'.format(confidence_interval_pg))
print('confidence_interval for PG-13 rated movies: {}'.format(confidence_interval_pg_13))
print('confidence_interval for R rated movies: {}'.format(confidence_interval_r_reps))
confidence_interval for G rated movies: [1.08711542e+08 1.69848105e+08]
confidence_interval for PG rated movies: [-70682058.53836423 -2798831.84612276]
confidence_interval for PG-13 rated movies: [-71333137.180662 5183160.02211086]
confidence_interval for R rated movies: [-1.1786631e+08 -5.0322721e+07]
We see that the interval for 'G' rated movies is entirely positive and the interval for 'R' rated movies is entirely negative. The 'PG-13' interval stretches from negative to positive values, so it may contain zero. The 'PG' interval sits just below zero, so we should look at all the values before deciding. We will do this visually in the next part. The following code plots the 500 replicates for each of the four ratings.
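The "does the interval contain zero" check can be written as a small helper. A sketch on synthetic replicates (not the notebook's bootstrap arrays):

```python
import numpy as np

def interval_contains_zero(reps):
    """Return True if the bootstrap 95% percentile interval straddles zero."""
    lo, hi = np.percentile(reps, [2.5, 97.5])
    return lo <= 0 <= hi

# Synthetic replicates: one clearly negative effect, one centered on zero.
rng = np.random.default_rng(0)
negative_reps = rng.normal(loc=-5, scale=1, size=500)
mixed_reps = rng.normal(loc=0, scale=1, size=500)

print(interval_contains_zero(negative_reps))  # False
print(interval_contains_zero(mixed_reps))     # True
```

Run against the actual `g_reps`, `pg_reps`, `pg_13_reps`, and `r_reps` arrays, this would flag 'PG-13' as the rating whose effect may be zero.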
%matplotlib inline
color = ['blue', 'red', 'green', 'orange']
name=['G rated', 'PG rated', 'PG 13 rated', 'R rated']
for i, j in enumerate([g_reps, pg_reps, pg_13_reps, r_reps]):
    plt.scatter(x=np.arange(500), y=j, label=name[i], color=color[i], alpha=0.5)
plt.xlabel('Cases')
plt.ylabel('Coefficient')
plt.legend().get_frame().set_alpha(0.5)
plt.show()
For the 'G' rating, as we suspected, there are no zero values. Since all values are positive, we can infer that a movie being 'G' rated increases its revenue. The reason may be that whole families can go see these movies together, so naturally they pay more.
On the contrary, for 'R' rated movies, there is a negative value in every case and no zeroes, which means in no case was there a lack of relationship. The reason for the negative relationship may be that people do not bring their children to those movies, so they do not pay as much; some people may also prefer other movie companies when they are inclined to watch an 'R' rated movie.
As for 'PG' and 'PG-13' rated movies, we do see values at or near zero in many cases, so for those cases there is little evidence of a relationship between the rating and the revenue a movie brings in. For the rest, negative values outnumber positive ones, but we still cannot draw a clear conclusion.
Finally, considering all the evidence, I would advise the Disney company to make more movies targeting the General audience.
'PG' and 'PG-13' rated movies may occasionally bring in high revenue, but chances are they will not do as well as the 'G' rated ones.
I certainly would not recommend that Disney produce 'R' rated movies, based on the data.