Mar 01, 2022

Movies Correlation Analysis

Python-based statistical analysis of what drives box office revenue, using Pandas, Seaborn, and Matplotlib on a Kaggle movies dataset.

A data analysis project investigating which factors correlate most strongly with a movie’s gross box office earnings, carried out in a Jupyter Notebook with the scientific Python stack.

Key Finding

Budget has the strongest positive correlation with gross revenue. Vote count has the second-strongest correlation, which is initially surprising but logical: the more people who see a film, the more votes it accumulates on review platforms.
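
As a rough illustration of how a ranking like this can be produced, the sketch below orders numeric columns by their Pearson correlation with gross. The toy values and the helper name `rank_by_correlation` are invented for this example, and the column names (`budget`, `votes`, `score`, `gross`) are assumed to match the dataset; this is not the notebook's actual code.

```python
import pandas as pd

# Toy rows invented for illustration only; not taken from the Kaggle dataset.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "votes": [100, 150, 400, 350, 500],
    "score": [6.5, 7.0, 6.0, 7.5, 6.8],
    "gross": [25, 45, 70, 85, 110],
})


def rank_by_correlation(frame, target="gross"):
    """Return Pearson correlations with `target`, strongest first (hypothetical helper)."""
    # Restrict to numeric columns so text columns cannot break the computation.
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    # Order by absolute value so strong negative correlations would also surface.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


ranking = rank_by_correlation(df)
print(ranking)
```

Sorting by absolute value is a design choice: it keeps a strongly negative relationship from being buried at the bottom of the list.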


Dataset

Initial Dataset

Data Types Check

Cleaned Dataset

Sorted by Gross Revenue

Sorted by Gross

Correlation Analysis

Budget vs Gross Scatter

Budget vs Gross with Trend Line

Score vs Gross Correlation

Correlation Matrix

Correlation Matrix Table

Correlation Heatmap

Numerized Correlation Heatmap
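
Plots of this kind can be sketched with Seaborn roughly as follows. The data here is a made-up toy frame, the column names are assumed, and the colours and figure size are arbitrary choices rather than the notebook's settings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows invented for illustration only; not taken from the Kaggle dataset.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "votes": [100, 150, 400, 350, 500],
    "score": [6.5, 7.0, 6.0, 7.5, 6.8],
    "gross": [25, 45, 70, 85, 110],
})

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(12, 5))

# Scatter plot of budget against gross with a fitted linear trend line.
sns.regplot(x="budget", y="gross", data=df, ax=ax_scatter,
            scatter_kws={"color": "red"}, line_kws={"color": "blue"})
ax_scatter.set_title("Budget vs Gross with Trend Line")

# Heatmap of the pairwise Pearson correlations, annotated with the coefficients.
corr = df.corr()
sns.heatmap(corr, annot=True, ax=ax_heat)
ax_heat.set_title("Correlation Heatmap")

fig.tight_layout()
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and the figure displayed inline with `plt.show()`.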


Tech Stack

Layer               Technology
Language            Python
Environment         Jupyter Notebook
Data manipulation   Pandas, NumPy
Visualisation       Seaborn, Matplotlib
Data source         Kaggle: danielgrijalvas/movies

Methodology

  1. Import & inspect — loaded CSV, checked shape, dtypes, and null counts
  2. Clean — converted float columns to integers where appropriate, handled missing values
  3. Explore — sorted by gross revenue to surface top performers
  4. Correlate — generated numeric correlation matrix, then encoded categorical variables for inclusion
  5. Visualise — scatter plots with trend lines, regression plots, correlation heatmap
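
Steps 1 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook itself: the CSV is a tiny invented sample read from a string, the column names are assumed, dropping rows is only one possible way to handle missing values, and category codes are one common way to encode text columns.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the Kaggle CSV; all values are invented.
RAW_CSV = """name,genre,company,budget,gross,votes,score
Film A,Action,Studio X,50000000.0,150000000.0,320000,7.1
Film B,Drama,Studio Y,8000000.0,12000000.0,45000,7.8
Film C,Action,Studio X,,90000000.0,210000,6.4
Film D,Comedy,Studio Z,20000000.0,60000000.0,98000,6.9
"""

# 1. Import & inspect: shape, dtypes, and null counts.
df = pd.read_csv(io.StringIO(RAW_CSV))
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# 2. Clean: drop rows with missing values (an assumed policy),
#    then cast float columns that hold whole amounts to integers.
df = df.dropna().copy()
for col in ["budget", "gross"]:
    df[col] = df[col].astype("int64")

# 3. Explore: highest-grossing films first.
df_sorted = df.sort_values("gross", ascending=False)

# 4. Correlate: numeric columns first, then all columns after encoding text as codes.
numeric_corr = df.select_dtypes("number").corr()

encoded = df.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Each distinct text value becomes an integer code.
    encoded[col] = encoded[col].astype("category").cat.codes
full_corr = encoded.corr()

print(full_corr["gross"].sort_values(ascending=False))
```

Category codes are assigned in sorted order of the labels, so the numbers carry no real ordering. Correlations involving encoded columns such as `genre` or `company` are best read as a rough signal rather than a measured effect.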

Key Skills Demonstrated

  • End-to-end exploratory data analysis workflow in Python
  • Statistical correlation analysis with proper handling of categorical variables
  • Data cleaning and type normalisation with Pandas
  • Clear visual communication of statistical findings with Seaborn/Matplotlib