Instead of building separate scrapers or API integrations for each platform, the pipeline uses a single centralized source: the OMDb API. One API call per movie returns ratings from IMDB, Rotten Tomatoes, and Metacritic together, in a clean, structured response.
For each movie in the target list, the OMDb endpoint is called and the response is parsed into a raw record:
{
"Title": "Inception",
"imdbRating": "8.8", // IMDB, out of 10
"Metascore": "74", // Metacritic, out of 100
"Ratings": [
{"Source": "Rotten Tomatoes", "Value": "87%"}
]
}
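A minimal sketch of this extraction step, assuming the public OMDb endpoint `https://www.omdbapi.com/` with its `t` (title) and `apikey` query parameters. The function names `fetch_movie` and `to_raw_record` are illustrative and not taken from the original pipeline.

```python
import json
import urllib.parse
import urllib.request

OMDB_URL = "https://www.omdbapi.com/"  # assumed endpoint


def fetch_movie(title, api_key):
    """Call OMDb once for a title and return the decoded JSON payload."""
    query = urllib.parse.urlencode({"t": title, "apikey": api_key})
    with urllib.request.urlopen(f"{OMDB_URL}?{query}", timeout=10) as resp:
        return json.load(resp)


def to_raw_record(payload):
    """Flatten an OMDb payload into one row of the raw landing table."""
    # Rotten Tomatoes sits inside the "Ratings" list, not at the top level.
    rt = next(
        (r["Value"] for r in payload.get("Ratings", [])
         if r.get("Source") == "Rotten Tomatoes"),
        None,
    )
    return {
        "Title": payload.get("Title"),
        "IMDB": payload.get("imdbRating"),
        "RottenTomatoes": rt,
        "Metacritic": payload.get("Metascore"),
    }


# Parsing the sample response shown above (no network call needed).
sample = {
    "Title": "Inception",
    "imdbRating": "8.8",
    "Metascore": "74",
    "Ratings": [{"Source": "Rotten Tomatoes", "Value": "87%"}],
}
record = to_raw_record(sample)
```

Values are kept as strings at this stage, so the raw table mirrors what the API returned and all type conversion is left to the transform step.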
This is written to data/raw/movies.csv, the raw landing table:
| Title | IMDB | RottenTomatoes | Metacritic |
|---|---|---|---|
| Inception | 8.8 | 87% | 74 |
| Titanic | 8.0 | 88% | 75 |
| The Avengers | 8.0 | 91% | 69 |
The three platforms use different scales. Everything is normalized to a common 0–100 scale before combining:
• IMDB_norm = IMDB × 10 → 8.8 becomes 88
• RottenTomatoes_norm: strip the %, already 0–100 → 87
• Metacritic_norm: already 0–100 → 74
| Title | IMDB_norm | RottenTomatoes_norm | Metacritic_norm |
|---|---|---|---|
| Inception | 88.0 | 87.0 | 74.0 |
| Titanic | 80.0 | 88.0 | 75.0 |
| The Avengers | 80.0 | 91.0 | 69.0 |
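The normalization rules above can be sketched in plain Python as follows. The helper names `to_number` and `normalize` are hypothetical, and returning `None` for unparseable values (for example a missing rating) is an assumption rather than documented behavior of the original transform step.

```python
def to_number(value):
    """Parse a raw rating string such as "8.8", "87%" or "74"; None if missing."""
    if value is None:
        return None
    text = str(value).strip().rstrip("%")
    try:
        return float(text)
    except ValueError:  # e.g. "N/A" or an empty string
        return None


def normalize(raw):
    """Map one raw record onto the common 0-100 scale."""
    imdb = to_number(raw.get("IMDB"))
    return {
        "Title": raw["Title"],
        # IMDB is out of 10, so scale it up; the other two are already 0-100.
        "IMDB_norm": None if imdb is None else round(imdb * 10, 1),
        "RottenTomatoes_norm": to_number(raw.get("RottenTomatoes")),
        "Metacritic_norm": to_number(raw.get("Metacritic")),
    }


row = normalize(
    {"Title": "Inception", "IMDB": "8.8", "RottenTomatoes": "87%", "Metacritic": "74"}
)
```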
The Movie Score (MS) is the simple average of the three normalized scores, written to data/processed/movie_scores.csv:
df['MovieScore'] = df[['IMDB_norm', 'RottenTomatoes_norm', 'Metacritic_norm']].mean(axis=1)
| Title | MovieScore |
|---|---|
| Inception | 83.0 |
| Titanic | 81.0 |
| The Avengers | 80.0 |
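One detail worth making explicit: pandas' `mean(axis=1)` skips missing values by default, so a film lacking one platform's rating would silently be averaged over the remaining two. The sketch below shows the same average with that policy spelled out. `movie_score` and its `require_all` flag are illustrative names, and which policy the pipeline should use is left open here.

```python
from statistics import mean

COLUMNS = ("IMDB_norm", "RottenTomatoes_norm", "Metacritic_norm")


def movie_score(row, require_all=True):
    """Average the normalized scores; None when the policy cannot be met."""
    present = [row[c] for c in COLUMNS if row.get(c) is not None]
    if not present or (require_all and len(present) < len(COLUMNS)):
        return None
    return round(mean(present), 1)


inception = {"IMDB_norm": 88.0, "RottenTomatoes_norm": 87.0, "Metacritic_norm": 74.0}
score = movie_score(inception)  # (88 + 87 + 74) / 3

# A film with no Metacritic score: rejected by default, averaged over two
# platforms only when the caller opts in.
partial = {"IMDB_norm": 80.0, "RottenTomatoes_norm": 88.0, "Metacritic_norm": None}
strict = movie_score(partial)
lenient = movie_score(partial, require_all=False)
```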
All three steps are wired into an Airflow DAG with a @daily schedule, running inside Docker Compose:
extract_data → transform_data → compute_ms
Every day, new scores are fetched automatically and the MS is updated without manual intervention.
Since an unreleased film has no IMDB, Rotten Tomatoes, or Metacritic scores yet, its MS has to be predicted: a machine learning regression model is trained on historical data from films that have already been released.
The training data is a table combining pre-release metadata with the actual MS each film eventually received:
| film_id | genre | director_avg_ms | lead_actor_avg_ms | budget_usd | studio_tier | trailer_views_7d | release_season | actual_ms |
|---|---|---|---|---|---|---|---|---|
| tt1234 | Sci-Fi | 81.2 | 77.5 | 165M | A | 12.4M | Summer | 83.0 |
Candidate features include:
• Director & cast history: average MS of their previous films
• Studio quality score: the studio's average MS over the past 5 years
• Production budget: higher budgets tend to correlate with higher production quality
• Trailer engagement: YouTube/social media views and like ratios in the first week
• Genre: different genres have different historical MS distributions
• Release timing: summer blockbuster vs. awards season (Nov–Dec)
• Screenwriter track record: same logic as director history
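Before a regression model can use a row of the training table, shorthand values such as `165M` and categorical columns such as genre need a numeric form. The sketch below shows one possible encoding. `parse_amount`, `encode_row`, and the one-hot scheme are assumptions for illustration, not the original feature pipeline.

```python
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_amount(text):
    """Turn shorthand like "165M" or "12.4M" into an integer count."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(round(float(text[:-1]) * _SUFFIXES[text[-1]]))
    return int(round(float(text)))


def encode_row(row, genres):
    """Build a flat numeric feature dict from one training-table row."""
    features = {
        "director_avg_ms": float(row["director_avg_ms"]),
        "lead_actor_avg_ms": float(row["lead_actor_avg_ms"]),
        "budget_usd": parse_amount(row["budget_usd"]),
        "trailer_views_7d": parse_amount(row["trailer_views_7d"]),
    }
    # One-hot encode genre against a fixed, caller-supplied vocabulary.
    for genre in genres:
        features[f"genre_{genre}"] = 1 if row["genre"] == genre else 0
    return features


example = {
    "film_id": "tt1234", "genre": "Sci-Fi", "director_avg_ms": "81.2",
    "lead_actor_avg_ms": "77.5", "budget_usd": "165M",
    "trailer_views_7d": "12.4M",
}
encoded = encode_row(example, genres=["Drama", "Sci-Fi"])
```

The remaining columns (studio_tier, release_season) could be encoded the same way as genre.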
Candidate models:
• Random Forest / XGBoost: captures non-linear feature interactions, interpretable via feature importance
• Ridge Regression: robust against overfitting when data is limited
• LightGBM: fast and highly effective on larger historical datasets
Rather than a single number, the model outputs a range:
Predicted MS: 76 ± 5 (95% confidence interval: 71–81)
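The source does not say how the range is derived. One simple option, shown here only as a sketch, is to take the point prediction plus or minus 1.96 standard deviations of the model's errors on held-out films, which assumes those errors are roughly normal. The function name and the residual values are invented for illustration.

```python
from statistics import stdev


def prediction_range(point, holdout_errors, z=1.96):
    """Approximate 95% range around a point prediction, clipped to 0-100."""
    spread = z * stdev(holdout_errors)
    low = max(0.0, point - spread)
    high = min(100.0, point + spread)
    return low, high


# Illustrative residuals (actual MS minus predicted MS) from held-out films.
errors = [-3.0, 2.0, -1.0, 4.0, -2.0, 0.0]
low, high = prediction_range(76.0, errors)
```

Tree ensembles and gradient boosting libraries also offer quantile-based alternatives, which avoid the normality assumption at the cost of training additional models.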
As the release date approaches, the prediction is progressively refined with new signals: critics' screening reactions, film festival reception (Sundance, Cannes, TIFF), social media sentiment analysis, and advance ticket sales data.
A new DAG monitors an "upcoming films" table. When a new entry appears, it triggers feature collection, runs the model, and writes a predicted_ms column. Once the film is released and the real MS is computed, a prediction_error column is populated, providing a continuous feedback loop to retrain and improve the model over time.
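The feedback step can be sketched as below. The sign convention (actual minus predicted), the helper names, and the sample rows are assumptions made for illustration. The column names predicted_ms and prediction_error follow the text above, and actual_ms follows the training table.

```python
def fill_prediction_error(rows):
    """Populate prediction_error for every film whose actual MS is known."""
    for row in rows:
        if row.get("actual_ms") is not None:
            row["prediction_error"] = row["actual_ms"] - row["predicted_ms"]
    return rows


def mean_absolute_error(rows):
    """Summarize model quality over released films; None if there are none."""
    errors = [abs(r["prediction_error"]) for r in rows if "prediction_error" in r]
    return sum(errors) / len(errors) if errors else None


# Illustrative rows: two released films and one still awaiting release.
upcoming = [
    {"predicted_ms": 76.0, "actual_ms": 83.0},
    {"predicted_ms": 80.0, "actual_ms": 78.0},
    {"predicted_ms": 70.0, "actual_ms": None},
]
fill_prediction_error(upcoming)
mae = mean_absolute_error(upcoming)
```

A rising mean absolute error over recent releases is one possible signal for scheduling a retrain.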
| | Part 1 | Part 2 |
|---|---|---|
| Data | Live API scores (IMDB, RT, Metacritic) | Historical MS + pre-release metadata |
| Processing | Normalize → Average | Feature engineering → ML regression |
| Output | Exact MS | Predicted MS with confidence interval |
| Orchestration | Airflow DAG (@daily) | Airflow DAG (event-triggered) |
| Storage | movies_clean.csv → movie_scores.csv | upcoming_films → predicted_ms column |