Predicting Movie Popularity in Indonesia Based On Metadata Using Gradient Boosting

Roiqoh, Aprinia Salsabila (2026) Predicting Movie Popularity in Indonesia Based On Metadata Using Gradient Boosting. Undergraduate thesis, UPN Veteran Jawa Timur.

Text (Cover)
22081010166.-cover.pdf

Download (3MB) | Preview
Text (Chapter 1)
22081010166.-bab1.pdf

Download (1MB) | Preview
Text (Chapter 2)
22081010166.-bab2.pdf
Restricted to Repository staff only until 26 May 2028.

Download (4MB)
Text (Chapter 3)
22081010166.-bab3.pdf
Restricted to Repository staff only until 26 May 2028.

Download (4MB)
Text (Chapter 4)
22081010166.-bab4.pdf
Restricted to Repository staff only until 26 May 2028.

Download (7MB)
Text (Chapter 5)
22081010166.-bab5.pdf

Download (431kB) | Preview
Text (References)
22081010166.-daftarpustaka.pdf

Download (923kB) | Preview
Text (Appendices)
22081010166.-lampiran.pdf
Restricted to Repository staff only

Download (825kB)

Abstract

The Indonesian film industry has grown significantly, yet a film's success in attracting audiences remains difficult to predict accurately. This study develops a model for predicting the number of moviegoers in Indonesia from pre-release metadata using three gradient boosting algorithms: XGBoost, LightGBM, and CatBoost. The dataset was collected from Cinepoint and TMDb and comprised 3,464 initial records, which were reduced to 2,595 after preprocessing. Preprocessing included data cleaning, selective handling of missing values, logarithmic transformation of the target variable, and feature engineering using a Bayesian smoothing approach. The models were trained under two data split scenarios (80:20 and 70:30), and hyperparameters were optimized using Random Search and Bayesian Optimization (Optuna). Model performance was evaluated using RMSE, MAE, MAPE, and R². The best performance was achieved by CatBoost with Random Search under the 80:20 split, with an R² of 0.8729, MAE of 0.5538, RMSE of 0.7698, and MAPE of 5.02%. These results indicate that CatBoost gave more accurate and stable predictions than XGBoost and LightGBM. Hyperparameter tuning also improved model performance in predicting movie audience numbers. Feature importance and SHAP analysis show that main actors, directors, and genres are the most influential features. This indicates that pre-release metadata plays a significant role in determining movie popularity in Indonesia.
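
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the abstract does not give the exact smoothing formula, so the shrinkage form below, the `prior_weight` value, the choice of log(1 + x), the function names, and the toy director and audience values are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def log_target(audience_counts):
    """Compress heavy-tailed audience counts with log(1 + x) (assumed variant)."""
    return [math.log1p(x) for x in audience_counts]


def smoothed_means(categories, targets, prior_weight=10.0):
    """Encode each category by its mean target, shrunk toward the global mean.

    encoded = (n * category_mean + m * global_mean) / (n + m)
    n is the number of rows in the category; m is prior_weight (hypothetical).
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + prior_weight * global_mean) / (counts[cat] + prior_weight)
        for cat in counts
    }
    return encoding, global_mean


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (percent), and R². Assumes nonzero true values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy values for illustration only; they are not taken from the thesis dataset.
directors = ["A", "A", "A", "B"]
audience = [1_000, 2_000, 1_500, 500_000]

y = log_target(audience)
encoding, fallback = smoothed_means(directors, y, prior_weight=2.0)
# Directors never seen in training fall back to the global mean.
feature = [encoding.get(d, fallback) for d in directors]
print(regression_metrics(y, feature))
```

Shrinking toward the global mean keeps a director or actor with a single film from receiving an extreme encoded value. In practice the encoding would be fitted on the training split only, so that test-set targets do not leak into the features.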

Item Type: Thesis (Undergraduate)
Contributors:
Contribution | Contributors | NIDN/NIDK | Email
Thesis advisor | Parlika, Rizky | NIDN0718058401 | rizkyparlika.if@upnjatim.ac.id
Thesis advisor | Aditiawan, Firza Prima | NIDN0023058605 | firzaprima.if@upnjatim.ac.i
Subjects: Q Science > QA Mathematics > QA76.87 Neural computers
Divisions: Faculty of Computer Science > Department of Informatics
Depositing User: Aprinia Salsabila Roiqoh
Date Deposited: 26 May 2026 01:40
Last Modified: 26 May 2026 01:40
URI: https://repository.upnjatim.ac.id/id/eprint/52617
