Using the CMU plot-genre dataset:
- Predict applicable genres
- Find similar movies
The Notebooks folder contains two files:
- Deep Learning (transfer learning on BERT)
- Shallow Learning Exploration

To run predictions: python predict.py [input_file_name]
I explored the data with simple graphs, using matplotlib and NLTK. Guided by a visual inspection of the genre counts, I merged a few semantically duplicate genres. Note that similar movies are found via cosine similarity between plot-text embeddings in the space determined by the requisite model; see the sketch below.
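A minimal sketch of the similarity lookup, assuming the plot embeddings have already been computed with the relevant model (the variable names here are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(query_idx, embeddings, titles, k=5):
    """Return the k movies whose plot embeddings are closest to the query's."""
    # Similarity of the query plot against every plot in the corpus.
    sims = cosine_similarity(embeddings[query_idx:query_idx + 1], embeddings)[0]
    sims[query_idx] = -np.inf           # exclude the movie itself
    top = np.argsort(sims)[::-1][:k]    # indices of the k most similar plots
    return [(titles[i], float(sims[i])) for i in top]
```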
Multi-label Transformer fine-tuning
This was done using the Hugging Face implementation of the tokenizer and BERT. Using a BERT Transformer, I framed a multi-label problem by applying a binary cross-entropy loss to the model output. At prediction time, the outputs are passed through a sigmoid, and the scores of the output vector are thresholded to determine which genres apply. Alternatively, you can rank the scores and take the top k.
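A sketch of this setup with the Hugging Face transformers API (PyTorch flavor; the notebook's exact classes, checkpoint, and hyperparameters may differ, and `plot_text` is an illustrative input):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_GENRES = 344
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# problem_type tells the model to use BCEWithLogitsLoss, i.e. a binary
# cross-entropy per genre. In practice you would load the fine-tuned weights.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_GENRES,
    problem_type="multi_label_classification",
)

plot_text = "A retired hitman is pulled back in for one last job."  # example input
inputs = tokenizer(plot_text, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits)[0]

# Option 1: threshold the per-genre scores.
predicted = (probs > 0.5).nonzero(as_tuple=True)[0]
# Option 2: rank and take the top-k genres by score.
topk = torch.topk(probs, k=3).indices
```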
First fine-tuning the BERT model on a masked-language-modeling (or similar) objective over the movie plots would likely have given a better outcome: the embedding vectors used for prediction would be more attuned to the specific meanings of words in the context of movies. I believe this is the reason the shallow learning did just as well or better.
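A hypothetical sketch of that domain-adaptive MLM step, which was not actually done in this project, assuming `plots` is a list of plot-summary strings:

```python
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenize the raw plots; the collator handles padding and random masking.
encodings = tokenizer(plots, truncation=True, max_length=256)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-plots-mlm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the adapted encoder would then replace the stock BERT weights
```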
LogisticRegression (with various regularizations) and Naive Bayes classification
Here we cleaned out the stopwords ourselves and embedded the plot text with a TF-IDF vectorization. For each of the shallow learners, a one-vs-rest classifier is trained: across the 344 genres, there is one model per genre, classifying genre vs. not genre.
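A rough sketch of this pipeline with scikit-learn. Here the built-in English stop-word list stands in for the manual cleaning, and `train_plots`/`train_genres` are illustrative names for the plot strings and per-movie genre lists:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()               # genre lists -> binary label matrix
Y = mlb.fit_transform(train_genres)       # shape: (n_plots, n_genres)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_plots)

# One binary "genre vs. not genre" model per label.
clf = OneVsRestClassifier(LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X, Y)
```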
Exploring the relationship between word appearance and genre would have been informative. Additionally, dropping the genres with only a few samples would have drastically decreased the number of labels. Finally, visualizing the ROC curves and their AUCs would have shown how well the training data generalized to the testing data.
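For instance, a sketch of that suggested ROC/AUC check (not in the notebooks), assuming `X_test` and `Y_test` are a held-out TF-IDF matrix and label matrix built like the training ones above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, RocCurveDisplay

scores = clf.predict_proba(X_test)        # per-genre probabilities
# Note: genres with no positive test samples must be dropped first,
# or roc_auc_score will raise an error for those columns.
print("macro AUC:", roc_auc_score(Y_test, scores, average="macro"))

# Plot the ROC curve for a single genre, e.g. column 0.
RocCurveDisplay.from_predictions(Y_test[:, 0], scores[:, 0])
plt.show()
```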