
SponsorBlock Net

Advertising on YouTube used to be played as separate clips before, during, or after the actual video. Because these ads are separate videos, conventional ad blockers can remove them. Increasingly, however, advertising is no longer inserted separately but integrated directly into the video by the creators themselves. The browser plugin SponsorBlock was developed so that these integrated advertising segments can still be skipped; it relies on segments being marked manually by SponsorBlock users. In this work, we investigate whether the marking of advertising segments can be automated.

Attribution

Uses SponsorBlock data licensed under CC BY-NC-SA 4.0. More details: https://sponsor.ajay.app/

Running the Project

Directory structure

  • bag_of_words => bag of words based models (naive_bayes + logistic_regression)
  • bert => BERT based model
  • data_analysis => Python notebook to analyze the data using pandas
  • data_explorer => flask application for visually evaluating each model's performance
  • data_split => script for splitting data into training, validation and test sets
  • evaluation => script for evaluating each model's performance on test data (metrics: precision, f1, recall)
  • transcript_fetcher => scripts for downloading transcripts and metadata from YouTube
  • video => video based model

Video dataset

The Video dataset is a directory that contains the video files as well as the corresponding metadata files. The structure is as follows:

data_video/
  meta.json
  training.json
  validation.json
  test.json
  out/

The out/ directory contains the downloaded videos in a tree-like structure (e.g. out/000/000/000/<videoId>.webm), and each entry in the metadata files maps the videoId to the relative path of the video file.
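
As an illustration, a minimal sketch of resolving video paths from the metadata, assuming each entry carries the videoId under "id" and the relative video path under a field called "path" here (the actual field name may differ):

import json
import os

def load_video_paths(dataset_root):
    """Map each videoId to the absolute path of its video file."""
    with open(os.path.join(dataset_root, "meta.json")) as f:
        entries = json.load(f)
    # "path" is a placeholder for whatever field holds the relative video path.
    return {e["id"]: os.path.join(dataset_root, e["path"]) for e in entries}

paths = load_video_paths("data_video")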

Setup

  1. set up a virtualenv: python3 -m venv /path/to/new/virtual/environment
  2. activate the virtual environment: source /path/to/new/virtual/environment/bin/activate
  3. install all required packages: pip install -r requirements.txt
  4. continue with the data preparation steps below

Data preparation

Analysis

To analyze the data, a Python notebook is located in data_analysis/. It is prepopulated with some metrics that have been used to determine the set of usable videos.

Transcript Fetcher

The transcript fetcher downloads English transcripts and metadata for all video URLs provided in a videoids.txt file. The file should have the following format:

https://www.youtube.com/watch?v=ATMi30bGEVc
https://www.youtube.com/watch?v=0d0R-pwnrYA
https://www.youtube.com/watch?v=Kh7z0pGwAqY
https://www.youtube.com/watch?v=kPhfMUqzAqM
https://www.youtube.com/watch?v=UTUMCJomz7U
https://www.youtube.com/watch?v=RT2SWdw7L1g
https://www.youtube.com/watch?v=jDJSUclgDG8
https://www.youtube.com/watch?v=LRfP8r9u3W0
https://www.youtube.com/watch?v=xLDBnytcdDw
https://www.youtube.com/watch?v=Yn55w-8bfTY

and needs to be placed inside the transcript_fetcher directory. Inside the directory, simply call python main.py and the script will download transcripts and metadata for all provided videos at a rate of roughly one video per second. Transcripts and metadata will be placed in the data subdirectory. To avoid overloading the OS filesystem, all data is arranged in a tree-like structure. Furthermore, a logfile is written in which all videos and their paths are persisted. To simplify data loading, a metadata file containing all videos is helpful; running python generate_meta.py generates such a file from the logfile produced in the previous step.
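
For illustration only, fetching an English transcript per URL at roughly one video per second could look like the following sketch, here using the youtube_transcript_api package as an assumed dependency (the actual main.py also downloads metadata and may use a different library):

import re
import time
from youtube_transcript_api import YouTubeTranscriptApi  # assumed dependency for this sketch

with open("videoids.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    video_id = re.search(r"v=([\w-]+)", url).group(1)
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
        print(video_id, len(transcript), "caption entries")
    except Exception as exc:
        print(video_id, "failed:", exc)
    time.sleep(1)  # roughly one video per second, matching the rate mentioned above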

A single entry in the generated metadata file contains general information about a video as well as sponsor times from SponsorBlock:

{
    "id": "MMD0pBs4DlU",
    "directory": "out/000/024/029/MMD0pBs4DlU",
    "channel": "Stephen Hawes",
    "channel_id": "UCMf49SMPnhxdLormhEpfyfg",
    "duration": 629,
    "title": "Running Multi-Feeder Jobs!",
    "subtitles_en": false,
    "captions_en": true,
    "sponsorTimes": [
      {
        "startTime": 535,
        "endTime": 567
      },
      {
        "startTime": 537,
        "endTime": 567
      },
      {
        "startTime": 534,
        "endTime": 567
      }
    ]
  }
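
Several users can submit overlapping markings for the same segment, as in the example above. A minimal sketch of collapsing the sponsorTimes of an entry into one 0/1 label per second (this is only an illustration of the data structure, not necessarily the exact labelling used by the models):

def sponsor_labels(entry):
    """Return one 0/1 label per second of the video (1 = sponsor)."""
    labels = [0] * entry["duration"]
    for segment in entry["sponsorTimes"]:
        start = max(0, int(segment["startTime"]))
        end = min(entry["duration"], int(segment["endTime"]))
        for second in range(start, end):
            labels[second] = 1
    return labels

# For the entry above, seconds 534 to 566 end up labelled as sponsor content.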

Data split

The data was split into training, validation, and test sets (distribution: 70%, 20%, 10%). The split respects channels, so that each channel only occurs in one dataset, to avoid overfitting. To split the data yourself, you first need to generate a list of all channels and their number of videos. This can be done by executing the following command inside the data directory:

cat meta.json | jq ".[] | .channel_id" | sort | uniq -c | sort -nr > channels.txt

and then running python data_split.py inside the data_split directory. To check the distribution of the split, execute:

cat <dataset_name>.json | jq ".[] | .channel_id" | sort | uniq -c | sort -nr | wc -l
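
For reference, a channel-respecting split can be sketched as follows, assuming meta.json is a JSON list of entries with a channel_id field (data_split.py may differ in its details, e.g. in how it balances the 70/20/10 distribution):

import json
import random

with open("meta.json") as f:
    entries = json.load(f)

channels = sorted({e["channel_id"] for e in entries})
random.seed(42)
random.shuffle(channels)

# Assign whole channels to splits so no channel appears in more than one set.
n = len(channels)
train_channels = set(channels[:int(0.7 * n)])
val_channels = set(channels[int(0.7 * n):int(0.9 * n)])

splits = {"training": [], "validation": [], "test": []}
for entry in entries:
    if entry["channel_id"] in train_channels:
        splits["training"].append(entry)
    elif entry["channel_id"] in val_channels:
        splits["validation"].append(entry)
    else:
        splits["test"].append(entry)

for name, subset in splits.items():
    with open(f"{name}.json", "w") as f:
        json.dump(subset, f)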

Training

Bag of words

There are two bag of words based classifiers available (naive_bayes and logistic_regression). Both classifiers are trained on the training dataset by simply calling python naive_bayes_sk.py --train or python logistic_regression_sk.py --train. Training generates a dump of the model which can later be used for evaluation.
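
For reference, this kind of bag of words classifier can be sketched with scikit-learn roughly as follows (the actual scripts add their own snippet loading and model dumping):

import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: transcript snippets and 0/1 sponsor labels.
snippets = ["thanks to our sponsor for supporting this video",
            "today we take a look at the new board"]
labels = [1, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(snippets, labels)

# Dump the trained model so it can be reused during evaluation.
with open("naive_bayes_model.pkl", "wb") as f:
    pickle.dump(model, f)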

BERT

The BERT based model is trained with pytorch lightning, so running it is simple as well. To make sure you are running on the correct GPU (and only on one GPU), set the CUDA_VISIBLE_DEVICES environment variable: export CUDA_VISIBLE_DEVICES=<DEVICE_ID>. Then you can start training by running python classifier.py inside the bert directory. This trains the model and saves it in your current directory. When training, you can choose the length of the snippets given to BERT and whether to finetune the transformer layers.
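
As a rough sketch of the underlying idea (a BERT encoder with a linear classification head on top, optionally frozen), using bert-base-uncased purely as a placeholder checkpoint; the actual classifier.py wires this into pytorch lightning:

from torch import nn
from transformers import BertModel, BertTokenizer

class SnippetClassifier(nn.Module):
    def __init__(self, finetune_bert: bool = False):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        if not finetune_bert:
            for p in self.bert.parameters():
                p.requires_grad = False  # keep the transformer layers frozen
        self.head = nn.Linear(self.bert.config.hidden_size, 2)  # sponsor / no sponsor

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.pooler_output)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["thanks to our sponsor"], padding=True, truncation=True,
                  max_length=128, return_tensors="pt")
logits = SnippetClassifier()(batch["input_ids"], batch["attention_mask"])  # shape (1, 2)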

Frame based classification

The model to classify individual frames is based on pytorch lightning as well. To make sure you are running on the correct GPU (and only on one GPU), set the CUDA_VISIBLE_DEVICES environment variable: export CUDA_VISIBLE_DEVICES=<DEVICE_ID>.

Before training can be executed, the training, validation, and test data have to be preloaded from the video dataset. This can be done by configuring the parameters in video/preload.py and then running python video/preload.py <video_data_path>, where <video_data_path> is the path to the directory that contains the metadata and video files. This results in three pickle files containing the preloaded data.
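
The exact preloading logic lives in video/preload.py; as a rough illustration of the idea (sampling frames from a video file and pickling them), assuming OpenCV is available:

import pickle
import cv2

def sample_frames(video_path, every_n_seconds=1, size=(224, 224)):
    """Grab one resized frame every n seconds from a video file."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = max(1, int(fps * every_n_seconds))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames

# Hypothetical output file name; preload.py writes one pickle per data split.
with open("training_frames.pkl", "wb") as f:
    pickle.dump(sample_frames("data_video/out/000/000/000/<videoId>.webm"), f)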

Training can then be started by running python video/main.py. During training, pytorch lightning saves the best three models to ./auto_model_checkpoints/.
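
This checkpointing behaviour corresponds to a pytorch lightning ModelCheckpoint callback roughly like the following (the monitored metric is an assumption; main.py may monitor something else):

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    dirpath="./auto_model_checkpoints/",
    save_top_k=3,        # keep the best three models
    monitor="val_loss",  # assumed metric
    mode="min",
)
# trainer = pytorch_lightning.Trainer(callbacks=[checkpoint_cb], ...)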

You can also search for an optimal set of hyperparameters using ray by running python video/tune_hyperparams.py.

Evaluation

Bag of Words

Both bag of words models provide an easy-to-use command line interface for evaluation. Available options (replace model.py with naive_bayes_sk.py or logistic_regression_sk.py):

  • python model.py --apply: Apply the model to a single, randomly chosen video and print each snippet with ground truth + prediction. Pressing any key restarts the process with a new video
  • python model.py --validate_boxes: Applies the model to all videos from the validation dataset and reports snippets sorted by sponsor probability
  • python model.py --validate_auto: Applies the model to all videos from the validation dataset and reports precision, recall and f1 for various thresholds
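
The threshold sweep behind --validate_auto can be sketched as follows, assuming arrays of ground truth labels and predicted sponsor probabilities (values here are made up):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1])              # hypothetical per-snippet ground truth
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.9])  # hypothetical sponsor probabilities

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred),
          f1_score(y_true, y_pred))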

BERT

There are two different ways to evaluate your BERT based model. To view the current metrics (precision, recall, f1), simply run the training and open tensorboard (tensorboard --logdir ./lightning_logs); it shows the current metrics on the validation set. The other way of evaluating BERT is the evaluate.py script inside the bert directory, which behaves similarly to the --apply option of the bag of words based models.

Frame based classification

During the training of the frame based classification model log data is saved to ./lightning_logs. You can use tensorboard to view the logs by running tensorboard --logdir ./lightning_logs.

Evaluation on test data

To evaluate a trained model on the test data, you can use the evaluation script in the evaluation/ directory. This script evaluates both the BERT and the frame based classification models, as well as their fused result, using precision, recall, and F1 metrics. Additionally, it computes the ROC of all three models and saves the resulting data as well as a plot to the current directory.
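
For reference, the ROC data for one model can be computed and plotted roughly like this (scores here are made up; the script does this for all three models):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1])               # hypothetical ground truth
y_score = np.array([0.1, 0.4, 0.8, 0.6, 0.3, 0.9])  # hypothetical model scores

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"model (AUC = {auc(fpr, tpr):.2f})")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.png")  # saved to the current directory, like the evaluation script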

Data explorer

Data explorer is a simple flask application that can be used to view and evaluate samples from the validation set. It can be run by executing run.sh inside the data_explorer directory. Before running, make sure that CUDA_VISIBLE_DEVICES is set correctly and that you have already trained both the video and the text based model. Since the BERT tokenizer has problems with parallelism, set export TOKENIZERS_PARALLELISM=false to avoid excessive warnings. After starting data explorer, simply open your browser at http://localhost:5543. Data explorer chooses a random video, which is shown at the top of the page. Below the video, the ground truth as well as the predictions can be found. Every 5 pixels from top to bottom represent 1 second of the video, with green being normal video and red being a sponsor segment. An example can be seen below.

[data_explorer screenshot]
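
The colour strip can be reproduced roughly like the following sketch (using Pillow as an assumed dependency): every second of video becomes a 5 pixel tall row, green for normal content and red for sponsor segments.

from PIL import Image

def render_strip(labels, width=40, px_per_second=5):
    """labels: one 0/1 entry per second of video (1 = sponsor segment)."""
    img = Image.new("RGB", (width, len(labels) * px_per_second))
    for second, label in enumerate(labels):
        colour = (200, 40, 40) if label else (40, 180, 60)  # red vs green
        for y in range(second * px_per_second, (second + 1) * px_per_second):
            for x in range(width):
                img.putpixel((x, y), colour)
    return img

render_strip([0] * 30 + [1] * 10 + [0] * 20).save("strip.png")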
