This repository contains supplementary material for the paper *An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes*, published at ACM RecSys 2021 (final version DOI: 10.1145/3460231.3474241; preprint DOI: 10.48550/arXiv.2203.13769), and for its extended version titled *Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles*, published in the ACM Transactions on Recommender Systems (TORS) journal (final version DOI: 10.1145/3568392; preprint DOI: 10.48550/arXiv.2210.10085).
If you make use of any data or modules in this repository, please cite the following papers:
Matus Tomlein, Branislav Pecher, Jakub Simko, Ivan Srba, Robert Moro, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, and Maria Bielikova. 2021. An Audit of Misinformation Filter Bubbles on YouTube: Bubble Bursting and Recent Behavior Changes. In Fifteenth ACM Conference on Recommender Systems (RecSys '21). Association for Computing Machinery, New York, NY, USA, 1–11. DOI: https://doi.org/10.1145/3460231.3474241
Ivan Srba, Robert Moro, Matus Tomlein, Branislav Pecher, Jakub Simko, Elena Stefancova, Michal Kompan, Andrea Hrckova, Juraj Podrouzek, Adrian Gavornik, and Maria Bielikova. 2023. Auditing YouTube’s Recommendation Algorithm for Misinformation Filter Bubbles. ACM Transactions on Recommender Systems. 1, 1, Article 6 (March 2023), 33 pages. DOI: https://doi.org/10.1145/3568392
In this paper, we present the results of an auditing study performed on YouTube, aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to "burst the bubble", i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve into misinformation filter bubbles by watching misinformation-promoting content. Then they try to burst the bubbles and reach more balanced recommendations by watching misinformation-debunking content. We record search results, home page results, and recommendations for the watched videos. Overall, we recorded 17,405 unique videos, out of which we manually annotated 2,914 for the presence of misinformation. The labeled data was used to train a machine learning model classifying videos into three classes (promoting, debunking, neutral) with an accuracy of 0.82. We use the trained model to classify the remaining videos, which it would not be feasible to annotate manually.
Using both the manually and automatically annotated data, we observe the misinformation bubble dynamics for a range of audited topics. Our key finding is that even though filter bubbles do not appear in some situations, when they do, it is possible to burst them by watching misinformation-debunking content (although this manifests differently from topic to topic). We also observe a sudden decrease in the misinformation filter bubble effect when misinformation-debunking videos are watched after misinformation-promoting videos, suggesting a strong contextuality of recommendations. Finally, when comparing our results with a previous similar study, we do not observe significant improvements in the overall quantity of recommended misinformation content.
To support future research in the field of auditing adaptive systems, whether for misinformation or other phenomena, we publish in this repository all source code, the collected and annotated data, as well as the data analysis notebooks. However, due to ethical concerns (see Section 4.7 in the ACM TORS paper), we do not publish the automatic annotations predicted by the trained machine learning models. In addition, we do not publish metadata (such as titles, descriptions, or transcripts) for the collected YouTube videos; only YouTube IDs are included in the dataset. However, we provide the source code to retrain the machine learning models, as well as the means to download the metadata using the YouTube API. Please also note that reproducibility may suffer to some extent due to the dynamic nature of the platform: some of the videos we used for seeding, or encountered, may no longer be available.
As for the machine learning models, we use two models from related work, namely by Hou et al. (2019) and by Papadamou et al. (2022). We provide our own implementation of the former, which can be found in this notebook. For the latter, we reuse the source code published by the authors; a modified version of their source code that works with our dataset and set of labels can be found in a separate GitHub repository.
This repository is structured into three folders:
- Code – source code for sockpuppeting bots
- Data – collected, processed and annotated datasets
- Notebooks – notebooks for data analysis containing results discussed in the paper
See the README file under the `Code` folder to learn more.
Note: In our experiments, the bot ran in the Google Chrome browser, version 88, with chromedriver version 88.0.4324.96. The Python version used was 3.8.7, with the Dockerfile based on Debian version 10.7.
As an ad blocker, we used uBlock Origin, which is provided in the code as a `.crx` file.
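For illustration, the following is a minimal sketch (not the repository's actual bot code) of how such a browser session can be initialized with Selenium, loading the bundled uBlock Origin extension; the extension file name and the chromedriver setup are assumptions:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Load the uBlock Origin extension shipped as a .crx file in the Code
# folder (file name assumed here).
options.add_extension("ublock-origin.crx")

# Assumes a matching chromedriver (88.0.4324.96) is available on the PATH.
driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com")
```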
We provide three CSV datasets with raw data (contained in the `raw_data` directory):

- `search_results.csv` containing annotated and processed top-20 results for queries executed after watching videos on YouTube.
- `recommendations.csv` containing annotated and processed top-20 recommendations shown next to watched videos on YouTube.
- `home_page_results.csv` containing collected and processed results from homepage visits executed after watching videos.
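The raw datasets are plain CSV files and can be loaded, for example, with pandas (the paths below are assumed relative to the repository root):

```python
import pandas as pd

# Assumed locations of the raw datasets within the Data folder.
search_results = pd.read_csv("Data/raw_data/search_results.csv")
recommendations = pd.read_csv("Data/raw_data/recommendations.csv")
home_page_results = pd.read_csv("Data/raw_data/home_page_results.csv")

print(len(search_results), len(recommendations), len(home_page_results))
```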
We provide four additional datasets with mappings of videos to their normalized labels (contained in the `normalized_data` directory):

- `encountered_videos.csv` containing normalized labels for the videos we encountered and then annotated during the experiments. The file was obtained by running the `normalize-annotations.ipynb` notebook.
- `seed_videos.csv` containing the videos we used as seeds for running the experiments, along with their assigned labels and topics.
- `train.csv` containing the manually labeled videos we used for training the models in the extended version of the paper. Only the `youtube_id` and `annotation` columns contain values; the other columns need to be filled in via the YouTube API (this can be done using the `get-train-and-encountered-data.ipynb` notebook; see the sketch below).
- `videos_metadata.csv` containing the videos for which we were able to retrieve metadata. Only the `youtube_id`, `duration_seconds`, `duration_minutes`, `duration_hours`, `encountered_home`, `encountered_search`, `encountered_recommend`, and `encountered_all` columns contain values; the other columns need to be filled in via the YouTube API (again using the `get-train-and-encountered-data.ipynb` notebook).
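As an illustration of the metadata retrieval step, a hedged sketch using the YouTube Data API v3 (via `google-api-python-client`) is shown below; the API key, the file path, and the choice of `part` parameters are assumptions, and the actual code lives in the `get-train-and-encountered-data.ipynb` notebook:

```python
import pandas as pd
from googleapiclient.discovery import build

# Sketch only; see get-train-and-encountered-data.ipynb for the real code.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # key assumed

train = pd.read_csv("Data/normalized_data/train.csv")  # path assumed

# The videos.list endpoint accepts up to 50 IDs per call.
batch = train["youtube_id"].head(50).tolist()
response = youtube.videos().list(
    part="snippet,contentDetails",
    id=",".join(batch),
).execute()

for item in response.get("items", []):
    print(item["id"], item["snippet"]["title"])
```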
We also provide two additional datasets with aggregated data that include automatically generated predictions made by a machine learning model (contained in the `predicted_data` directory):

- `recommendations_with_predicted_grouped.csv` containing the misinformation score and the ratio of annotated to automatically predicted labels for top-10 recommendations, grouped by misinformation topic and sequence index within the experiment.
- `home_page_with_predicted_grouped.csv` containing the misinformation score and the ratio of annotated to automatically predicted labels for home page results, grouped by misinformation topic and sequence index within the experiment.
Each row of `search_results.csv` represents one search result displayed on YouTube.
Please refer to the paper for a discussion of the annotation classes.
Column | Example | Description |
---|---|---|
youtube_id | nbmMwMQEK9Y | YouTube ID of the video in the search result |
bot_id | 5 | Identifier of the bot performing the search |
topic | 911 | Identifier of the conspiracy topic of the videos the bot was watching and searching for |
experiment | 911 | Identifier of the overall executed experiment (in this case, same as topic) |
query | 9/11 conspiracy | Search query used for these search results |
position | 13 | Position within the list of search results |
sequence_number | 219 | Ordering of this search action within all actions executed by the bot |
seed_sequence | 48 | Ordering of this search action within search actions executed by the bot (0 to 80) |
sequence_name | 48 | Label for ordering of this search action within search actions executed by the bot (0 to 80) |
annotation | 2 | Number code of the annotation given to the video with respect to the topic |
normalized_annotation | -1 | Number code of the annotation normalized to range -1 to 1 |
annotation_label | (2) debunking unrelated | Readable label of the annotation |
started_at | 2021-03-17 18:18:09.815451 | Timestamp of the search action |
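For example, the table can be inspected with pandas to tabulate annotation labels per audited topic (a sketch, not the analysis code from the notebooks; the path is assumed):

```python
import pandas as pd

search = pd.read_csv("Data/raw_data/search_results.csv")  # path assumed

# Distribution of annotation labels per audited topic.
print(search.groupby("topic")["annotation_label"].value_counts())
```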
Each row of `recommendations.csv` represents one recommended video displayed on YouTube among the top-20 recommendations beside a watched video.
Please refer to the paper for a discussion of the annotation classes.
Column | Example | Description |
---|---|---|
watched_youtube_id | 7aNnDjQxBNQ | YouTube ID of the watched video next to which the recommendations were displayed |
youtube_id | nJZBqmGLHQ8 | YouTube ID of the recommended video |
bot_id | 5 | Identifier of the bot watching the video |
topic | 911 | Identifier of the conspiracy topic of the videos the bot was watching and searching for |
experiment | 911 | Identifier of the overall executed experiment (in this case, same as topic) |
position | 9 | Position of the recommended video within the list of recommendations |
sequence_number | 144 | Ordering of this video watching action within all actions executed by the bot |
seed_sequence | 32 | Ordering of this video watching action within video watching actions executed by the bot (0 to 80) |
sequence_name | 32 | Label for ordering of this video watching action within video watching actions executed by the bot (0 to 80) |
annotation | 5 | Number code of the annotation given to the recommended video with respect to the topic |
normalized_annotation | 0 | Number code of the annotation normalized to range -1 to 1 |
annotation_label | (5) not about misinfo | Readable label of the annotation |
normalized_label | other | Readable label of the annotation normalized to range -1 to 1 |
started_at | 2021-03-25 10:00:33.745248 | Timestamp of the video watching action |
Each row of `home_page_results.csv` represents one homepage result displayed on YouTube.
Please refer to the paper for a discussion of the annotation classes.
Note: This dataset was not annotated. Some annotations are nevertheless present because the same videos also appeared in recommendations or search results.
Column | Example | Description |
---|---|---|
youtube_id | Ds390gg6Kqs | YouTube ID of the video appearing on homepage |
bot_id | 4 | Identifier of the bot performing the visit to homepage |
topic | chemtrails | Identifier of the conspiracy topic of the videos the bot was watching and searching for |
experiment | chemtrails | Identifier of the overall executed experiment (in this case, same as topic) |
position | 15 | Position within the list of homepage results, going from top left to bottom right |
sequence_number | 1 | Ordering of this homepage action within all actions executed by the bot |
seed_sequence | 0 | Ordering of this homepage action within homepage actions executed by the bot (0 to 80) |
sequence_name | A: start | Label for ordering of this homepage action within homepage actions executed by the bot (0 to 80) - in this case, corresponds to 0 |
annotation | -2 | Number code of the annotation given to the video with respect to the topic (in this case, the video was not annotated) |
normalized_annotation | | Number code of the annotation normalized to range -1 to 1 (left empty here, as the video was not annotated) |
annotation_label | not annotated | Readable label of the annotation |
normalized_label | not annotated | Readable label of the annotation normalized to range -1 to 1 |
started_at | 2021-03-10 10:39:54.398890 | Timestamp of the homepage action |
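Given the note above, a simple way to restrict this dataset to rows that carry an annotation is to drop rows with an empty `normalized_annotation` (a sketch, assuming the column is empty exactly for unannotated videos, as described in the table; the path is assumed):

```python
import pandas as pd

home = pd.read_csv("Data/raw_data/home_page_results.csv")  # path assumed

# The homepage dataset itself was not annotated; labels appear only where
# the video also occurred in search results or recommendations.
annotated = home.dropna(subset=["normalized_annotation"])
print(f"{len(annotated)} of {len(home)} homepage rows carry an annotation")
```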
The aggregated datasets for top-10 recommendations and home page results also consider automatically predicted annotations. Due to ethical risks, we only publish aggregated statistics.
Column | Example | Description |
---|---|---|
topic | chemtrails | Identifier of the conspiracy topic of the videos the bot was watching and searching for |
seed_sequence | 0 | Ordering of this action within all actions executed by the bot (0 to 80) |
score | 0.11 | Average number code of the annotation normalized to range -1 to 1 for the considered videos |
annotated | | Ratio of manually annotated videos out of all considered; labels for the remaining videos were automatically predicted using machine learning |
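The `score` column can be partially cross-checked against the raw data; the sketch below recomputes it from the manually annotated subset of top-10 recommendations (zero-based positions and the file path are assumptions, and the published files additionally fold in the unpublished model predictions, so the numbers will differ):

```python
import pandas as pd

rec = pd.read_csv("Data/raw_data/recommendations.csv")  # path assumed

# Top-10 recommendations only (zero-based positions assumed), restricted to
# manually annotated rows.
top10 = rec[rec["position"] < 10].dropna(subset=["normalized_annotation"])

# Average normalized annotation per topic and sequence step, mirroring the
# `score` column of the published aggregated files.
score = (
    top10.groupby(["topic", "seed_sequence"])["normalized_annotation"]
    .mean()
    .rename("score")
    .reset_index()
)
print(score.head())
```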
This folder contains the following Jupyter notebooks:

- `rq1-compare-results-with-hussein.ipynb` contains analyses related to the first research question discussed in the paper.
- `rq2-statistical-tests.ipynb` contains analyses related to the second research question discussed in the paper.
- `rq2-trends.ipynb` contains visualizations of changes in misinformation scores over the experiments discussed in the paper and the computation of the `DIFF-TO-LINEAR` measure.
- `normalize-annotations.ipynb` contains code for obtaining the normalized labels for the annotated videos from the raw data.
- `get-train-and-encountered-data.ipynb` contains code for downloading and processing video metadata and transcripts using the YouTube API.
- `reimplemented-model-by-hou.ipynb` contains the reimplemented model by Hou et al. discussed in the extended version of our paper.
- `videos-statistics.ipynb` contains code for computing descriptive statistics of the encountered videos presented in the paper.