The goal of this data collection tool is to archive a user's behavior and the digital traces of their web experiences. We use a two point approach, involving the collection of ecological and controlled data. Ecological data includes a real user's behavior, their web account histories, and in some cases, real-time snapshots of the websites they visit. Controlled data includes snapshots of websites that we take from a user's computer, including select keyword searches on Google Search or YouTube.
This extension was used to collect data for the following studies:
- Robertson, R. E., Green, J., Ruck, D. J., Ognyanova, K., Wilson, C., & Lazer, D. (2023). Users choose to engage with more partisan news than they are exposed to on Google Search. Nature, 618, 342–348. DOI: 10.1038/s41586-023-06078-5
- Chen, A. Y., Nyhan, B., Reifler, J., Robertson, R. E., & Wilson, C. (2023). Subscriptions and external links help drive resentful users to alternative and extremist YouTube channels. Science Advances, 9(35). DOI: 10.1126/sciadv.add8080
- Gleason, J., Hu, D., Robertson, R. E., & Wilson, C. (2023). Google the gatekeeper: How search components affect clicks and attention. Proceedings of the International AAAI Conference on Web and Social Media (ICSWM 2023), 17, 245–256. DOI: 10.1609/icwsm.v17i1.22142
We used a browser extension in order to collect four types of data about users’ web browsing.
- Where people go (monitor API - passive)
Collection: tracking the meta data of users’ path through the web, including the URLs they visit and when they visited them.
Scope: absolute – we record everything (i.e. independent browser history). - What people saw (snapshot API - passive)
Collection: saving a copy of the HTML that was rendered when a user visited a URL.
Scope: filtered to trigger on a set of pre-selected web domains, including, Google Search, YouTube, Facebook Newsfeed, Twitter Feed.
- What people would have seen (snapshot API - active)
Collection: visiting URLs from users’ computers and saving a copy of the HTML that renders.
Scope: limited to pre-selected URLs (e.g. the YouTube homepage or a fixed Google Search) or dynamically discovered URLs (e.g. recursive algorithm interrogation). Can collect in a normal and a private window to measure personalization. - What websites know about people (history API - active)
Collection: automating the collection of data from various web services and accounts, including their browsing history, Google account, and ad preferences.
Scope: limited to pre-selected accounts and services, and by the kinds of data those services provide access to.
The browser extension is written in JavaScript, HTML, CSS, and the WebExtensions API, which is compatible with Firefox and Chrome. It consists of APIs, Web Workers, and a system for communicating between them.
Throughout the project, we use jsdoc
to document the browser extension code. To rebuild the documentation, install node
and jsdocs
, then run: bash ./scripts/make_docs.sh
We used a Flask app built to receive and store data sent from extension in a MySQL database. Data storage format is specified in application/models.py
and matches the format above.
To start the server for local testing:
- Create a virtual environment (
python3.6
) and installrequirements.txt
- Open
./scripts/start_plaform.sh
and changeFLASK_ENV
to your virtualenv path - Run
bash ./scripts/start_platform.sh
- Flask will begin running on http://0.0.0.0:80/ and logging to the console
- Data targeted at the server (e.g. http://x.y.z.a:80/save_data) will be saved to a SQL database specified in
./sqlplatform/config.py