Skip to content

epfl-ada/ada-2024-project-the5outliers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The5Outliers - Wikispeedia Voyages: Why so many people use World Regions to reach their target.

You can look at our website here.

Abstract

While playing Wikispeedia, we observed that we tend to adopt a strategy of navigating through articles about World Regions to reach the target, an approach we term Wikispeedia Voyages. A Voyage is defined as a path where neither the source nor target is in the World Regions category, but the path includes at least one article from this category. The Wikispeedia dataset collects both information about players' behaviour and the network structure of Wikipedia articles. Our aim is to understand to what extent the article and network structure, analyzed through Markov Chains, influences gameplay and the choice to undertake Voyages. In parallel, user difficulty in Voyages are also compared to other strategies, their alignment with shortest paths, as well as their insights from semantic similarity of article names along the paths.

Our results show that World Regions articles are highly connected, with a dense network of links. Moreover, a Markov Chains analysis showed that users navigate through this category more frequently than random walks in the network could suggest. Interestingly, a comparison of user paths with optimal (shortest) paths reveals that optimal paths leverage World Regions more often than users, suggesting that Voyages are effective strategies that players might underuse. Users achieve a higher success rate in reaching their targets when employing Wikispeedia Voyages, but seem to take longer and need more back-clicks, which could be due to the lesser semantic similarity observed throughout Voyages. However, only a small subset of articles within World Regions plays a particularly significant role in facilitating successful navigation, inviting to rather consider only larger countries, continents or regions for Voyages.

Research questions

  1. Does the Wikispeedia article and network structure intrinsically favour Wikispeedia Voyages? For example, are World Regions more numerous or more connected? Does the page structure of articles have an influence on Wikispeedia Voyages?
  2. Are users faster or more efficient when taking Wikispeedia Voyages, or do they take semantic detours that could complicate the path?
  3. How does the strategy compare with the algorithmic shortest paths?

Methods

HTML parsing

Parsing allows to find interesting features of the wikipedia articles: the number of words and the total number of links, taking in acount duplicates inside the same page. It also gives structural information about the pages: categories, subcategories, the presence and nature of tables. For each of these structures, the number of words and the list of present links is reported.

Networks and Graphs

Optimal (shortest) paths are determined by modeling the Wikispeedia article network as a directed graph, where nodes represent articles and edges represent hyperlinks between them. For each source-target pair present in the users' games, we compute all shortest paths, ensuring that every minimum-length route is considered.

Markov Chains

Markov Chains are used to model the influence of the network structure that could inherently bias user paths. Every article is assigned transition probabilities to all other articles based on the number of links present in this article. The transition matrix's $Q$ normalised left eigenvector with eigenvalue $1$ (i.e. $x$ such that $xQ=x$) gives the steady-state of the system (useful to find central articles of the network).

To compare with the user paths, we can count the number of transitions at every step and regroup them in a matrix $P$. To see the deviation with the random path we use the Kullback–Leibler ($KL$) divergence, defined element-wise as

$$ D_{KL}(P || Q)_{ij} = P_{ij} \log \frac{P_{ij}}{Q_{ij}} $$

if $P_{ij} > 0$ and $Q_{ij} > 0$ and $0$ otherwise. Higher divergence means that users deviate from the random probabilities.

Difficulty Metrics

The distributions of game duration, path length, back-clicks and ratings for both Voyages and non-Voyages are extracted and compared. The proportion of finished paths is also calculated and compared.

Semantic similarity

The semantic similarity matrices are computed in a few different ways. One way is to compute them directly through the article names using BGEM31 and BERT as embedding model. The similarity between two articles with embedded name vectors a1 and a2 is defined as the cosine similarity.

$$S_C(a_1, a_2)=\frac{a_1 \cdot a_2}{||a_1|| \cdot ||a_2||}$$

Assignment of methods to research questions

  1. Leverage features from HTML parsing and Markov Chains to evaluate article connectivity, transition probabilities, and the influence of link positions on user choices.
  2. Compare difficulty metrics and success rates between Voyages and other strategies. Use semantic similarity analysis to assess whether Voyages exhibit lower cosine similarity between steps compared to other paths taken by users. This allows to find out if users may take more difficult detours with Voyages.
  3. Construct a directed graph to compute optimal paths, and calculate the normalized percentage of times each category is visited. Compare with how often users visit categories along the paths and compare with the optimal paths.

Additional Datasets

No additional datasets are needed to answer the research questions.

Contributions

Our team of five collaborated on the initial exploratory analysis, identifying research questions, and drafting the data story. We then divided specific tasks more concretely:

Camille Challier: Difficulty metrics, page structure analysis, and random path comparisons.

Yannick Detrois: HTMLParser, Website Design and Redaction, Markov Chains, Similarity along paths

David Friou: Data preprocessing, handling category articles, Users and Structure Networks, and proper connection and functionality of the code.

Marine Ract: User Networks, Markov chains, Sankey plots, Website Design.

Marianne Scoglio: Extraction of Voyages, comparaison between user and optimal paths.

References

[1] Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv, 2024.

Quickstart

# clone project
git clone https://github.com/epfl-ada/ada-2024-project-the5outliers.git

# create conda called 'ada_p' with all required packages
conda env create -f requirements.yml

How to use the library

All the results are in the results.ipynb. Running the notebook will showcase the different functionalities and models defined under src.

Project Structure

The directory structure of our project looks like this:

├── data                        <- Project data files
│
├── src                         <- Source code
│   ├── data                            <- Data directory
│   ├── models                          <- Model directory
│   ├── utils                           <- Utility directory
│
├── results.ipynb               <- Our Notebook showing our main results 
│
├── .gitignore                  <- List of files ignored by git
├── requirements.yml            <- File for installing python dependencies
├── config.py                   <- File with the colors dictionary and abbreviations
└── README.md

About

ada-2024-project-the5outliers created by GitHub Classroom

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published