You can look at our website here.
While playing Wikispeedia, we observed that we tend to navigate through articles about World Regions to reach the target, a strategy we term Wikispeedia Voyages. A Voyage is defined as a path where neither the source nor the target is in the World Regions category, but the path includes at least one article from this category. The Wikispeedia dataset records both players' behaviour and the network structure of Wikipedia articles. Our aim is to understand to what extent the article and network structure, analyzed through Markov Chains, influences gameplay and the choice to undertake Voyages. In parallel, we compare user difficulty in Voyages with that of other strategies, examine how well Voyages align with shortest paths, and draw insights from the semantic similarity of article names along the paths.
Our results show that World Regions articles are highly connected, with a dense network of links. Moreover, a Markov Chains analysis showed that users navigate through this category more frequently than random walks on the network would suggest. Interestingly, a comparison of user paths with optimal (shortest) paths reveals that optimal paths leverage World Regions more often than users do, suggesting that Voyages are an effective strategy that players might underuse. Users achieve a higher success rate in reaching their targets when employing Wikispeedia Voyages, but seem to take longer and need more back-clicks, which could be due to the lower semantic similarity observed along Voyages. However, only a small subset of World Regions articles plays a particularly significant role in facilitating successful navigation, which suggests restricting Voyages to larger countries, continents or regions.
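The Voyage definition above can be sketched as a small predicate. This is only an illustration, not the project's extraction code: the function name `is_voyage`, the `categories` mapping (article name to a set of category labels) and the label `"World Regions"` are assumptions made for the example.

```python
def is_voyage(path, categories, region_category="World Regions"):
    """Return True if `path` matches the Voyage definition.

    `path` is a list of article names; `categories` is a hypothetical
    mapping from article name to the set of categories it belongs to.
    """
    if not path:
        return False
    # Flag every article on the path that belongs to the region category.
    in_region = [region_category in categories.get(a, ()) for a in path]
    # Source and target must be outside the category,
    # and at least one intermediate article must be inside it.
    return not in_region[0] and not in_region[-1] and any(in_region[1:-1])
```

For instance, a path `["Cat", "France", "Wine"]` would count as a Voyage if only `"France"` carries the World Regions label, whereas a path starting at `"France"` would not.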
- Does the Wikispeedia article and network structure intrinsically favour Wikispeedia Voyages? For example, are World Regions more numerous or more connected? Does the page structure of articles have an influence on Wikispeedia Voyages?
- Are users faster or more efficient when taking Wikispeedia Voyages, or do they take semantic detours that could complicate the path?
- How does the strategy compare with the algorithmic shortest paths?
Markov Chains are used to model the influence of the network structure, which could inherently bias user paths. Every article is assigned transition probabilities to all other articles based on the number of links present in that article; these probabilities form the entries of the transition matrix.
To compare with the user paths, we count the number of transitions at every step and collect them in a matrix.
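As a rough illustration of the two matrices described above, the sketch below row-normalises link counts into transition probabilities and counts observed user transitions. It is written under assumptions: the function names, the `index` mapping from article name to row number, and the choice of leaving articles without outgoing links as all-zero rows are ours, not necessarily the project's. It also assumes back-clicks have already been resolved in the paths.

```python
import numpy as np

def transition_matrix(link_counts):
    """Row-normalise a square matrix of link counts into probabilities.

    Entry (i, j) of the input is the number of links from article i to j.
    """
    counts = np.asarray(link_counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Articles with no outgoing links keep an all-zero row (an assumption).
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def count_user_transitions(paths, index):
    """Count how often users moved from one article to the next.

    `paths` is a list of article-name lists; `index` maps names to rows.
    """
    n = len(index)
    counts = np.zeros((n, n))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[index[src], index[dst]] += 1
    return counts
```

Passing the user counts through `transition_matrix` as well would put both matrices on the same probability scale, which makes them directly comparable.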
The semantic similarity matrices are computed in a few different ways. One way is to compute them directly from the article names, using BGE M3 [1] and BERT as embedding models. The similarity between two articles with embedded name vectors a1 and a2 is defined as their cosine similarity.
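A minimal sketch of the cosine similarity between two embedded article names, and of the similarity between consecutive steps of a path, might look as follows. The `embeddings` dictionary is a hypothetical stand-in for vectors produced by an embedding model; no model is loaded here.

```python
import numpy as np

def cosine_similarity(a1, a2):
    """Cosine of the angle between two embedding vectors."""
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    return float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2)))

def path_similarities(path, embeddings):
    """Similarity between each pair of consecutive articles on a path.

    `embeddings` maps article names to precomputed vectors (assumed input).
    """
    return [cosine_similarity(embeddings[u], embeddings[v])
            for u, v in zip(path, path[1:])]
```

Lower values along a path would indicate larger semantic jumps between consecutive articles.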
- Leverage features from HTML parsing and Markov Chains to evaluate article connectivity, transition probabilities, and the influence of link positions on user choices.
- Compare difficulty metrics and success rates between Voyages and other strategies. Use semantic similarity analysis to assess whether Voyages exhibit lower cosine similarity between steps than other paths taken by users. This reveals whether users take more difficult detours with Voyages.
- Construct a directed graph to compute optimal paths, and calculate the normalized percentage of times each category is visited. Compare how often users visit each category along their paths with how often the optimal paths do.
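The last step above can be sketched with a breadth-first search on the directed link graph, followed by a per-category visit share. This is an illustration under assumptions: the graph is given as a plain dictionary of outgoing links, each article has a single category in `category_of`, and "normalized percentage" is taken here to mean the share of intermediate-article visits, which may differ from the normalization used in the notebook.

```python
from collections import Counter, deque

def shortest_path(links, source, target):
    """Breadth-first search on a directed graph {article: [linked articles]}.

    Returns one shortest path as a list of articles, or None if unreachable.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in links.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

def category_visit_percentages(paths, category_of):
    """Percentage of intermediate-article visits falling in each category."""
    counts = Counter(category_of[a] for path in paths for a in path[1:-1])
    total = sum(counts.values())
    return {c: 100 * n / total for c, n in counts.items()} if total else {}
```

Running `category_visit_percentages` once on user paths and once on the corresponding shortest paths gives two distributions that can be compared category by category.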
No additional datasets are needed to answer the research questions.
Our team of five collaborated on the initial exploratory analysis, identifying research questions, and drafting the data story. We then divided specific tasks more concretely:
Camille Challier: Difficulty metrics, page structure analysis, and random path comparisons.
Yannick Detrois: HTMLParser, Website Design and Writing, Markov Chains, Similarity along paths.
David Friou: Data preprocessing, handling category articles, Users and Structure Networks, and proper connection and functionality of the code.
Marine Ract: User Networks, Markov chains, Sankey plots, Website Design.
Marianne Scoglio: Extraction of Voyages, comparison between user and optimal paths.
[1] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv, 2024.
# clone project
git clone https://github.com/epfl-ada/ada-2024-project-the5outliers.git
# create a conda environment called 'ada_p' with all required packages
conda env create -f requirements.yml
All the results are in the results.ipynb notebook. Running the notebook will showcase the different functionalities and models defined under src.
The directory structure of our project looks like this:
├── data <- Project data files
│
├── src <- Source code
│ ├── data <- Data directory
│ ├── models <- Model directory
│ ├── utils <- Utility directory
│
├── results.ipynb <- Our Notebook showing our main results
│
├── .gitignore <- List of files ignored by git
├── requirements.yml <- File for installing python dependencies
├── config.py <- File with the colors dictionary and abbreviations
└── README.md