This is the assignment for the Natural Language Processing course. It consists of analysing the SQuAD 2.0 dataset and then training a model on it. The dataset pairs a set of questions with Wikipedia articles. The task is to find the answer to each question in the corresponding article, or to respond that the question is unanswerable given the information available.
- Preliminary analysis
  - check the type and number of documents that the dataset contains;
  - calculate and visualise some simple statistics for the collection, e.g. the average document length and the average vocabulary size;
  - cluster the documents and visualise the clusters to see what types of groups are present (or whether the known classes can be found);
  - index the documents so that you can perform keyword search over them;
  - train a Word2Vec embedding on the data and investigate the properties of the resulting embedding.
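The statistics and keyword-search steps above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's implementation: the `documents` list is a hypothetical stand-in for passages loaded from the dataset, the regex tokenizer is a simplifying assumption, and the helper names (`tokenize`, `collection_stats`, `build_index`, `search`) are invented for this example. It reports the vocabulary size of the whole collection; a per-document average could be computed the same way.

```python
import re
from collections import defaultdict

# Hypothetical stand-in passages; in the assignment these would be the
# article paragraphs loaded from the SQuAD 2.0 files.
documents = [
    "The Normans were the people who gave their name to Normandy.",
    "Normandy is a region in France.",
    "Computational complexity theory classifies computational problems.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def collection_stats(docs):
    """Return document count, average length in tokens, and vocabulary size."""
    tokenized = [tokenize(d) for d in docs]
    vocab = {tok for toks in tokenized for tok in toks}
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    return {
        "num_docs": len(docs),
        "avg_doc_length": avg_len,
        "vocab_size": len(vocab),
    }


def build_index(docs):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for tok in tokenize(doc):
            index[tok].add(doc_id)
    return index


def search(index, query):
    """Return sorted ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(tok, set()) for tok in tokens))
    return sorted(matches)


stats = collection_stats(documents)
index = build_index(documents)
print(stats)
print(search(index, "Normandy France"))
```

The index here only supports boolean AND queries. Ranked retrieval (for example TF-IDF or BM25) would be a natural extension, but is outside this sketch.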
- Training models (BERT)
  - train a model to perform the task (by fine-tuning models on the training data);
  - test pre-trained models on the task (if they already exist);
  - evaluate the different models and compare their performance.
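To compare models, question-answering systems on this dataset are usually scored with exact match (EM) and token-level F1. The sketch below is modelled on that convention and is an assumption about how evaluation might be done here, not the assignment's prescribed method. An unanswerable question is represented by an empty list of gold answers, and a model abstains by predicting the empty string. The function names and the toy `references` and `predictions` dictionaries are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # If either side is empty (no answer), score 1 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average EM and F1 over all questions, taking the best gold answer each."""
    em_total, f1_total = 0.0, 0.0
    for qid, golds in references.items():
        golds = golds or [""]  # empty list means the question is unanswerable
        pred = predictions.get(qid, "")  # a missing prediction counts as abstaining
        em_total += max(exact_match(pred, g) for g in golds)
        f1_total += max(f1_score(pred, g) for g in golds)
    n = len(references)
    return {"exact": em_total / n, "f1": f1_total / n}


# Toy data: q1 is answerable, q2 is unanswerable.
references = {"q1": ["Paris"], "q2": []}
predictions = {"q1": "in Paris France", "q2": ""}
scores = evaluate(predictions, references)
print(scores)
```

Running `evaluate` once per model on the same `references` gives directly comparable numbers, for instance for a fine-tuned model and an existing pre-trained one.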