The Citation Extractor and Classifier (CEC) is a tool that automatically annotates in-text citations in academic papers provided as PDFs. Citations can be classified with two main ensemble models, one that utilizes section titles and one that does not; a mixed strategy that chooses between the two on a per-sentence basis is also available (and suggested). Within the API, these are selected as WoS mode (the model without section titles), WS mode (the model with section titles), and M mode (the mixed model).
This page describes the Citation Intent Classifier (CIC) component, which identifies the citation intent of one or more citations given as input. Citations are classified according to the CiTO ontology; four classes are currently recognized: UsesMethodIn, ObtainsBackgroundFrom, UsesConclusionsFrom, and CitesForInformation.
Classification is carried out by an Ensemble Model: a combination of six binary classifiers (in the Beta release) and a meta-classifier built on top of them. The meta-classifier carries out the voting process and returns the final classification result. In addition, a confidence threshold of 90% filters out results on which the ensemble is not confident enough.
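The voting-plus-threshold step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the six binary classifiers are stood in for by per-class probability dictionaries, and simple averaging plays the role of the meta-classifier.

```python
# Hypothetical sketch of the ensemble's decision step: six binary
# classifiers each emit a probability per class; a stand-in
# meta-classifier (plain averaging here) picks the final label,
# which is kept only when confidence reaches the 90% threshold.
CLASSES = ["UsesMethodIn", "ObtainsBackgroundFrom",
           "UsesConclusionsFrom", "CitesForInformation"]
THRESHOLD = 0.90

def vote(binary_scores):
    """binary_scores: list of six dicts mapping class name -> probability."""
    avg = {c: sum(s[c] for s in binary_scores) / len(binary_scores)
           for c in CLASSES}
    label = max(avg, key=avg.get)
    confidence = avg[label]
    # Results below the confidence threshold are filtered out.
    return (label, confidence) if confidence >= THRESHOLD else (None, confidence)
```

In the real ensemble the meta-classifier is itself a trained model rather than an average, but the filtering behaviour at 90% confidence is the same idea.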
The baseline model surpasses the current SOTA Macro-F1 score for the citation intent classification task on the SciCite dataset.
The tool can classify any number of input sentences, provided either as a list of tuples or as a JSON file, and lets you download the results in JSON format.
You can select one of three working modes:
- With Sections: select this mode if ALL your sentences also include the title of the section in which the citation is contained;
- Without Sections: select this mode if NONE of your sentences includes the title of the section in which the citation is contained, or if you want to try a classification based purely on the semantics of the sentence at hand;
- Mixed: select this mode if SOME of your sentences include the section title and others do not. The tool carries out the entire filtering process and returns the results.
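To make the three modes concrete, here is an illustrative sketch of how a batch of input sentences might be laid out and matched to a mode. The tuple layout and the `pick_mode` helper are assumptions for illustration, not the tool's documented schema.

```python
# Hypothetical input layout: each citation sentence is a
# (sentence, section_title) tuple, with None standing in for a
# missing section title.
sentences = [
    ("We follow the procedure described in [1].", "Methods"),
    ("Prior work has explored this topic extensively [2].", None),
]

def pick_mode(batch):
    """Illustrative logic choosing the working mode that fits a batch."""
    titles = [title for _, title in batch]
    if all(titles):
        return "With Sections"   # every sentence carries a section title
    if not any(titles):
        return "Without Sections"  # no sentence carries a section title
    return "Mixed"               # some do, some don't
```

A batch like `sentences` above, where only some tuples carry a section title, would fall under the Mixed mode.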
The leaderboard is based on Macro-F1 scores of the models tested on the test set of the SciCite dataset. Highlighted models are the classifiers resulting from this project. The WS models utilize section titles to classify citation sentences, while the WoS models do not and classify raw citation sentences. Models are also presented as different outputs of the Alpha (described here) and Beta (described here) releases.
| # | Model | Macro-F1 Score | Accuracy Score |
|---|---|---|---|
| 1 | EnsIntWS - Beta Release | 89.46 | 90.75 |
| 2 | EnsCICWS - Alpha Release | 88.99 | 90.32 |
| 3 | ImpactCite | 88.93 | \ |
| 4 | EnsIntWoS - Beta Release | 88.48 | 89.73 |
| 5 | EnsCICWoS - Alpha Release | 87.75 | 88.86 |
| 6 | CitePrompt | 86.33 | 87.56 |
| 7 | SciBERT | 86.32 | \ |
Final Release:
- Release new template for the web application
- Add a better threshold definition mechanism for the classifiers
- Release API:
  - Write API
  - Add support for compressed files and folders
  - Write documentation
  - Write usage examples

Beta Release:
- Add structured README.md
- Add Changelog
- Publish article
- Update base software:
  - Update ensemble models
  - Improve classifier score

Alpha Release:
- Release web application:
  - Design web interface
  - Develop and publish classification model
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have any suggestions that would make this project better, please fork the repo and create a pull request. If that sounds too complex, you can simply open an issue with the tag "enhancement". Don't forget to give the project a star!
Distributed under the ISC License. See LICENCE for more information.