This project explores advanced text classification techniques within the scope of information retrieval. By implementing various methods such as TF-IDF, Naive Bayes, Word Embeddings, Latent Semantic Analysis (LSA), and Support Vector Machine (SVM), the study aims to enhance the precision and efficiency of text classification models.
- Document Preprocessing: Streamlines text data for better handling in classification tasks.
- Inverted Index Model: Utilized for efficient document retrieval.
- Naive Bayes Classifier: Implements probabilistic classification with a focus on textual data.
- Word Embeddings and LSA: Explores different embedding techniques to capture semantic meanings.
- SVM Classification: Applied on both regular and transformed (via LSA) text data to compare performance impacts.
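As a sketch of the inverted index idea mentioned above, the structure simply maps each term to the set of documents containing it, so retrieval avoids scanning every document. The function and sample documents below are illustrative, not taken from the project code:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["great movie", "terrible movie", "great acting"]
index = build_inverted_index(docs)
print(sorted(index["great"]))  # → [0, 2]
```

Looking up a term is then a dictionary access rather than a full scan, which is what makes retrieval over large review collections tractable.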
Clone this repository and install the required packages listed in requirements.txt:
git clone https://github.com/Amir-Entezari/Text-Classification-Enhancements
pip install -r requirements.txt
The Large Movie Review Dataset (often referred to as the IMDB dataset) is designed for use in binary sentiment classification, providing a substantial set of 25,000 highly polar movie reviews for training, and 25,000 for testing, making it suitable for developing a benchmark for sentiment analysis. The dataset contains additional unlabeled data for use as well. Each set of reviews is balanced with equal numbers of positive and negative reviews.
- Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
To download the dataset, use the following link: Large Movie Review Dataset
- Download the dataset using the link provided above.
- Extract the dataset using a file archiver that supports the .tar.gz format, or use the following commands in your terminal:
wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar -xzf aclImdb_v1.tar.gz
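Once extracted, the dataset is organized as aclImdb/{train,test}/{pos,neg} with one review per .txt file. A minimal loader along these lines (the function name is illustrative, not part of the project code) reads a split into parallel lists of texts and binary labels:

```python
from pathlib import Path

def load_reviews(split_dir):
    """Read reviews and labels from an aclImdb split directory.

    Expects subdirectories 'neg' and 'pos' containing one review
    per .txt file; returns texts and labels (0 = neg, 1 = pos).
    """
    texts, labels = [], []
    for label, subdir in enumerate(["neg", "pos"]):
        for path in sorted(Path(split_dir, subdir).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

# e.g. train_texts, train_labels = load_reviews("aclImdb/train")
```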
To run the project, navigate to the notebook directory and open the experiment.ipynb notebook:
jupyter notebook experiment.ipynb
- Document Preprocessing: Clean and prepare text data.
- Naive Bayes Classification: Train and test using the Bayesian probability model.
- Word Embeddings with SVM: Utilize different embeddings like Word2Vec, GloVe, and FastText with SVM.
- LSA with SVM: Apply dimensionality reduction before SVM classification to analyze impact on performance and training time.
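The Naive Bayes and LSA-with-SVM steps above can be sketched as scikit-learn pipelines. This is a minimal illustration on toy stand-in data, not the project's actual configuration (the real experiments run on the full IMDB reviews):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy stand-in for the IMDB reviews; labels: 1 = positive, 0 = negative.
texts = ["a great wonderful film", "an awful boring movie",
         "wonderful acting and a great plot", "an boring awful script"]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb.fit(texts, labels)
print(nb.predict(["a wonderful film"]))  # class prediction for a new review

# TF-IDF -> LSA (truncated SVD) -> linear SVM: the reduced feature
# space is what cuts SVM training time on large vocabularies.
lsa_svm = make_pipeline(TfidfVectorizer(),
                        TruncatedSVD(n_components=2),
                        LinearSVC())
lsa_svm.fit(texts, labels)
```

The key design point is that TruncatedSVD compresses the sparse TF-IDF matrix to a small dense matrix (here 2 components), so the SVM trains on far fewer features.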
The project demonstrates that:
- Naive Bayes is highly effective for the targeted text classification tasks.
- Word2Vec provides the best performance among the tested embedding models.
- LSA significantly reduces training time without substantially impacting accuracy.
Different text classification techniques offer varying levels of efficacy depending on the specific dataset and task requirements. Further research and testing with different configurations and larger datasets are recommended to optimize performance and generalizability.
Contributions to this project are welcome. Please fork the repository and submit a pull request with your suggested changes.
Distributed under the MIT License. See LICENSE for more information.