Author Name : Shreyansh Padarha
Email : [email protected]
This repository contains a collection of labs that explore various machine learning algorithms and techniques. Each lab focuses on a specific topic and provides detailed explanations, code examples, and analysis. The labs cover clustering, classification, and regression algorithms, hyperparameter tuning, data preprocessing, and various evaluation metrics.
Coverage
- The algorithms covered within the repository are Linear Regression (OLS), Linear Regression with Gradient Descent, Logistic Regression, Decision Tree, Random Forest, SVM (Support Vector Machines), KMeans Clustering, and Hierarchical Clustering (AGNES).
- The labs also cover various feature engineering, scaling, generation, and encoding techniques that help prepare data for ML solutions.
- Most labs also include manual hyperparameter tuning (not GridSearchCV), as it helps in better understanding the optimal parameters and how they are arrived at.
- All lab directories contain these files:
  - main_notebook.ipynb: the Jupyter notebook for that lab, containing all the relevant code and its outputs, with detailed line-by-line comments.
  - pdf_notebook.pdf: a PDF version of the Jupyter notebook.
  - README.md: a detailed markdown file containing an introduction, tasks, methods, results, and observations in line with the lab.
  - Additional Excel, CSV, and pickled files related to the lab.
NOTE
Please refer to each lab's individual markdown files for a more detailed explanation of the objectives, methods, and algorithms used. You will find code implementations, visualizations, and insights that will help you understand and explore different machine learning algorithms.
Feel free to explore and experiment with these labs to deepen your understanding of machine learning algorithms and their applications!
Given below are the directories within the repository, with their objectives and sample output screenshots in brief.
Brief
This lab aims to analyze the effectiveness of clustering algorithms in simplifying large datasets for machine learning. The specific focus is on comparing KMeans and Agglomerative Clustering methods. The objectives of this lab include:
- Downloading the "Car Evaluation" dataset from the UCI Repository.
- Finding the optimal number of clusters using the Elbow and Silhouette methods.
- Comparing KMeans and Agglomerative Clustering methods for clustering the dataset.
- Validating the optimal number of clusters.
- Tuning hyperparameters for KMeans (n_clusters, max_iter, init, algorithm).
- Tuning hyperparameters for Agglomerative Clustering (n_clusters, metric, linkage).
- Plotting the hierarchical clustering dendrogram.
- Comparing the better clustering algorithm with a classification algorithm.
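The search for the optimal number of clusters listed above can be sketched as a manual sweep over k. This is a minimal sketch, assuming scikit-learn is installed and using synthetic blobs as a stand-in for the encoded Car Evaluation data (the lab's own preprocessing is not reproduced here); all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

k_values = list(range(2, 7))
inertias = []        # Elbow method: within-cluster sum of squares per k
kmeans_scores = []   # Silhouette score per k for KMeans
agnes_scores = []    # Silhouette score per k for Agglomerative Clustering

for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    kmeans_scores.append(silhouette_score(X, km.labels_))

    agnes = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(X)
    agnes_scores.append(silhouette_score(X, agnes.labels_))

# The k with the highest silhouette score is the candidate optimum.
best_k_kmeans = k_values[int(np.argmax(kmeans_scores))]
best_k_agnes = k_values[int(np.argmax(agnes_scores))]
print("best k (KMeans):", best_k_kmeans)
print("best k (Agglomerative):", best_k_agnes)
```

Plotting `inertias` against `k_values` gives the elbow curve; the silhouette lists give a second, independent check on the same choice of k.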
Sample Output Screenshots
Brief
This lab focuses on understanding and implementing common preprocessing techniques and evaluation metrics from scratch. The objectives of this lab are:
- Loading different sheets of a dataset as Python DataFrames.
- Implementing user-defined functions for measures of central tendency (mean, median, mode).
- Scaling a list of numerical values between 0 and 1.
- Finding the percentile of a number in a given array.
- Categorizing data points into different categories and plotting them.
- Finding the correlation between two sets of values.
- Encoding and decoding a categorical variable into a numerical representation.
- Checking the goodness of fit for a regression model using evaluation metrics.
- Testing the user-defined functions and comparing them with sklearn.preprocessing functions.
- Performing additional tasks such as finding percentiles in a dataset, analyzing the impact of dataset size on regression evaluation metrics, and exploring the relationship between the squared Pearson correlation coefficient and the R2 value.
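A few of the from-scratch helpers listed above might look like the following. This is a sketch in plain Python, not the lab's actual code; the function names and sample values are hypothetical. It also illustrates the last point: for a simple linear regression fitted by least squares with an intercept, the squared Pearson coefficient equals R2.

```python
from statistics import mean

def min_max_scale(values):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up sample data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_ols(xs, ys)
predictions = [slope * x + intercept for x in xs]
r = pearson_r(xs, ys)
r2 = r_squared(ys, predictions)
print(min_max_scale([2, 4, 6]))
print(r ** 2, r2)
```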
Brief
In this lab, the focus is on customer segmentation and analysis to optimize marketing strategies. The objectives of this lab include:
- Identifying distinct customer segments based on characteristics and behaviors.
- Determining the optimal number of clusters using evaluation techniques like Silhouette analysis and the Elbow method.
- Evaluating the quality and validity of customer segmentation through measures such as silhouette scores.
- Utilizing Random Forest for classifying customers into different segments and identifying key differentiating features.
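The last objective, using Random Forest to find the features that differentiate segments, can be sketched as follows. This is an illustration under assumptions: the data is synthetic, the feature names are hypothetical, and scikit-learn is assumed to be installed. The cluster labels from KMeans serve as the classification target.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic customers (assumption): two informative features plus pure noise.
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
informative = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
noise = rng.normal(size=(300, 1))
X = np.hstack([informative, noise])
feature_names = ["annual_spend", "visit_frequency", "noise"]  # hypothetical

# Step 1: segment the customers.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a classifier to predict the segment, then inspect
# which features it relied on.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, segments)
importances = forest.feature_importances_

for name, score in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```

Features with near-zero importance contribute little to telling the segments apart, which is what makes this a useful check on the clustering.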
Sample Output Screenshots
Brief
This lab involves using a web-scraped dataset from carDekho to perform exploratory data analysis (EDA) and feature engineering. The objectives of this lab include:
- Feature scaling and encoding/transformation.
- Performing statistical data analysis on the dataset.
- Pickling transformations for future inverse transformations.
- Conducting EDA and formulating questions based on the dataset.
- Preparing the dataset for applying a linear regression model.
Brief
This lab focuses on implementing decision tree classifiers using scikit-learn's DecisionTreeClassifier. The tasks include:
- Performing classification on the Titanic dataset.
- Commenting on the accuracy of the model and the impact of different parameters.
- Comparing the results with a Dummy Classifier using different parameters.
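The comparison against a Dummy Classifier can be sketched as below. The Titanic data is not bundled with this example, so synthetic data from `make_classification` stands in for it (an assumption); the parameter values swept are illustrative, not the lab's.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and labels (assumption).
X, y = make_classification(
    n_samples=600,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    class_sep=2.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}

# Manual sweep over one tree parameter.
for depth in (1, 3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[f"tree(max_depth={depth})"] = tree.score(X_test, y_test)

# Baselines that ignore the features entirely.
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    results[f"dummy({strategy})"] = dummy.score(X_test, y_test)

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

A tree that does not clearly beat the dummy baselines has learned little from the features, which is why the baseline is worth reporting alongside the accuracy.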
Sample Output Screenshots
Brief
This lab dives into the implementation of linear regression using gradient descent. The objectives of this lab include:
- Implementing linear regression with gradient descent
- Visualizing the convergence of the model's loss toward its minimum
- Analyzing the effects of the number of iterations (epochs) and learning rate on the training process
- Comparing the model with ordinary least squares (OLS) based linear regression
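The objectives above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the lab's implementation; the learning rate, epoch count, and data are assumptions chosen so that batch gradient descent on the mean squared error converges to the closed-form OLS solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y = 3x + 2 plus a little noise.
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=x.shape)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept

def gradient_descent(X, y, learning_rate=0.1, epochs=5000):
    """Batch gradient descent on mean squared error."""
    weights = np.zeros(X.shape[1])
    losses = []
    n = len(y)
    for _ in range(epochs):
        residuals = X @ weights - y
        losses.append(float(np.mean(residuals ** 2)))
        gradient = (2.0 / n) * (X.T @ residuals)
        weights -= learning_rate * gradient
    return weights, losses

w_gd, losses = gradient_descent(X, y)

# Closed-form least-squares solution for comparison.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", w_gd)
print("OLS:             ", w_ols)
```

Plotting `losses` against the epoch index shows the convergence curve; rerunning with a different `learning_rate` or `epochs` shows how each affects training.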
Sample Output Screenshots
Brief
In this lab, you will find a comprehensive implementation of evaluation metrics for binary classification problems. The evaluation metrics implemented from scratch include:
- Confusion matrix
- Accuracy score
- Precision
- Recall
- F1 score
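The metrics listed above all derive from the four confusion-matrix counts. The following is a sketch in plain Python with hypothetical function names and made-up labels; the lab's own implementation may be structured differently.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Of everything predicted positive, how much was correct?
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Of everything actually positive, how much was found?
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("confusion matrix:", [[tn, fp], [fn, tp]])
print("accuracy:", accuracy(tp, fp, fn, tn))
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
print("f1:", f1_score(tp, fp, fn))
```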
Sample Output Screenshots
Brief
This lab focuses on a comparative study of linear regression models using the carDekho cleaned dataset. The objectives of this lab include:
- Performing exploratory data analysis (EDA) on the dataset
- Implementing linear regression using the scikit-learn library
- Exploring the variations of linear regression available in the statsmodels library
- Comparing the results of different models
- Predicting car prices using the best model
Brief
This lab covers logistic regression and support vector machines (SVMs) for classification tasks. The objectives of this lab include:
- Creating classification datasets with informative features
- Applying feature scaling techniques to enhance model performance
- Comparing the suitability of logistic regression and SVMs on the datasets
- Evaluating the impact of applied transformations on the classification results
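The workflow above can be sketched with scikit-learn as follows. The dataset parameters, the inflated feature, and the choice of `StandardScaler` are assumptions made for illustration; the lab may use other scalers or datasets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Generated dataset with informative features (parameters are assumptions).
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)
X[:, 0] *= 1000.0  # inflate one feature's scale to make scaling matter

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logreg (raw)": LogisticRegression(max_iter=5000),
    "logreg (scaled)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=5000)
    ),
    "svm (raw)": SVC(kernel="rbf"),
    "svm (scaled)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracy[name]:.3f}")
```

Wrapping the scaler and the classifier in one pipeline ensures the scaler is fitted on the training split only, so the test accuracy is not influenced by test-set statistics.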