Predicting the Language of GitHub Repositories by README using NLP

About the Project

Goals

Using natural language processing, web scraping, and classification, we aim to create a machine learning model to predict the primary programming language of a given repository on GitHub, based on the contents of its README.
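To make the goal concrete, here is a minimal sketch of text classification on README contents. It is not the project's model: it swaps in a small pure-Python multinomial naive Bayes classifier in place of whatever scikit-learn estimator the notebook uses, and the toy READMEs, labels, and function names (`train`, `predict`) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Fit a multinomial naive Bayes model on whitespace-tokenized text."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter(labels)      # label -> number of documents
    for text, label in zip(docs, labels):
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothing avoids log(0)
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy READMEs, for illustration only
readmes = [
    "install with pip and run the script in a virtualenv",
    "pip install the package then import it",
    "install with npm and run node index",
    "npm start launches the node server",
]
languages = ["Python", "Python", "JavaScript", "JavaScript"]

model = train(readmes, languages)
print(predict(model, "pip install this package"))
print(predict(model, "run npm install then node app"))
```

With real data, the same idea applies to the cleaned README text, with the repository's primary language as the label.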

Background

GitHub automatically shows the percentage of each programming language used in a repository's files. In this project, we label each repository with only its primary language and predict that label. The languages include Java, Python, JavaScript, Ruby, HTML, and C++.
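As a rough illustration of what "primary language" means here, the sketch below picks the largest entry from a language-to-size mapping, the shape in which GitHub reports a repository's language breakdown. The function names and the byte counts are invented, and treating the largest share as the primary language is an assumption about how the label is derived.

```python
def primary_language(byte_counts):
    """Return the language with the most bytes, or None if there are none."""
    if not byte_counts:
        return None
    return max(byte_counts, key=byte_counts.get)


def language_percentages(byte_counts):
    """Convert raw byte counts into percentages rounded to one decimal."""
    total = sum(byte_counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in byte_counts.items()}


# Invented byte counts, shaped like a language -> bytes mapping
example = {"Python": 52000, "HTML": 9000, "JavaScript": 4000}
print(language_percentages(example))
print(primary_language(example))
```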

Deliverables

A well-documented Jupyter Notebook that contains our analysis
A Google Slides presentation suitable for a general audience that summarizes our findings and includes visualizations

Data Dictionary

Feature Name	Description	Data Type
repo	The path portion of the repository's URL, in the form user/repository-name. Append it to https://github.com/ to get the repo's full URL.	object
language	The primary programming language of a repository, according to GitHub's auto-analysis.	object
readme_contents	Messy, uncleaned text from a repo's README file as a single string.	object
stemmed	The readme_contents text with each word stemmed, i.e. reduced to a common stem so that 'call', 'called', and 'calling' are treated as the same word, which reduces dimensionality. Stems are not the same as root words and do not always appear in the dictionary.	object
lemmatized	Similar to stemmed, but reduces each word to its root word (lemma), which will always appear in the dictionary.	object
clean	Lemmatized readme_contents with stopwords removed.	object
stopwords_removed	The number of stopwords that were removed from the clean text.	int64
doc_length	How long a repo's README is.	int64
words	The clean text in array form.	object
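The derived columns above can be sketched roughly as follows. This is a simplified stand-in, not the code in prepare.py: it skips stemming and lemmatization, uses a tiny hard-coded stopword set instead of a full list, and assumes doc_length is counted in words. The function name `prepare_row` is hypothetical.

```python
import re

# Tiny stand-in stopword list; a real run would use a fuller one
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "this"}


def prepare_row(readme_contents):
    """Build simplified versions of the derived columns for one README."""
    # Lowercase, then replace anything that is not a letter, digit, or space
    text = re.sub(r"[^a-z0-9\s]", " ", readme_contents.lower())
    tokens = text.split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return {
        "clean": " ".join(kept),
        "stopwords_removed": len(tokens) - len(kept),
        "doc_length": len(kept),  # assumption: length counted in words
        "words": kept,
    }


row = prepare_row("# Demo\nThis is a demo of the parser, written in Python!")
print(row)
```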

Tools & Requirements

Python v3.8.5 (including the WordCloud, NLTK, and scikit-learn packages)
GitHub's API

License & Reproduction

Anyone can reproduce this project. All we ask is that you credit us if you use our work as part of your own project.

1. Clone this repo.
2. Acquire the data:
- a. In your GitHub account settings, generate a personal access token. You do not need to select any scopes, i.e. leave all the checkboxes unchecked.
- b. Save the token in your env.py file under the variable github_token.
- c. Add your GitHub username to your env.py file under the variable github_username.
- d. Add more repositories to the REPOS list if you so choose.
3. Add any extra stopwords in the prepare.py file.
4. Run the code in the nlp_model Jupyter Notebook.
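For the acquire step, an env.py file might look like the sketch below. The two variable names come from the steps above; the placeholder values, the helper functions `build_headers` and `repo_url`, and the example repository are assumptions added for illustration and are not taken from acquire.py.

```python
# env.py (keep this file out of version control)
github_token = "your-personal-access-token"  # placeholder, not a real token
github_username = "your-github-username"     # placeholder


# Hypothetical helpers showing how the two values might be used
def build_headers(token, username):
    """Build request headers for authenticated GitHub API calls."""
    return {"Authorization": f"token {token}", "User-Agent": username}


def repo_url(repo):
    """Turn a 'user/name' value from the repo column into a full URL."""
    return f"https://github.com/{repo}"


print(build_headers(github_token, github_username))
print(repo_url("octocat/Hello-World"))  # example repository, for illustration
```

Keeping the token in a separate file means it never has to appear in the notebook or in committed code.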

Creators

Kwame V. Taylor, Data Scientist
Adam Gomez, Data Scientist

