GitHub - nashvillest/nashvillest_wayback: Nashvillest Content from Wayback Machine

Nashvillest Content from Wayback Machine

So, funny story. A bunch of data ~~may have been~~ was definitely lost in a server crash.

This is an attempt to piece the content back together. It uses a Ruby gem (wayback_machine_downloader) to extract the site content from the Internet Archive's Wayback Machine.

Scraping the Content

Requirements:

bundler

bundle install
bundle exec wayback_machine_downloader http://nashvillest.com --exclude "/^http:\/\/nashvillest.com(:80)?\/(tag|page|category|\?)\/.*/"

This takes a few hours to complete. The regular expression above excludes tag, page, and category listing pages, which would end up duplicating article content and are unlikely to be useful for data extraction.
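To see which URLs the exclude pattern would skip, the same regular expression can be tried against a few sample URLs. This is only an illustrative sketch in JavaScript (the gem itself evaluates the pattern in Ruby), and the sample URLs and the `isExcluded` helper are hypothetical, not taken from the archive or the repository.

```javascript
// Same pattern as the --exclude argument, written as a JavaScript literal.
const exclude = /^http:\/\/nashvillest.com(:80)?\/(tag|page|category|\?)\/.*/;

// Hypothetical URLs, chosen only to illustrate the pattern.
const samples = [
  'http://nashvillest.com/tag/music/',
  'http://nashvillest.com:80/category/news/',
  'http://nashvillest.com/page/2/',
  'http://nashvillest.com/2010/05/01/some-article/',
  'http://nashvillest.com/?p=123',
];

// Returns true when the pattern matches, i.e. the URL would be skipped.
function isExcluded(url) {
  return exclude.test(url);
}

for (const url of samples) {
  console.log((isExcluded(url) ? 'skip ' : 'keep ') + url);
}
```

In this evaluation the `\?` alternative only matches when the question mark is immediately followed by a slash, so a query-string URL such as `/?p=123` is kept rather than skipped.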

Parsing Out the Articles

Requirements:

Node.js (4.4+)

Parse everything:

npm install
find . -type f -name 'index.html' -exec node parse.js {} \;
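Where `find` is not available, the directory walk might be approximated in Node itself. The sketch below is an assumption about one way to do that, not code from this repository: `findFiles` is a hypothetical helper, the demo runs against a throwaway temporary directory, and it uses `fs.mkdtempSync`, which needs a newer Node than 4.4.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Recursively collect every regular file named `name` under `dir`.
function findFiles(dir, name) {
  let found = [];
  for (const entry of fs.readdirSync(dir)) {
    const full = path.join(dir, entry);
    const stat = fs.statSync(full);
    if (stat.isDirectory()) {
      found = found.concat(findFiles(full, name));
    } else if (stat.isFile() && entry === name) {
      found.push(full);
    }
  }
  return found;
}

// Demo: build a small throwaway tree with two index.html files.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'wayback-demo-'));
for (const slug of ['post-one', 'post-two']) {
  fs.mkdirSync(path.join(root, slug));
  fs.writeFileSync(path.join(root, slug, 'index.html'), '<html></html>');
}
fs.writeFileSync(path.join(root, 'post-one', 'style.css'), '');

const pages = findFiles(root, 'index.html');
pages.forEach((page) => console.log(page));

// Each path could then be handed to the parser, for example with
// require('child_process').execFileSync('node', ['parse.js', page]).
```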

Parse a single path:

npm install
node parse.js <path to article>

Generating the Import File

npm install
node import > export/wxr-import.xml
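The output file name suggests WXR (WordPress eXtended RSS), the XML format that the WordPress importer reads. As a rough idea of what such a file contains, here is a minimal sketch that builds one from an array of articles. It is not the repository's import.js: the `article` fields (`title`, `author`, `date`, `html`) and the helper names are assumptions made for illustration, and a real export normally carries more elements per item.

```javascript
// Escape XML special characters for use in element text.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Wrap content in CDATA, splitting any "]]>" so it cannot end the section early.
function cdata(text) {
  return '<![CDATA[' + String(text).replace(/\]\]>/g, ']]]]><![CDATA[>') + ']]>';
}

// Build one <item> from a hypothetical article object.
function wxrItem(article) {
  return [
    '    <item>',
    '      <title>' + escapeXml(article.title) + '</title>',
    '      <dc:creator>' + cdata(article.author) + '</dc:creator>',
    '      <content:encoded>' + cdata(article.html) + '</content:encoded>',
    '      <wp:post_date>' + escapeXml(article.date) + '</wp:post_date>',
    '      <wp:post_type>post</wp:post_type>',
    '      <wp:status>publish</wp:status>',
    '    </item>',
  ].join('\n');
}

// Wrap the items in an RSS channel with the WXR namespaces used above.
function wxrDocument(articles) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0"',
    '  xmlns:content="http://purl.org/rss/1.0/modules/content/"',
    '  xmlns:dc="http://purl.org/dc/elements/1.1/"',
    '  xmlns:wp="http://wordpress.org/export/1.2/">',
    '  <channel>',
    '    <wp:wxr_version>1.2</wp:wxr_version>',
    articles.map(wxrItem).join('\n'),
    '  </channel>',
    '</rss>',
  ].join('\n');
}

// Hypothetical article, for illustration only.
const sampleXml = wxrDocument([
  { title: 'Food & Drink', author: 'editor', date: '2010-05-01 12:00:00', html: '<p>Hello</p>' },
]);
console.log(sampleXml);
```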

