This program scrapes job posting data from the site hr.gov.ge and saves it to a database.
bundle install
cp .env.example .env
- Fill in the `.env` variables. Unless you set ENVIRONMENT to production, emails will be sent to mailcatcher and you can use any fake email addresses you want.
- Create the MySQL database and database user specified in `.env`.
- Optional: Load data from the compressed database file (`hr.gov.ge.sql.gz`). See section The Data below.
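As a rough illustration of the ENVIRONMENT behaviour described above, the following sketch picks mail settings based on that variable. It is not taken from the project: `mail_settings`, `SMTP_ADDRESS`, and `SMTP_PORT` are hypothetical names, and the only facts assumed are that ENVIRONMENT is read from `.env` and that mailcatcher listens for SMTP on localhost port 1025 by default.

```ruby
# Hypothetical helper (not part of the project): chooses SMTP settings
# based on the ENVIRONMENT variable described in the setup steps.
def mail_settings(env = ENV)
  if env['ENVIRONMENT'] == 'production'
    # SMTP_ADDRESS and SMTP_PORT are assumed variable names for illustration.
    { address: env.fetch('SMTP_ADDRESS'), port: Integer(env.fetch('SMTP_PORT', '587')) }
  else
    # mailcatcher's default SMTP endpoint, so fake addresses are safe to use.
    { address: 'localhost', port: 1025 }
  end
end

# A plain Hash stands in for ENV here so the example does not depend on a real .env file.
p mail_settings({ 'ENVIRONMENT' => 'development' })
```

Passing a Hash instead of `ENV` keeps the helper easy to check without touching real configuration.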
- `rake scraper:run` -> Run the scraper!
- `rake scraper:schedule:run_daily` -> Schedule a cron job to run `rake scraper:run` every day at 4 AM.
There is no automated test suite, but you can use `rake scraper:test_run` to test the scraper manually. Differences from a normal run include:
- ENVIRONMENT cannot be set to production
- Only 20 ads will be scraped
- Emails will be sent to mailcatcher
- Database dump and status.json will not be pushed to github
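The first two differences could be enforced with a guard like the one below. This is only a sketch under assumptions: `prepare_test_run` and `TEST_RUN_AD_LIMIT` are hypothetical names, and the project's actual rake task may implement these rules differently.

```ruby
# Hypothetical constant: the test run scrapes at most 20 ads.
TEST_RUN_AD_LIMIT = 20

# Hypothetical helper: refuses to run against production and trims the
# list of ad IDs down to the test-run limit.
def prepare_test_run(ids, environment)
  if environment == 'production'
    raise ArgumentError, 'test runs cannot use the production environment'
  end

  ids.first(TEST_RUN_AD_LIMIT)
end

puts prepare_test_run((1..50).to_a, 'development').size
```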
When you run the scraper, the following happens:
- Choose Ads to Scrape: The scraper gets a list of all posting IDs in the database. It then gets all of the IDs in the list of jobs on hr.gov.ge and keeps only the IDs that are not already in the database, saving them to `ids_to_process` in `status.json`.
- Scrape Ads: Requests are sent to hr.gov.ge for the ads listed in `ids_to_process`, and the responses are saved as `data.json` files in the `data` folder.
- Save Ads to Database: The ad info in the new `data.json` files is saved to the database.
- Update Github with New Data: The database is dumped to `real-estate.sql.gz` and pushed to github, along with the new `status.json` file.
- Send Email Report on Scraper Run: A report about the scrape run, including basic statistics and logged errors, is sent to the recipient specified in the `.env` file.
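The first step, choosing which ads to scrape, amounts to a set difference between the IDs on the site and the IDs already stored. A minimal sketch, assuming integer IDs and a `status.json` that holds an `ids_to_process` array; the method names `ids_to_process` and `build_status` are hypothetical and the real file may contain more keys:

```ruby
require 'json'
require 'set'

# Hypothetical helper: returns the site IDs that are not yet in the database,
# preserving the order in which they appeared on the site.
def ids_to_process(site_ids, database_ids)
  known = database_ids.to_set
  site_ids.uniq.reject { |id| known.include?(id) }
end

# Hypothetical helper: builds the structure that would be written to status.json.
def build_status(site_ids, database_ids)
  { 'ids_to_process' => ids_to_process(site_ids, database_ids) }
end

# Sample IDs, made up for illustration.
site_ids = [101, 102, 103, 104]
database_ids = [101, 103]

puts JSON.pretty_generate(build_status(site_ids, database_ids))
```

Using a `Set` for the known IDs keeps the lookup cheap when the database already holds many postings.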
The database is pushed to the github repo with every scrape run. That means you can use what others have already scraped to seed your own database of hr.gov.ge data. However, because updating github is built into the app, you will have to do one of the following:
- Set up your own origin repo on github to receive your new scrape data.
- Disable pushes to github.
The following files contain code that is specific to hr.gov.ge.
- global_attributes.rb - Contains a list of all global variables used throughout the script, such as the search url, posting url, file paths, file names, and the list of labels that appear on the posting page.
- postings_database.rb - Creates the posting table if it does not exist. It is called automatically when the scraper runs.
- hr.gov.ge.rb - Defines how the IDs from the listing pages are gathered, calls each of these pages, and then starts the processing.
- process_response.rb - Sets how each field is pulled from the web page. For hr.gov.ge, most of the page follows the same format: the title is in a `dt` tag and the value is in the `dd` tag that follows it.
- utilities.rb - A collection of common methods used throughout the scraping process. Defines the json template for saving the web page data. Creates the SQL insert statement for adding data to the database.
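To make the `dt`/`dd` pattern and the insert-statement step concrete, here is a small sketch. It is not the project's code: `extract_fields` and `build_insert` are hypothetical helpers, the sample HTML, the `postings` table, and its column names are invented, and a regular expression is used only to keep the example dependency-free. A real HTML parser would be more robust on live pages.

```ruby
# Hypothetical helper: pairs each <dt> label with the <dd> value that follows it.
# The regex is a simplification that suits small, well-formed snippets only.
def extract_fields(html)
  pairs = html.scan(%r{<dt[^>]*>(.*?)</dt>\s*<dd[^>]*>(.*?)</dd>}m)
  pairs.to_h { |label, value| [label.strip, value.strip] }
end

# Hypothetical helper: builds a parameterized INSERT statement plus its values,
# so the values are bound by the database driver rather than interpolated.
def build_insert(table, record)
  columns = record.keys
  placeholders = Array.new(columns.size, '?').join(', ')
  sql = "INSERT INTO #{table} (#{columns.join(', ')}) VALUES (#{placeholders})"
  [sql, record.values]
end

# Sample markup, made up for illustration.
html = <<~HTML
  <dl>
    <dt>Position</dt>
    <dd>Analyst</dd>
    <dt>Deadline</dt>
    <dd>2015-01-31</dd>
  </dl>
HTML

fields = extract_fields(html)
record = { 'position' => fields['Position'], 'deadline' => fields['Deadline'] }
sql, values = build_insert('postings', record)

puts sql
p values
```

Returning the SQL and the values separately leaves escaping to whichever MySQL client library executes the statement.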