Incremental Maven releases to Kafka

NOTE: This scraper no longer works because maven-repository.com is down.

This Python script scrapes maven-repository.com and forwards the releases it finds to Kafka. It takes four arguments: start_date, topic, bootstrap_servers and sleep_time. It scrapes all releases going back to start_date and pushes them to the Kafka topic on bootstrap_servers. It then repeats every sleep_time seconds, with start_date set to the date of the latest release already scraped. In other words, after the first run it only picks up incremental updates on Maven releases.
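The incremental loop described above can be sketched as follows. This is a minimal illustration and not the actual code of scraper.py: the names poll_once, run, fetch and publish are hypothetical, fetch stands in for the scraping step and publish for the Kafka send, and the sketch assumes each release carries a "date" field in the format shown under Sample data.

```python
import time
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def poll_once(fetch, publish, start_date):
    """Publish every release newer than start_date and return the new start_date.

    fetch and publish are hypothetical callables: fetch(start_date) returns
    an iterable of release dicts, publish(release) forwards one release.
    """
    latest = start_date
    for release in fetch(start_date):
        released_at = datetime.strptime(release["date"], DATE_FORMAT)
        if released_at > start_date:
            publish(release)
            latest = max(latest, released_at)
    return latest


def run(fetch, publish, start_date, sleep_time, max_rounds=None):
    """Repeat poll_once every sleep_time seconds (forever if max_rounds is None)."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # The date of the latest release seen becomes the next start_date.
        start_date = poll_once(fetch, publish, start_date)
        rounds += 1
        time.sleep(sleep_time)
    return start_date
```

Carrying the latest release date forward is what keeps later rounds from publishing the same release twice.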

Prerequisites

Install all dependencies:

python3 -m venv venv
. ./venv/bin/activate
pip install requests BeautifulSoup4 kafka-python

How To Run

usage: Scrape Maven releases to Kafka. [-h]
                                       start_date topic bootstrap_servers
                                       sleep_time

For example:

python scraper.py '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60

This scrapes all releases back to 2019-06-24 14:05:50, then keeps picking up incremental updates, and pushes them to the cf_mvn_releases topic on the Kafka broker at localhost:29092. Incremental updates are checked every 60 seconds.

Note: start_date must be in %Y-%m-%d %H:%M:%S format. Multiple bootstrap servers should be comma-separated. Sleep time is in seconds.
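As an illustration of these constraints, the arguments could be parsed and validated roughly like this. It is a sketch built on argparse to match the usage text above, not necessarily how scraper.py does it; the helper names parse_start_date, split_servers and build_parser are hypothetical, and treating sleep_time as an integer is an assumption.

```python
import argparse
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_start_date(value):
    """Convert the start_date argument, rejecting any other format."""
    try:
        return datetime.strptime(value, DATE_FORMAT)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"start_date must match {DATE_FORMAT!r}, got {value!r}"
        )


def split_servers(value):
    """Turn 'host1:9092,host2:9092' into a list of bootstrap servers."""
    return [server.strip() for server in value.split(",") if server.strip()]


def build_parser():
    parser = argparse.ArgumentParser(prog="Scrape Maven releases to Kafka.")
    parser.add_argument("start_date", type=parse_start_date)
    parser.add_argument("topic")
    parser.add_argument("bootstrap_servers", type=split_servers)
    # Assumption: sleep_time is a whole number of seconds.
    parser.add_argument("sleep_time", type=int)
    return parser
```

With this sketch, a malformed date makes argparse exit with a usage error before any scraping starts.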

Sample data

Data will be sent in the following format:

{
  "groupId": "com.g2forge.alexandria",
  "artifactId": "alexandria",
  "version": "0.0.9",
  "date": "2019-06-24 14:42:49"
}
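A record in this format could be serialized and sent with kafka-python along these lines. This is a hedged sketch: serialize_release, make_producer and publish are hypothetical names, and UTF-8 encoded JSON is an assumption about the wire format. KafkaProducer, its bootstrap_servers and value_serializer parameters, and send are part of kafka-python.

```python
import json


def serialize_release(release):
    """Encode one release dict as UTF-8 JSON bytes (assumed wire format)."""
    return json.dumps(release).encode("utf-8")


def make_producer(bootstrap_servers):
    """Create a producer that serializes every value with serialize_release."""
    # Imported here so the serializer can be used without a Kafka install.
    from kafka import KafkaProducer  # provided by the kafka-python package

    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=serialize_release,
    )


def publish(producer, topic, release):
    """Queue one release for delivery to the given topic."""
    producer.send(topic, release)
```

Because send is asynchronous, calling producer.flush() before the process exits makes sure queued records are actually delivered.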

Run in Docker

docker build -t mvn-scraper .
docker run mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60

Alternatively, run the wzorgdrager/mvn-scraper image directly:

docker run wzorgdrager/mvn-scraper '2019-06-24 14:05:50' cf_mvn_releases localhost:29092 60
