pp-scraper

Portfolio Performance compatible scraper for Hungarian instruments.

Scraping runs every weekday around 18:00 UTC as a batch job covering all data sources, and the results are uploaded to the pp-data repository.

Primary features:

  • daily price data scraping in JSON format, one file per instrument, for the data sources listed below
  • optional historical price generation
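As a rough illustration of the "one JSON file per instrument" idea, the sketch below writes a list of date/price records to a file named after the instrument. The function name `write_prices`, the field names `date` and `price`, and the file layout are assumptions made for this example; the README does not specify the actual schema used in pp-data.

```python
import json
from pathlib import Path


def write_prices(out_dir, instrument, prices):
    """Write a {iso_date: price} mapping to <out_dir>/<instrument>.json.

    Hypothetical layout: a JSON array of {"date", "price"} objects,
    sorted by date. The real pp-data schema may differ.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain strings.
    records = [{"date": d, "price": p} for d, p in sorted(prices.items())]
    path = out_dir / f"{instrument}.json"
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path
```

A file in this shape can be read by any JSON-aware consumer; appending a new day's price would mean loading the array, adding one record, and writing it back.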

Implemented Spiders

| Data source name | Spider name | Notes |
| --- | --- | --- |
| Alfa | alfa_nyugdij | Can scrape historical data |
| Allianz | allianz_nyugdij | Can scrape historical data |
| Aranykor | aranykor | Scrapes historical data |
| Bamosz | bamosz | Supports historical scraping with Splash |
| Budapest | budapest_nyugdij | Can scrape historical data; scrapes VPF and PPF funds |
| Erste | erste_nyugdij | Can scrape historical data from a hand-crafted CSV |
| Honved | honved_nyugdij | Can scrape historical data |
| Horizont | horizont_nyugdij | Can scrape historical data |
| MÁK | mak | Scrapes only the latest data |
| MÁK | mak_historical | Scrapes historical data from the PDF report generator endpoint for a given time range. Uses Tesseract OCR to extract data from the PDF files. Best effort: the OCR sometimes makes mistakes when parsing tables |
| MBH | mbh_nyugdij | Can scrape historical data |
| OTP | otp_nyugdij | Can scrape historical data |
| Pannónia | pannonia_nyugdij | Can scrape historical data |
| Szövetség | szovetseg_nyugdij | Scrapes historical data from Excel |
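
The spider names above are what you would pass to a crawler invocation. The sketch below assumes the project is a standard Scrapy project (the "spider" terminology and the Splash dependency suggest this, but the README does not state it), in which case `scrapy crawl <name>` is the usual way to run a single spider. The helper `crawl_command` is hypothetical and only builds the command line after checking the name.

```python
# Spider names copied from the table of implemented spiders.
SPIDERS = {
    "alfa_nyugdij", "allianz_nyugdij", "aranykor", "bamosz",
    "budapest_nyugdij", "erste_nyugdij", "honved_nyugdij",
    "horizont_nyugdij", "mak", "mak_historical", "mbh_nyugdij",
    "otp_nyugdij", "pannonia_nyugdij", "szovetseg_nyugdij",
}


def crawl_command(spider):
    """Return the argv list for running one spider via the Scrapy CLI.

    Assumes a standard Scrapy project layout; raises ValueError for
    names that are not in the table.
    """
    if spider not in SPIDERS:
        raise ValueError(f"unknown spider: {spider!r}")
    return ["scrapy", "crawl", spider]
```

From the project root, the returned list could be handed to `subprocess.run(crawl_command("mak"), check=True)`. Any spider-specific arguments, for example a date range for `mak_historical`, are not documented here and are left out.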

Installation

For local execution you need to install the following packages.

Ubuntu

  1. Install the system packages: sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev python3-venv docker.io tesseract-ocr
  2. Install the Python dependencies: pip install -r requirements.txt