pp-scraper

Portfolio Performance compatible scraper for Hungarian instruments.

Scraping runs every weekday around 18:00 UTC as a batch job covering all data sources, and the results are uploaded to the pp-data repository.

Primary features:

daily price data scraping in JSON format, per instrument, for the data sources listed below
optional historical price generation
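
As a rough illustration of "JSON per instrument", the sketch below serializes one instrument's daily prices into a single JSON document. The field names (`name`, `currency`, `prices`, `date`, `price`), the function `to_instrument_json`, and the sample values are all hypothetical; the actual schema used by the scraper and the pp-data repository may differ.

```python
import json
from datetime import date


def to_instrument_json(name, currency, prices):
    """Serialize one instrument's daily prices to a JSON string.

    NOTE: the field names are assumptions for illustration only,
    not the project's real output schema.
    """
    records = [
        {"date": day.isoformat(), "price": price}
        for day, price in sorted(prices.items())  # oldest first
    ]
    document = {"name": name, "currency": currency, "prices": records}
    return json.dumps(document, ensure_ascii=False, indent=2)


# Made-up sample data, deliberately given out of order.
example = to_instrument_json(
    "Example Fund",
    "HUF",
    {date(2024, 1, 3): 1.2345, date(2024, 1, 2): 1.2301},
)
print(example)
```

Sorting by date keeps the output stable between runs, which helps when the files are committed to a repository and diffed.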

Implemented Spiders

Data source name	Spider name	Notes
Alfa	alfa_nyugdij	Can scrape historical data
Allianz	allianz_nyugdij	Can scrape historical data
Aranykor	aranykor	Scrapes historical data
Bamosz	bamosz	Supports historical scraping using Splash
Budapest	budapest_nyugdij	Can scrape historical data; scrapes VPF and PPF funds
Erste	erste_nyugdij	Can scrape historical data from a hand-crafted CSV file
Honved	honved_nyugdij	Can scrape historical data
Horizont	horizont_nyugdij	Can scrape historical data
MÁK	mak	Scrapes only latest data
MÁK	mak_historical	Scrapes historical data for a given time range from the PDF report generator endpoint. Uses Tesseract OCR to extract data from the PDF files. Best effort: the OCR occasionally misreads table contents
MBH	mbh_nyugdij	Can scrape historical data
OTP	otp_nyugdij	Can scrape historical data
Pannónia	pannonia_nyugdij	Can scrape historical data
Szövetség	szovetseg_nyugdij	Scrapes historical data from an Excel file
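
The spider names above can be used to start a single spider by name. The sketch below assumes the project is a standard Scrapy project run from its root directory, which this README does not state explicitly; the helpers `crawl_command` and `run_spider` are hypothetical and not part of the repository.

```python
import subprocess

# Spider names copied from the table above.
SPIDERS = (
    "alfa_nyugdij", "allianz_nyugdij", "aranykor", "bamosz",
    "budapest_nyugdij", "erste_nyugdij", "honved_nyugdij",
    "horizont_nyugdij", "mak", "mak_historical", "mbh_nyugdij",
    "otp_nyugdij", "pannonia_nyugdij", "szovetseg_nyugdij",
)


def crawl_command(spider, output=None):
    """Build a `scrapy crawl` command line for one spider.

    ASSUMPTION: the project is a Scrapy project; `-O` (overwrite the
    output file) requires Scrapy 2.1 or newer.
    """
    if spider not in SPIDERS:
        raise ValueError(f"unknown spider: {spider}")
    command = ["scrapy", "crawl", spider]
    if output is not None:
        command += ["-O", output]
    return command


def run_spider(spider, output=None):
    """Run the spider as a subprocess; raises if the crawl fails."""
    return subprocess.run(crawl_command(spider, output), check=True)
```

For example, `run_spider("mak", "mak.json")` would execute `scrapy crawl mak -O mak.json`. Spiders that need extra arguments, such as a time range for `mak_historical`, are not covered by this sketch.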

For local execution, install the required system packages first and then the Python dependencies:

sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev python3-venv docker.io tesseract-ocr
pip install -r requirements.txt
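
After installation, a quick check can confirm that the command-line tools provided by those packages are on the `PATH`. This is a minimal sketch, not part of the project: the helper `missing_tools` and the `REQUIRED_TOOLS` list are hypothetical, and the list assumes the usual executable names (`docker` from docker.io, `tesseract` from tesseract-ocr).

```python
import shutil

# Executables expected after the apt-get step (assumed names).
REQUIRED_TOOLS = ("python3", "docker", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The check only looks for executables; it does not verify versions or that the Docker daemon is running.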