This repository provides a foundation for building robust and scalable web scrapers using Selenium and Flask. It emphasizes best practices including environment management, configuration with Docker, and a well-structured project layout.
- Selenium Automation: Efficiently interact with dynamic webpages using Selenium's browser automation capabilities.
- Flask Backend: Create a RESTful API with Flask to manage scraper execution, authorization, and logging.
- Bearer Authentication: Implement a secure mechanism for API access using bearer tokens.
- Environment Management: Facilitate deployment across different environments (production, staging) using environment variables.
- Docker Configuration: Streamline containerization for a consistent and portable development experience.
- Logging System: Track scraper activities and errors for debugging and monitoring.
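As a rough illustration of the bearer authentication feature listed above, the check could look something like the sketch below. The function names `extract_bearer_token` and `is_authorized` are hypothetical and not taken from this repository; the sketch uses only the standard library and assumes the expected token is supplied by the caller (for example, read from an environment variable).

```python
import hmac
from typing import Optional


def extract_bearer_token(authorization_header: Optional[str]) -> Optional[str]:
    """Return the token from a 'Bearer <token>' header value, or None if malformed."""
    if not authorization_header:
        return None
    # Split "Bearer <token>" into the scheme and the token itself.
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check the presented token against the expected one (hypothetical helper)."""
    token = extract_bearer_token(authorization_header)
    if token is None or not expected_token:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(token.encode(), expected_token.encode())
```

In a Flask view, the header value would typically come from `request.headers.get("Authorization")`, and a failed check would usually result in a 401 response.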
Before diving in, ensure you have the following tools installed:
- Python (version 3.x recommended): Download and install from https://www.python.org/downloads/.
- .env file: Create a .env file to store environment variables (refer to .env.example for guidance).
- HTTP Client (Postman recommended): Use an HTTP client like https://www.postman.com/ to send requests to the Flask API.
- Chrome Browser: Download and install the latest version from https://www.google.com/chrome/.
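To make the role of the .env file concrete, here is a minimal sketch of how KEY=VALUE pairs could be read into the process environment using only the standard library. The helpers `parse_env_text` and `load_env_file`, and variable names such as `BEARER_TOKEN` and `ENVIRONMENT`, are assumptions for illustration; the actual keys are whatever .env.example defines, and the project may load them differently (for instance with a dedicated library).

```python
import os
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and lines without '='."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> Dict[str, str]:
    """Copy values from the file into os.environ without overriding existing ones."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env_text(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Because existing variables are not overridden, values set by the deployment environment (production, staging) would take precedence over the local file in this sketch.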
We use venv, Python's built-in module for creating isolated Python environments. It creates a folder containing the executables and packages that the project needs, separate from the system-wide installation.
python3 -m venv <whatever_virtual_environment_name>
source <whatever_virtual_environment_name>/bin/activate # for Unix/Linux
.\<whatever_virtual_environment_name>\Scripts\activate # for Windows
pip install -r requirements.txt
python3 main.py
The server is now accessible at http://localhost:3000
- First, add your .env file. You can use an invented bearer token to get started.
- Then configure the base URL in the utils/config.py file.
- To work on your project, add an endpoint to main.py.
- Next, create a controller and add the web actions it needs. Keep each action to a few steps so that the code stays modular and you do not have to repeat it later.
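The idea of small, reusable actions composed by a controller can be sketched as follows. All names here (`open_page`, `read_text`, `scrape_title_controller`) are hypothetical and not part of this repository. The `driver` argument is assumed to be a Selenium WebDriver instance; the sketch deliberately avoids importing Selenium so that the actions can also be exercised with a stand-in object.

```python
from typing import Any, Dict

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"


def open_page(driver: Any, base_url: str, path: str = "") -> None:
    """Action: navigate to a path under the configured base URL."""
    driver.get(base_url.rstrip("/") + "/" + path.lstrip("/"))


def read_text(driver: Any, selector: str) -> str:
    """Action: return the trimmed text of the first element matching a CSS selector."""
    return driver.find_element(CSS, selector).text.strip()


def scrape_title_controller(driver: Any, base_url: str) -> Dict[str, str]:
    """Controller: chain small actions and return a JSON-serializable result."""
    open_page(driver, base_url)
    return {"title": read_text(driver, "h1")}
```

Keeping each action to one or two driver calls means a new controller can usually be assembled from existing actions, and the dictionary it returns can be handed back from the endpoint added in main.py.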
├─── main.py # Entry point for the Flask application
├─── .vscode # Configuration for Visual Studio Code (optional)
├─── actions # Contains scraper actions (logic for data extraction)
├─── controller # Functions handling API requests
├─── temp_downloads # Temporary files created during scraping
└─── utils # Reusable helper functions
- Selenium Web Page: Main bot technology
- Selenium Tutorial
- Flask: Core technology for creating a REST API server