This project aims to defeat as many email obfuscation methods as possible in a single bot that crawls the web and harvests email addresses. It handles both common and uncommon obfuscation methods, including Cloudflare email protection, ROT ciphers, HTML entity encoding, RTL (right-to-left) obfuscation, JavaScript-based obfuscation, SVG-encoded emails, hex and Unicode escapes, addresses embedded in object and iframe elements, JavaScript hrefs, addresses split with comments, Base64 encoding, basic AJAX and API request obfuscation, and text-based obfuscation, with more planned.
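Several of the simpler methods listed above reduce to a one-step reversible transform. The sketch below illustrates four of them (ROT13, HTML entities, RTL reversal, Base64) using only the Go standard library. The function names `rot13`, `reverse`, and `decodeBase64` are hypothetical and chosen for illustration; they are not taken from this project's source.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"html"
	"strings"
)

// rot13 rotates ASCII letters by 13 places; all other runes pass through.
// Applying it twice returns the original text, so it both encodes and decodes.
func rot13(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return 'a' + (r-'a'+13)%26
		case r >= 'A' && r <= 'Z':
			return 'A' + (r-'A'+13)%26
		}
		return r
	}, s)
}

// reverse undoes RTL obfuscation, where the address is stored backwards
// in the markup and flipped visually with CSS.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// decodeBase64 returns the decoded text, or "" if s is not valid Base64.
func decodeBase64(s string) string {
	b, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Each line prints user@example.com.
	fmt.Println(rot13("hfre@rknzcyr.pbz"))
	fmt.Println(html.UnescapeString("user&#64;example&#46;com"))
	fmt.Println(reverse("moc.elpmaxe@resu"))
	fmt.Println(decodeBase64("dXNlckBleGFtcGxlLmNvbQ=="))
}
```

In practice a scraper would run candidate strings through each decoder and keep any result that looks like an address.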
- Email Extraction: Scrapes email addresses from HTML content.
- Obfuscation Handling: Decodes obfuscated emails, including JavaScript-based methods.
- Depth-based Crawling: Crawls through websites up to a specified depth, staying within the domain or subdirectories.
- Email Validation: Validates email addresses against known standards and checks DNS records for each domain.
- Logging: Outputs logs to a file for debugging and analysis.
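The email validation feature above combines a syntax check with a DNS lookup. As a rough sketch of how such a two-stage check might look in Go, the example below uses `net/mail` for syntax and `net.LookupMX` for the DNS step. The helper names `validSyntax` and `hasMX` are hypothetical, and the exact rules this project applies may differ.

```go
package main

import (
	"fmt"
	"net"
	"net/mail"
	"strings"
)

// validSyntax reports whether addr parses as a bare address (no display
// name or surrounding whitespace) and has a dot in its domain part.
func validSyntax(addr string) bool {
	a, err := mail.ParseAddress(addr)
	if err != nil || a.Address != addr {
		return false
	}
	at := strings.LastIndex(addr, "@")
	return strings.Contains(addr[at+1:], ".")
}

// hasMX reports whether the address's domain publishes at least one MX
// record. It performs a live DNS query, so it needs network access.
func hasMX(addr string) bool {
	at := strings.LastIndex(addr, "@")
	if at < 0 {
		return false
	}
	mx, err := net.LookupMX(addr[at+1:])
	return err == nil && len(mx) > 0
}

func main() {
	for _, c := range []string{"user@example.com", "Bob <bob@example.com>", "user@localhost", "not-an-email"} {
		fmt.Printf("%-24q syntax ok: %v\n", c, validSyntax(c))
	}
}
```

Running the syntax check first avoids a DNS round trip for strings that could never be addresses; `hasMX` would then be called only on the survivors.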
- Ensure Go is installed on your system (download Go if needed).
- Clone the repository or download the source code.
git clone https://github.com/Pythoript/email-scraper.git
cd email-scraper
- Install dependencies:
go mod tidy
- Compile the project:
go build -o run
- URL (required): The URL where the crawl starts.
- -v, --verbose: Enable verbose logging.
- --disable-cookies: Disable cookies during requests.
- --log <logfile>: Log output to the specified file.
- -o, --output <filename>: Output file to save scraped emails (default: emails.txt).
- --skip-validation: Skip the email validation.
- --user-agent <user-agent>: Custom User-Agent string for requests.
- --max-depth <depth>: Set the maximum crawling depth (default: 3).
- --domain-mode <mode>: Set crawling domain mode:
  - 1: Stay within the current site (default).
  - 2: Explore subdirectories.
  - 3: Unrestricted.
To run the crawler with verbose output, skip email validation, and save emails to a file:
./run https://example.com --verbose --skip-validation --output emails.txt
- Extracts emails from:
  - Normal email addresses found in the page content.
  - Obfuscated emails (like data-cfemail attributes).
  - Emails encoded in SVG images.
  - Emails obfuscated in JavaScript.
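For the data-cfemail case, the commonly documented Cloudflare scheme stores the address as a hex string whose first byte is an XOR key for the remaining bytes. The sketch below decodes that format under this assumption; `decodeCFEmail` is a hypothetical name and not necessarily how this project implements it.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeCFEmail decodes a data-cfemail attribute value, assuming the
// format "first byte is the XOR key, remaining bytes are the address".
func decodeCFEmail(encoded string) (string, error) {
	raw, err := hex.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	if len(raw) < 2 {
		return "", errors.New("cfemail value too short")
	}
	key := raw[0]
	out := make([]byte, len(raw)-1)
	for i, b := range raw[1:] {
		out[i] = b ^ key
	}
	return string(out), nil
}

func main() {
	// Key 0x2a followed by "a@b.co" XORed with that key.
	addr, err := decodeCFEmail("2a4b6a48044945")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(addr) // a@b.co
}
```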
The crawler supports multiple levels of recursion, allowing it to traverse deeper into a website. The --max-depth flag controls how many levels deep the crawler will go.
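A depth limit of this kind can be sketched as a breadth-first traversal that records each page's depth and stops expanding links once the limit is reached. The example below runs against an in-memory link map rather than the network so that it stays self-contained. The names `crawl` and `item` are hypothetical, and treating the start URL as depth 0 is an assumption; the project may count levels differently.

```go
package main

import "fmt"

// item pairs a URL with its distance (in links) from the start page.
type item struct {
	url   string
	depth int
}

// crawl visits pages breadth-first from start and follows links no more
// than maxDepth levels deep. links returns the outgoing links of a page.
func crawl(start string, maxDepth int, links func(string) []string) []string {
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)
		if cur.depth == maxDepth {
			continue // at the limit: visit the page but do not follow its links
		}
		for _, next := range links(cur.url) {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	// Toy site: "/deep" is three links from "/", so a depth of 2 skips it.
	site := map[string][]string{
		"/":        {"/about", "/team"},
		"/about":   {"/contact"},
		"/team":    {"/"},
		"/contact": {"/deep"},
	}
	pages := crawl("/", 2, func(u string) []string { return site[u] })
	fmt.Println(pages) // [/ /about /team /contact]
}
```

The `seen` map keeps cyclic links (here `/team` back to `/`) from being queued twice.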
Logs are generated for important actions, errors, and other debugging information. You can specify a log file using the --log flag.
- Add OCR support.
- Capture redirects to mailto.
- Support CSS pseudo-element encoding.
- Remove non-visible HTML elements.
This project is licensed under the AGPL-3.0 License - see the LICENSE file for details.