EDGARDSRS is a Python library for cleaning and processing SEC EDGAR 10-K filing HTML files. It removes unnecessary HTML elements and various types of noisy or gibberish text, and extracts tables with high numeric content, producing clean, readable text suitable for analysis.
- HTML cleaning and text extraction
- Removal of financial tables and numeric-heavy content
- Extraction of financial tables
- Elimination of noisy text and gibberish
- Unicode normalization
- Special character handling
- Multiple HTML parser support (html.parser, lxml, html5lib)
pip install edgardsrs
Required dependencies:
- beautifulsoup4
- lxml
- html5lib
- unicodedata (part of the Python standard library; no separate installation needed)
Basic usage to clean a 10-K HTML file:
from edgardsrs import EdgarDSRS
analyzer = EdgarDSRS()
# Cleaning the file
input_file = "your_10k_file.html"
cleaned_file = analyzer.process_html_file(input_file)
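When several filings need cleaning, the single-file call above can be wrapped in a small loop. The helper below is a minimal sketch and not part of the library: `clean_directory` is a hypothetical name, and it takes the processing function as an argument, so it assumes only that the function accepts a file path (as `process_html_file` does in the usage above).

```python
from pathlib import Path
from typing import Callable, Dict


def clean_directory(folder: str, process: Callable[[str], object]) -> Dict[str, object]:
    """Apply `process` to each .html/.htm file in `folder`.

    Hypothetical helper: returns a dict mapping file name to whatever
    `process` returned, and skips files that raise an exception.
    """
    results: Dict[str, object] = {}
    for path in sorted(Path(folder).iterdir()):
        # Only regular files with an HTML extension are processed.
        if not path.is_file() or path.suffix.lower() not in {".html", ".htm"}:
            continue
        try:
            results[path.name] = process(str(path))
        except Exception as exc:
            # One malformed filing should not stop the whole batch.
            print(f"Skipping {path.name}: {exc}")
    return results


# Assumed usage with the analyzer created above:
# results = clean_directory("filings", analyzer.process_html_file)
```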
The tool performs the following cleaning operations:
- HTML Parsing: Attempts to parse HTML using multiple parsers (html.parser, lxml, html5lib)
- Tag Removal: Strips all HTML tags while preserving text content
- Unicode Normalization: Normalizes Unicode characters
- Noise Removal:
  - Removes sequences with high special character density
  - Eliminates base64 encoded patterns
  - Cleans up lines with excessive non-alphanumeric characters
- Text Cleaning:
  - Removes noisy words (mixed case with numbers, excessive length)
  - Normalizes whitespace
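The text-level steps above (Unicode normalization, noise removal, and text cleaning) can be illustrated with a short standard-library sketch. This is not the library's implementation: `clean_text` and its helpers are hypothetical names, and every threshold and pattern is an assumption chosen for illustration.

```python
import re
import unicodedata

# Illustrative assumptions, not the library's actual values.
MAX_SPECIAL_RATIO = 0.5   # drop lines where over half the characters are symbols
MAX_WORD_LENGTH = 40      # drop tokens longer than this
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")  # long base64-like runs


def special_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that are not alphanumeric."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def is_noisy_word(word: str) -> bool:
    """True for overly long tokens or tokens mixing upper, lower, and digits."""
    if len(word) > MAX_WORD_LENGTH:
        return True
    return (
        any(c.isdigit() for c in word)
        and any(c.islower() for c in word)
        and any(c.isupper() for c in word)
    )


def clean_text(text: str) -> str:
    # NFKC folds compatibility characters (ligatures, non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    text = BASE64_RUN.sub(" ", text)
    kept = []
    for line in text.splitlines():
        if special_ratio(line) > MAX_SPECIAL_RATIO:
            continue  # line is mostly symbols
        words = [w for w in line.split() if not is_noisy_word(w)]
        if words:
            # Joining on single spaces also normalizes whitespace.
            kept.append(" ".join(words))
    return "\n".join(kept)
```

A heuristic like this will also drop some legitimate tokens (for example, a product code that mixes case and digits), so thresholds of this kind usually need tuning against real filings.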
Main function to clean HTML content and extract text.
text = EdgarDSRS.clean_html_content(html_content)
Function to extract tables.
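from bs4 import BeautifulSoup
# Parse the raw HTML, then extract the tables from the parsed document
soup = BeautifulSoup(html_content, "html.parser")
tables = extract_and_format_tables(soup)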
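To make "high numeric content" concrete, here is one possible way to decide whether a table is numeric-heavy. This is a sketch under stated assumptions rather than the logic inside `extract_and_format_tables`: the function names, the cell pattern, and the 0.5 threshold are all hypothetical, and the table is assumed to already be available as rows of cell strings.

```python
import re
from typing import List

# Matches cells such as 2023, $1,200, (45), -3.5 or 12%. Assumed pattern.
NUMERIC_CELL = re.compile(r"^\(?-?\$?\s*\d[\d,]*(\.\d+)?\s*%?\)?$")
MIN_NUMERIC_RATIO = 0.5  # assumed threshold for a "financial" table


def numeric_ratio(rows: List[List[str]]) -> float:
    """Share of non-empty cells that look like numbers."""
    cells = [c.strip() for row in rows for c in row if c.strip()]
    if not cells:
        return 0.0
    return sum(1 for c in cells if NUMERIC_CELL.match(c)) / len(cells)


def is_financial_table(rows: List[List[str]]) -> bool:
    return numeric_ratio(rows) >= MIN_NUMERIC_RATIO


income = [
    ["", "2023", "2022"],
    ["Revenue", "$1,200", "$1,050"],
    ["Net loss", "(45)", "(60.5)"],
]
index = [["Item", "Description"], ["1A", "Risk Factors"]]

print(is_financial_table(income))  # six of eight non-empty cells are numeric
print(is_financial_table(index))   # no numeric cells
```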
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Pratik Relekar | Xinyao Qian
This library was developed at Data Science Research Services (University of Illinois at Urbana-Champaign) in 2024 and has been under active development since then.
For general questions and discussions, visit the DSRS mailing list.