EDGARDSRS: Tool for SEC 10-k files


License
Dependencies
Meta

Description:

EDGARDSRS is a Python library designed to clean and process SEC EDGAR 10-K filing HTML files. It removes unnecessary HTML elements, various types of noise/gibberish text, and extract tables with high numeric content to produce clean, readable text output suitable for analysis.

Features

HTML cleaning and text extraction
Removal of financial tables and numeric-heavy content
Extract financial tables
Elimination of noisy text and gibberish
Unicode normalization
Special character handling
Multiple HTML parser support (html.parser, lxml, html5lib)

Installation

pip install edgardsrs

Required dependencies:

beautifulsoup4
lxml
html5lib
unicodedata

Usage

Basic usage to clean a 10-K HTML file:

from edgardsrs import EdgarDSRS

analyzer = EdgarDSRS()

# Cleaning the file
input_file = "your_10k_file.html"
cleaned_file = analyzer.process_html_file(input_file)

Cleaning Process

The tool performs the following cleaning operations:

HTML Parsing: Attempts to parse HTML using multiple parsers (html.parser, lxml, html5lib)
Tag Removal: Strips all HTML tags while preserving text content
Unicode Normalization: Normalizes Unicode characters
Noise Removal:
- Removes sequences with high special character density
- Eliminates base64 encoded patterns
- Cleans up lines with excessive non-alphanumeric characters
Text Cleaning:
- Removes noisy words (mixed case with numbers, excessive length)
- Normalizes whitespace

Functions

`clean_html_content(html_content)`

Main function to clean HTML content and extract text.

text = EdgarDSRS.clean_html_content(html_content)

`extract_and_format_tables`

Function to extract tables.

soup = BeautifulSoup(html_content, "html.parser")
tables = extract_and_format_tables(soup)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Pratik Relekar | Xinyao Qian

Background

This library was developed at Data Science Research Services(University of Illinois at Urbana-Champaign) in 2024 and has been under active development since then.

Getting help

For general questions and discussions, visit DSRS mailing list.

Name		Name	Last commit message	Last commit date
Latest commit History 78 Commits
.github/workflows		.github/workflows
edgardsrs		edgardsrs
LICENSE		LICENSE
README.md		README.md
example.py		example.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EDGARDSRS: Tool for SEC 10-k files

Description:

Features

Installation

Usage

Cleaning Process

Functions

`clean_html_content(html_content)`

`extract_and_format_tables`

Contributing

License

Author

Background

Getting help

About

Releases 2

Packages

Contributors 2

Languages

License

pratikrelekar/EdgarDSRS

Folders and files

Latest commit

History

Repository files navigation

EDGARDSRS: Tool for SEC 10-k files

Description:

Features

Installation

Usage

Cleaning Process

Functions

clean_html_content(html_content)

extract_and_format_tables

Contributing

License

Author

Background

Getting help

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

`clean_html_content(html_content)`

`extract_and_format_tables`

Packages