Benchmarked Resiliparse & added flag to evaluate parsers individually #25

Open · wants to merge 2 commits into master
Conversation

KhoomeiK

@KhoomeiK KhoomeiK commented Oct 7, 2024

Resiliparse is actively used by some AI labs to extract web data for LLM pre-training, but it has not been publicly benchmarked alongside many other similar web parsing tools. I've added an eval script for Resiliparse as well as its data output. I also added a flag to eval individual parsers separately.

@KhoomeiK KhoomeiK changed the title Benchmarked Resiliparse & added flag to evaluate an individual parser Benchmarked Resiliparse & added flag to evaluate parsers individually Oct 7, 2024
@lopuhin lopuhin (Member) left a comment


Thanks a lot for contributing a new extractor @KhoomeiK . I left a few small comments - also if you prefer I could merge your PR as-is and address them in another PR.

Besides that, do you mind also updating the README with the results of the new parser, adding a line at the end of the "Result of packages added after original evaluation" table?

try:
    extractor_module = importlib.import_module(f'extractors.run_{name}')
    extractor_module.main()
except:

I'd rather catch Exception here, e.g. see motivation in this (rejected) PEP https://peps.python.org/pep-0760/#motivation

Suggested change
except:
except Exception:
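As a sketch of why the narrower clause matters (the function and module names here are assumptions mirroring the snippet above, not the repo's exact code): `except Exception` still reports an extractor failure, but lets `KeyboardInterrupt` and `SystemExit` propagate, so Ctrl-C can actually stop a long benchmark run, whereas a bare `except:` would swallow them.

```python
import importlib

def run_extractor(name):
    """Import and run a single extractor module, reporting ordinary
    failures without swallowing KeyboardInterrupt/SystemExit."""
    try:
        extractor_module = importlib.import_module(f'extractors.run_{name}')
        extractor_module.main()
        return True
    except Exception as exc:  # narrower than a bare `except:`
        print(f'{name} failed: {exc}')
        return False
```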

metrics_by_name[name] = metrics
else:

I think it would be best to refactor the code in a way that does not lead to having to repeat the reporting. For example, we could pass args.parser to the evaluate function.
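One possible shape for that refactor (the `evaluate`/`report` names and the callable-per-parser registry are assumptions for illustration, not the repo's actual API): have `evaluate` accept an optional parser name, so both the "all parsers" and "single parser" paths go through one reporting call.

```python
def evaluate(parsers, selected=None):
    """Evaluate all parsers, or only the one named by `selected`,
    keeping the metrics reporting in a single code path."""
    names = [selected] if selected else list(parsers)
    metrics_by_name = {}
    for name in names:
        metrics_by_name[name] = parsers[name]()  # run one extractor
    report(metrics_by_name)                      # single reporting call
    return metrics_by_name

def report(metrics_by_name):
    for name, metrics in metrics_by_name.items():
        print(f'{name}: {metrics}')
```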
