Non shuffled data access #6

Open
kektobiologist opened this issue Feb 7, 2024 · 0 comments

kektobiologist commented Feb 7, 2024

I looked at the Hindi monolingual corpus (this) and it appears to contain shuffled lines rather than contiguous news articles. This was specifically mentioned to be the case for the v1 public release here, but I can't find it mentioned for the v2 release. E.g., there are numbered points scattered randomly through the file that are probably related to each other, but that context is lost because of the shuffling.
Is there a non-shuffled dataset available anywhere, or a version with more metadata such as the scraping URL, date/time, etc.?
