Non shuffled data access #6

kektobiologist · 2024-02-07T06:13:55Z

I looked at the Hindi monolingual corpus (this), and it seems to contain shuffled lines rather than contiguous news articles. This was specifically mentioned to be the case for the v1 public release here, but I can't find it mentioned for the v2 release. E.g., there are numbered points scattered at random through the file that are probably related to each other, but that context is lost due to the shuffling.
Is there a non-shuffled dataset available anywhere, or something with more metadata, like the scraping URL, date/time, etc.?
