Upsampling: Statistical biases in the dataset distribution #15

michaelfeil · 2024-04-11T15:53:04Z

I think there are some statistical biases in this implementation for long context engineering.

Concern 1:
For upsample mode, some dataset groups get filtered out once their capacity is maxed out. E.g., for --down_sample_mode=upsample_code_arxiv_book, the code, arxiv and book datasets will mostly end up at the end of our created synthetic dataset.
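To make the positional skew concrete, here is a minimal toy simulation. It is not the repository's sampling code: `build_order`, the quota numbers, and the group names `web` and `wiki` are all hypothetical, and the only assumption taken from this issue is that a group is dropped from sampling once its capacity is reached.

```python
import random

def build_order(quotas, seed=0):
    """Toy sampler (hypothetical): pick a random group that still has
    capacity, emit one sample from it, and drop the group once it is full."""
    rng = random.Random(seed)
    remaining = dict(quotas)
    order = []
    while remaining:
        group = rng.choice(sorted(remaining))
        order.append(group)
        remaining[group] -= 1
        if remaining[group] == 0:
            del remaining[group]  # group is "filtered" once maxed out
    return order

def mean_relative_position(order, group):
    """Average position of a group's samples, scaled to [0, 1]."""
    positions = [i for i, g in enumerate(order) if g == group]
    return sum(positions) / len(positions) / (len(order) - 1)

# Hypothetical capacities: the upsampled groups get a 4x larger quota.
quotas = {"web": 100, "wiki": 100, "code": 400, "arxiv": 400, "book": 400}
order = build_order(quotas)

for name in quotas:
    print(name, round(mean_relative_position(order, name), 2))
print("groups in the last 100 samples:", sorted(set(order[-100:])))
```

In this sketch the small groups run out early, so the tail of the sequence consists only of the upsampled groups, which is the pattern described above.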

Concern 2:
Start token_id 1. With the llama-tokenizer, when a single passage is tokenized, the output starts with <s>, i.e. token_id 1. When concatenating different pretokenized texts, the result is not the same as if the strings were joined first and then tokenized together.
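The mismatch can be sketched without the real tokenizer. `toy_tokenize` below is a hypothetical stand-in (one id per whitespace-separated word), not the llama tokenizer; it only mimics the behavior described above, where every tokenized passage starts with token id 1. `strip_inner_bos` is likewise a hypothetical helper showing one possible way to align the two results.

```python
BOS_ID = 1  # id of <s>, as stated in this issue

def toy_tokenize(text, add_bos=True):
    """Hypothetical stand-in tokenizer: one id (>= 2) per word,
    with BOS_ID prepended to mimic the described behavior."""
    ids = [2 + (sum(map(ord, word)) % 1000) for word in text.split()]
    return ([BOS_ID] if add_bos else []) + ids

def strip_inner_bos(token_lists):
    """Concatenate token lists, keeping only the first leading BOS."""
    merged = list(token_lists[0])
    for ids in token_lists[1:]:
        merged.extend(ids[1:] if ids and ids[0] == BOS_ID else ids)
    return merged

passages = ["def f(): pass", "We study long contexts."]

# Tokenize each passage separately, then concatenate the ids.
pretokenized = [toy_tokenize(p) for p in passages]
concatenated = [t for ids in pretokenized for t in ids]

# Join the strings first, then tokenize once.
joint = toy_tokenize(" ".join(passages))

print("BOS count, concatenated:", concatenated.count(BOS_ID))
print("BOS count, joint:", joint.count(BOS_ID))
print("equal after stripping inner BOS:", strip_inner_bos(pretokenized) == joint)
```

With a real subword tokenizer the tokens around each passage boundary may also differ, which this toy does not model. Whether an interior <s> is wanted as a document separator is a separate design choice.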
