I think there are some statistical biases in this implementation for long-context data engineering.
Concern 1:
For `upsample` mode, some dataset groups get filtered out once their capacity is maxed out. E.g., for `--down_sample_mode=upsample_code_arxiv_book`, the code, arxiv, and book documents will mostly end up at the end of the created synthetic dataset.
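Here is a toy sketch of the effect I mean (group names and capacities are made up for illustration, not the script's actual values):

```python
import random

random.seed(0)

# Upsampled groups get larger capacities than the rest
# (names and numbers are hypothetical).
capacity = {"code": 30, "arxiv": 30, "book": 30, "cc": 10, "wiki": 10}
taken = {group: 0 for group in capacity}

# Incoming stream of mixed documents, represented here by group labels.
stream = random.choices(list(capacity), k=500)

kept = []
for group in stream:
    # A group stops contributing once its capacity is maxed out.
    if taken[group] < capacity[group]:
        taken[group] += 1
        kept.append(group)

# The small-capacity groups fill up early in the stream, so the tail of
# `kept` is dominated by the upsampled groups -- a positional bias.
print(kept[-15:])

# One possible fix: shuffle the filtered dataset before writing it out.
random.shuffle(kept)
```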
Concern 2:
Start token_id 1. With the Llama tokenizer, when a single passage is tokenized, it starts with `<s>`, i.e. token_id 1. Concatenating different pre-tokenized texts therefore does not give the same result as concatenating the strings and tokenizing them together: a BOS token is inserted at every passage boundary.
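A minimal repro, assuming the Hugging Face `transformers` Llama tokenizer (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

a = "first passage."
b = "second passage."

# Encoding each passage separately prepends BOS (<s>, id 1) to both,
# so the concatenation contains a stray BOS in the middle.
concat_pretokenized = tok.encode(a) + tok.encode(b)

# Encoding the joined string yields a single BOS at the start, and the
# subword merges around the boundary can differ as well.
tokenized_together = tok.encode(a + " " + b)

print(concat_pretokenized)   # [1, ..., 1, ...]
print(tokenized_together)    # [1, ...]
assert concat_pretokenized != tokenized_together
```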