Error when using the max_samples parameter to set sample counts for multiple datasets #6349

Zbaoli · 2024-12-16T09:14:02Z

Reminder

I have read the README and searched the existing issues.

System Info

- `llamafactory` version: 0.9.2.dev0
- Platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.31
- Python version: 3.10.15
- PyTorch version: 2.5.1+cu124 (GPU)
- Transformers version: 4.46.1
- Datasets version: 3.1.0
- Accelerate version: 1.0.1
- PEFT version: 0.12.0
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 3090
- DeepSpeed version: 0.15.4
- vLLM version: 0.6.4.post1

Reproduction

Configuration file:

dataset: evol_instruct_zh_gpt4,identity,belle_1k
max_samples: 9000,1000,1000

Expected behavior

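The parameter documentation says that max_samples specifies the maximum number of samples for each dataset, as a comma-separated list.
However, the configuration above raises an error: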

[rank2]:     max_samples = min(data_args.max_samples, len(dataset))
[rank2]: TypeError: '<' not supported between instances of 'int' and 'str'

I traced it to this place in the code:

def _load_single_dataset(...):
    ...
    if data_args.max_samples is not None:  # truncate dataset
        max_samples = min(data_args.max_samples, len(dataset))
        dataset = dataset.select(range(max_samples))
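The failure can be reproduced outside the library. The sketch below assumes that the comma-separated value from the YAML file reaches the truncation code as a plain string; `truncate_size` is a hypothetical stand-in for the `min(...)` call above, not a LLaMA-Factory function, and the dataset length of 5000 is made up.

```python
def truncate_size(max_samples, dataset_len):
    """Stand-in for the truncation step: cap the dataset at max_samples rows."""
    return min(max_samples, dataset_len)


# Assumption: "9000,1000,1000" is not split or cast, so it stays a str.
raw_value = "9000,1000,1000"

error_message = None
try:
    truncate_size(raw_value, 5000)
except TypeError as exc:
    # min() cannot order an int against a str, so it raises TypeError.
    error_message = str(exc)
    print("TypeError:", error_message)

# With a single integer the same call works and caps at the dataset length.
capped = truncate_size(9000, 5000)
print(capped)
```

This suggests that max_samples behaves as a single integer applied to every dataset rather than as a per-dataset list.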

Others

No response


hiyouga · 2024-12-17T09:59:49Z

Use num_samples instead of max_samples
https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README_zh.md
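As a rough sketch of what this suggestion could look like, the snippet below adds a `num_samples` field to each dataset's entry, assuming (per the linked README) that `num_samples` is set per dataset in `dataset_info.json`. The `file_name` values and the helper `add_num_samples` are hypothetical placeholders for illustration; the real entries in `data/dataset_info.json` may carry different fields.

```python
import json

# Per-dataset caps taken from the configuration in this issue.
caps = {"evol_instruct_zh_gpt4": 9000, "identity": 1000, "belle_1k": 1000}


def add_num_samples(dataset_info, caps):
    """Return a copy of dataset_info with num_samples set where a cap is given."""
    updated = {}
    for name, entry in dataset_info.items():
        new_entry = dict(entry)  # copy so the input is left untouched
        if name in caps:
            new_entry["num_samples"] = caps[name]
        updated[name] = new_entry
    return updated


# Placeholder entries (hypothetical file names, not the real registry).
dataset_info = {
    "evol_instruct_zh_gpt4": {"file_name": "evol_instruct_zh_gpt4.json"},
    "identity": {"file_name": "identity.json"},
    "belle_1k": {"file_name": "belle_1k.json"},
}

result = add_num_samples(dataset_info, caps)
print(json.dumps(result, indent=2))
```

With entries like these, the training config would keep the `dataset:` line and drop the comma-separated `max_samples:` line.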

github-actions bot added the pending This problem is yet to be addressed label Dec 16, 2024

hiyouga closed this as completed Dec 17, 2024

hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 17, 2024
