1. Commonsense QA -- 1 dataset
   a. PIQA validation set, 1838 samples; testing takes 3 min 35 s
2. Code -- 1 dataset
   a. HumanEval pass@1; code execution not implemented
3. Math -- 1 dataset
   a. GSM8K 8-shot; the few-shot examples are randomly selected from the test set. Done; testing takes 1 hour on one 80 GB A800
4. MMLU -- 1 dataset
   a. 5-shot. Done; 14042 samples; testing takes 40 min on one 80 GB A800
5. BookSum -- 1 dataset
   a. Not a basic ability
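
The GSM8K setup above draws its 8 few-shot examples at random from the test set. A minimal sketch of one way to do that is shown below. The function name `build_few_shot_prompt`, the `question`/`answer` field names, the prompt template, the per-item seeding, and the choice to exclude the question being evaluated from its own exemplars are all assumptions for illustration; the notes do not specify them.

```python
import random


def build_few_shot_prompt(dataset, target_index, k=8, seed=0):
    """Build a k-shot prompt for dataset[target_index].

    Exemplars are sampled from the same split as the target item.
    Assumption: the target item itself is excluded from the exemplars,
    so the model never sees the answer to the question being asked.
    """
    # Seed per item so each prompt is reproducible across runs.
    rng = random.Random(seed + target_index)
    candidates = [i for i in range(len(dataset)) if i != target_index]
    shot_indices = rng.sample(candidates, k)

    parts = [
        f"Question: {dataset[i]['question']}\nAnswer: {dataset[i]['answer']}"
        for i in shot_indices
    ]
    # The target question goes last, with the answer left open.
    parts.append(f"Question: {dataset[target_index]['question']}\nAnswer:")
    return "\n\n".join(parts), shot_indices


# Toy stand-in for the real test set (hypothetical data).
toy = [{"question": f"What is {n} + {n}?", "answer": str(2 * n)} for n in range(20)]
prompt, shots = build_few_shot_prompt(toy, target_index=3)
```

Sampling exemplars from the test set means every test item can also appear as a solved example in other items' prompts, which is worth keeping in mind when comparing against results that draw exemplars from the training split.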