loss turns to 0 after several steps for llama2 #13

Open
liuxiaozhu01 opened this issue Dec 9, 2024 · 0 comments

The script I run is shown below:

#!/bin/bash

# model path and name
MODEL="/root/Llama-2-7b-hf"
MODEL_NAME="llama-2-7b-hf"

# hyper params
BS=16
SEED=0
TRAIN=1000
DEV=100
EVAL=1000
STEPS=20000
EVAL_STEPS=500
NEPOCHS=5

# mode and task
MODE="ft"
TASK="WinoGrande"

# tag
TAG=mezo-$MODE-$STEPS-$BS-$SEED

# hyper params for zo
LEARNING_RATES=(1e-7)
ZO_EPS=(1e-4)
WEIGHT_DECAYS=(0)

for LR in "${LEARNING_RATES[@]}"; do
  for EPS in "${ZO_EPS[@]}"; do
    for WD in "${WEIGHT_DECAYS[@]}"; do
      echo "Running with learning_rate=$LR, zo_eps=$EPS and weight_decay=$WD"
      CUDA_VISIBLE_DEVICES=3 python run.py \
        --model_name $MODEL \
        --task_name $TASK \
        --output_dir result/$TASK-${MODEL_NAME}-$TAG \
        --tag $TAG \
        --train_set_seed $SEED \
        --num_train $TRAIN --num_dev $DEV --num_eval $EVAL \
        --logging_steps 10 \
        --max_steps $STEPS \
        --num_train_epochs $NEPOCHS --no_reparam \
        --trainer zo_sgd --load_float16 \
        --learning_rate $LR --zo_eps $EPS --weight_decay $WD \
        --per_device_train_batch_size $BS \
        --lr_scheduler_type "constant" \
        --load_best_model_at_end --evaluation_strategy steps --save_strategy steps --save_total_limit 1 \
        --eval_steps $EVAL_STEPS --save_steps $EVAL_STEPS \
        --train_as_classification=False --perturbation_mode=two_side
    done
  done
done
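
Note that although the script passes NEPOCHS=5, --max_steps takes precedence over --num_train_epochs in the HF Trainer, which is why the log below reports 318 epochs rather than 5. A quick sanity check of that number (assuming the usual ceil-based steps-per-epoch bookkeeping; this is just illustrative arithmetic, not code from the repo):

import math

num_train = 1000   # --num_train
batch_size = 16    # --per_device_train_batch_size
max_steps = 20000  # --max_steps

steps_per_epoch = math.ceil(num_train / batch_size)   # 63, matches the "63/20000" per-epoch progress below
print(math.ceil(max_steps / steps_per_epoch))         # 318, matches "Num Epochs = 318" in the log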

The output is:

2024-12-09 21:22:51,772 - INFO - Sample train set 1100/2558
2024-12-09 21:22:51,772 - INFO - ... including dev set 100 samples
2024-12-09 21:22:51,772 - INFO - Loading model with FP16...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00,  6.04s/it]
2024-12-09 21:23:04,614 - INFO - Done with 12.84s
2024-12-09 21:23:04,683 - INFO - Dev samples: 100
2024-12-09 21:23:04,684 - INFO - Train samples: 1000
2024-12-09 21:23:04,684 - INFO - Eval sample length is 1000
2024-12-09 21:23:04,684 - INFO - Tokenizing training samples...
2024-12-09 21:23:05,719 - INFO - Done with 1.03s
/root/miniconda3/envs/mezo/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2024-12-09 21:23:05,771 - INFO - ***** Running training *****
2024-12-09 21:23:05,771 - INFO -   Num examples = 1000
2024-12-09 21:23:05,771 - INFO -   Num Epochs = 318
2024-12-09 21:23:05,771 - INFO -   Instantaneous batch size per device = 16
2024-12-09 21:23:05,772 - INFO -   Total train batch size (w. parallel, distributed & accumulation) = 16
2024-12-09 21:23:05,772 - INFO -   Gradient Accumulation steps = 1
2024-12-09 21:23:05,772 - INFO -   Total optimization steps = 20000
2024-12-09 21:23:05,773 - INFO -   Number of trainable parameters = 6738415616
  0%|          | 0/20000 [00:00<?, ?it/s]
### layer-wise gradient sparsity = None
-------------------------- Training Epoch 0 --------------------------
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.5133, 'learning_rate': 1e-07, 'epoch': 0.16}                                                                                                                             
{'loss': 2.6514, 'learning_rate': 1e-07, 'epoch': 0.32}                                                                                                                             
{'loss': 2.5057, 'learning_rate': 1e-07, 'epoch': 0.48}                                                                                                                             
{'loss': 2.583, 'learning_rate': 1e-07, 'epoch': 0.63}                                                                                                                              
{'loss': 2.4912, 'learning_rate': 1e-07, 'epoch': 0.79}                                                                                                                             
{'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.95}                                                                                                                                
  0%|▍         | 63/20000 [00:33<2:49:30,  1.96it/s]
-------------------------- Training Epoch 1 --------------------------
{'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 1.11}                                                      

I know that the learning rate and zo_eps are critical for the optimization process (my understanding of the update they control is sketched below). Are there known-good hyper-parameters for this setting, or is there something else I set wrong?
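
For reference, here is a minimal sketch of the two-sided (SPSA-style) zeroth-order update that --trainer zo_sgd --perturbation_mode=two_side refers to. This is only an illustration of what lr and zo_eps control, not the actual code in run.py; model, loss_fn, and batch are placeholders.

import torch

@torch.no_grad()
def zo_sgd_step(model, loss_fn, batch, lr=1e-7, zo_eps=1e-4, seed=0):
    def perturb(scale):
        # Re-seeding lets the same noise z be regenerated instead of stored.
        torch.manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, device=p.device, dtype=p.dtype)
            p.data.add_(scale * zo_eps * z)

    perturb(+1.0)                       # theta + eps * z
    loss_plus = loss_fn(model, batch)
    perturb(-2.0)                       # theta - eps * z
    loss_minus = loss_fn(model, batch)
    perturb(+1.0)                       # back to theta

    # The entire gradient signal is this single scalar: a finite difference of
    # two losses at scale zo_eps. The parameter step is lr * projected_grad * z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * zo_eps)

    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.randn(p.shape, device=p.device, dtype=p.dtype)
        p.data.add_(-lr * projected_grad * z)

    return loss_plus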

Would anyone be able to help? Many thanks for considering my request.
