
Summary of Optimization Techniques for Training BERT with the Huggingface Transformers Library

What the Experiments Cover

Why use the HuggingFace software stack?

  • As AI infrastructure software and frameworks have evolved, the training process has become much more complex. AI training often combines many different configurations, such as mixed-precision training, low-bit optimizers, quantized gradients, and so on, so training directly with raw PyTorch becomes tedious and cumbersome.
  • For this reason, HuggingFace, Microsoft, and others have released higher-level, more user-friendly training frameworks. These frameworks track the academic frontier closely and keep integrating the latest results into their libraries, strengthening their competitiveness and influence.

Main content: use the Huggingface Transformers library, configure different training parameters/options, and understand the meaning and principles behind these parameters and optimizations.

(Since I am also learning as I go, some of my understanding may be imperfect.)

Hardware and Software Environment

  • Ubuntu 22.04
  • 1 * RTX 3080 10GB GPU
  • PyTorch 2.0
  • CUDA-12.0
  • Huggingface-related libraries: transformers, datasets, accelerate

Baseline Training Process

Below is the most basic training process, without any optimization.

  • The training data is randomly generated dummy data
  • To monitor GPU memory during training, the pynvml API is used to print how much GPU memory is occupied
  • Model choice: BERT-base; since training runs on a single GPU, a smaller model makes the effects easier to observe
  • Training API: the Trainer API from the Huggingface Transformers library, which already encapsulates the training loop
  • Training results: observe GPU memory usage and training throughput
import numpy as np
from datasets import Dataset
from pynvml import *
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging

logging.set_verbosity_error()

seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    'labels': np.random.randint(0,1, (dataset_size))
}

ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

print_gpu_utilization()

# Load BERT-base onto the GPU; `model` is used by Trainer below
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')
print_gpu_utilization()

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}

training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output:
{'train_runtime': 16.0498, 'train_samples_per_second': 31.901, 'train_steps_per_second': 7.975, 'train_loss': 0.013442776165902615, 'epoch': 1.0}
Time: 16.05
Samples/second: 31.90
GPU memory occupied: 5790 MB

Optimization 1: + Gradient Accumulation

Gradient accumulation trades time for space (here, GPU memory): it allows training with a larger effective batch size under a limited GPU memory budget. In conventional training, the gradient is computed and the weights are updated after every batch. With gradient accumulation, the gradient is still computed for every batch, but the gradients of several batches are accumulated together and the weights are updated only once after those batches.

Comparison:

  • Without gradient accumulation:
for idx, batch in enumerate(dataloader):
     # Forward
     loss = model(batch).loss
     # Backward
     loss.backward()
     ...

     # Optimizer update after every batch
     optimizer.step()
     optimizer.zero_grad()
     ...
  • With gradient accumulation (see the code below):
    • One might ask: where is the accumulation in this code? It comes from PyTorch itself: if optimizer.zero_grad() is not called after loss.backward(), the gradients of the current batch are simply added onto the gradients accumulated from the previous batches.
    • Parameter: gradient_accumulation_steps is the number of batches between two optimizer updates, so the effective training batch size = per_device_train_batch_size * gradient_accumulation_steps (e.g. 1 * 4 = 4 in the test below).
for idx, batch in enumerate(dataloader):
     # Forward
     loss = model(batch).loss
     # Scale the loss so the accumulated gradient matches a single large-batch update
     loss = loss / training_args.gradient_accumulation_steps
     # Backward: gradients accumulate in .grad because zero_grad() is not called every batch
     loss.backward()
     ...

     if (idx + 1) % training_args.gradient_accumulation_steps == 0:
         # Optimizer update once every gradient_accumulation_steps batches
         optimizer.step()
         optimizer.zero_grad()
     ...

Test code:

training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Keeping the effective training batch size unchanged, the output shows that GPU memory usage drops noticeably (5790MB --> 4169MB), while training throughput drops slightly.
per_device_train_batch_size=1, gradient_accumulation_steps=4
{'train_runtime': 19.7445, 'train_samples_per_second': 25.931, 'train_steps_per_second': 6.483, 'train_loss': 0.01618509739637375, 'epoch': 1.0}
Time: 19.74
Samples/second: 25.93
GPU memory occupied: 4169 MB

Optimization 2: + Gradient Checkpointing

Why? During the backward pass, computing the gradient of a layer's weights requires the activations that this layer produced during the forward pass. Every layer's forward activations therefore have to stay in GPU memory until backward, which clearly increases memory usage.

How gradient checkpointing works: only the activations of selected layers are kept (the selected layers are called checkpoint nodes). During the backward pass, the activations needed by the current layer are recomputed from the nearest preceding checkpoint (recomputation).
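Outside of Trainer, the same mechanism is available directly in PyTorch as torch.utils.checkpoint. Below is a minimal sketch (the TinyBlock / CheckpointedStack names are made up for illustration and are not part of the experiment above): each block's input acts as a checkpoint node, and the activations inside the block are recomputed during the backward pass.

import torch
from torch.utils.checkpoint import checkpoint

class TinyBlock(torch.nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.ff(x)

class CheckpointedStack(torch.nn.Module):
    def __init__(self, num_layers=4, dim=128):
        super().__init__()
        self.blocks = torch.nn.ModuleList([TinyBlock(dim) for _ in range(num_layers)])

    def forward(self, x):
        for block in self.blocks:
            # Only the block input (the checkpoint node) is saved for backward;
            # everything computed inside the block is recomputed when needed.
            x = checkpoint(block, x, use_reentrant=False)
        return x

stack = CheckpointedStack()
x = torch.randn(8, 128, requires_grad=True)
stack(x).sum().backward()  # backward triggers recomputation inside each block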

Pros vs. cons:

  • Pro: only the activations of some layers are kept, which lowers GPU memory usage
  • Con: recomputation adds extra compute, so training throughput drops.

Code:

training_args = TrainingArguments(
    per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output: GPU memory drops further (4169MB --> 3706MB), throughput drops: 25.93 --> 20.40
{'train_runtime': 25.1014, 'train_samples_per_second': 20.397, 'train_steps_per_second': 5.099, 'train_loss': 0.015386142767965794, 'epoch': 1.0}
Time: 25.10
Samples/second: 20.40
GPU memory occupied: 3706 MB

Optimization 3: + Mixed-Precision (Low-Precision) Training

Core idea: store weights, activations, and gradients in low-precision numeric formats, and perform the computation in low precision as well.

Pros vs. cons:

  • Pro: low precision reduces the memory footprint and the compute cost, improving training speed and throughput
  • Con: used carelessly, it can cause numerical overflow/underflow and make training diverge

AI training normally stores and computes in floating-point formats. The low-bit floating-point formats currently supported on NVIDIA GPUs: TF32 --> FP16 --> BF16 --> FP8
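Conceptually, setting fp16=True makes Trainer perform something like the hand-written mixed-precision step sketched below (a toy linear model and random data, not the BERT setup above): the weights stay in FP32 as the master copy, autocast runs eligible ops in FP16, and GradScaler scales the loss so that small FP16 gradients do not underflow.

import torch

toy_model = torch.nn.Linear(512, 2).cuda()                      # FP32 master weights
toy_opt = torch.optim.AdamW(toy_model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 512, device='cuda')
labels = torch.randint(0, 2, (4,), device='cuda')

with torch.autocast(device_type='cuda', dtype=torch.float16):
    # Eligible ops (e.g. the matmul inside Linear) run in FP16
    loss = torch.nn.functional.cross_entropy(toy_model(x), labels)

scaler.scale(loss).backward()   # backward on the scaled loss
scaler.step(toy_opt)            # unscales the grads; skips the step on inf/nan
scaler.update()
toy_opt.zero_grad()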


Code: set fp16=True (or bf16=True) to enable mixed precision with the corresponding data type

training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)

trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print_summary(result)

Output: throughput improves (20.40 --> 25.91), but GPU memory usage actually increases slightly, because an FP32 master copy of the weights is kept in addition to the FP16 working copy
{'train_runtime': 19.76, 'train_samples_per_second': 25.911, 'train_steps_per_second': 6.478, 'train_loss': 0.010953620076179504, 'epoch': 1.0}
Time: 19.76
Samples/second: 25.91
GPU memory occupied: 3829 MB

Optimization 4: Low-Precision Optimizer (8-bit Adam)
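Adam keeps two extra state tensors (momentum and variance) for every parameter; the 8-bit Adam from bitsandbytes stores these states in 8 bits instead of 32 bits, shrinking the optimizer-state memory. A back-of-the-envelope estimate of the saving (my own arithmetic, assuming roughly 110M parameters for bert-base-uncased; not a measured number):

num_params = 110_000_000                           # ~bert-base-uncased
adam_states = 2                                    # momentum + variance per parameter
fp32_mb = num_params * adam_states * 4 / 1024**2   # 4 bytes per FP32 state value
int8_mb = num_params * adam_states * 1 / 1024**2   # 1 byte per 8-bit state value
print(f'FP32 Adam states:  ~{fp32_mb:.0f} MB')     # ~839 MB
print(f'8-bit Adam states: ~{int8_mb:.0f} MB')     # ~210 MB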

# 8bit Adam
import numpy as np
from datasets import Dataset
from pynvml import *
import torch
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer, logging

# 8bit Adam
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names

# https://huggingface.co/docs/transformers/perf_train_gpu_one

logging.set_verbosity_error()

seq_len, dataset_size = 512, 512
dummy_data = {
    'input_ids': np.random.randint(100, 30000, (dataset_size, seq_len)),
    'labels': np.random.randint(0,1, (dataset_size))
}

ds = Dataset.from_dict(dummy_data)
ds.set_format('pt')

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f'GPU memory occupied: {info.used // 1024**2} MB')

def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()


print_gpu_utilization()

torch.ones((1, 1)).to("cuda")
print_gpu_utilization()

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased').to('cuda')
print_gpu_utilization()

default_args = {
    "output_dir": "tmp",
    "evaluation_strategy": "steps",
    "num_train_epochs": 1,
    "log_level": "error",
    "report_to": "none",
}


# First we group the model's parameters into two groups: weight decay is applied
# to one group and not to the other. Usually, biases and layer-norm parameters
# are not weight-decayed. Then we reuse the same hyperparameters as the AdamW
# optimizer used previously.

decay_parameters = get_parameter_names(model, forbidden_layer_types=[nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if 'bias' not in name]


training_args = TrainingArguments(per_device_train_batch_size=1,
                                  gradient_accumulation_steps=4,
                                  gradient_checkpointing=True,
                                  fp16=True,
                                  **default_args)

optimizer_grouped_parameters = [
    {
        'params': [p for n,p in model.named_parameters() if n in decay_parameters],
        'weight_decay': training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate
)

trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
result = trainer.train()
print_summary(result)

Output:
{'train_runtime': 17.5487, 'train_samples_per_second': 29.176, 'train_steps_per_second': 7.294, 'train_loss': 0.015325695276260376, 'epoch': 1.0}
Time: 17.55
Samples/second: 29.18
GPU memory occupied: 3161 MB

Reference

  • https://huggingface.co/docs/transformers/perf_train_gpu_one
  • NVIDIA GPU White Paper

