Techniques for Fine-tuning LLMs
Different ways to fine-tune
- Native PyTorch: example: fine-tune Llama 2 on Alpaca
- PyTorch Lightning: example: fine-tune PaliGemma for image tasks
- Transformers' Trainer: example: FIM fine-tune StarCoder and CodeLlama for a code copilot
- trl's SFTTrainer: example: fine-tune Llama 2 on Alpaca; example: chat fine-tune for a code copilot (see the sketch after this list)
- axolotl: example: fine-tune TinyLlama
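To give a flavor of the higher-level options, here is a minimal sketch of supervised fine-tuning with trl's SFTTrainer on Alpaca-style data. The model id, dataset id (tatsu-lab/alpaca), prompt template, and hyper-parameters are illustrative, and SFTConfig argument names can differ between trl versions.

```python
# Minimal sketch of supervised fine-tuning with trl's SFTTrainer on Alpaca-style
# data. Model id, dataset id, prompt template, and hyper-parameters are
# illustrative; SFTConfig argument names can differ between trl versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def to_text(example):
    # Collapse instruction / optional input / output into one training string.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # gated model; any causal LM id works here
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="llama2-alpaca-sft",
        dataset_text_field="text",     # column created by to_text above
        max_seq_length=2048,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```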
Memory optimization for model training
Memory constraints: increasing the batch size increases training throughput (samples/sec) and reduces training time. To fit more samples into memory, we can apply memory optimization techniques; however, those techniques themselves slow down training. There is a sweet spot that balances batch size against memory optimization techniques, and finding it is part of hyper-parameter tuning.
Method/tool | Improves training speed | Optimizes memory utilization | Note |
---|---|---|---|
Batch size choice | Yes | Yes | |
Gradient accumulation | No | Yes | |
Gradient checkpointing | No | Yes | ~20% slower |
Mixed precision training | Yes | (No) | |
Optimizer choice | Yes | Yes | |
Data preloading | Yes | No | |
DeepSpeed Zero | No | Yes | |
torch.compile | Yes | No | |
Parameter-Efficient Fine Tuning (PEFT) | No | Yes | |
Flash Attention 2 | Yes | ? | |
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
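Several rows of the table map directly onto transformers' TrainingArguments flags. A minimal sketch with illustrative, untuned values (the 8-bit optimizer additionally requires the bitsandbytes package):

```python
# How several rows of the table map onto transformers' TrainingArguments.
# Values are illustrative, not tuned.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,    # batch size choice
    gradient_accumulation_steps=8,    # effective batch of 32 without extra memory
    gradient_checkpointing=True,      # recompute activations: ~20% slower, far less memory
    bf16=True,                        # mixed precision training
    optim="adamw_bnb_8bit",           # optimizer choice: 8-bit optimizer states (bitsandbytes)
    dataloader_num_workers=4,         # data preloading in background workers
    dataloader_pin_memory=True,
    torch_compile=True,               # compile the model's forward pass
    # deepspeed="ds_zero2.json",      # DeepSpeed ZeRO via a config file
)
```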
Parameter tuning
Sequence length
- Longer sequences increase the cost of attention, which scales quadratically with sequence length.
- With newer mechanisms such as FlashAttention, the attention cost is relatively small, so we can afford longer sequences. For very large networks, the cost of attention is largely drowned out by the cost of the rest of the network.
- For a 30B model there is no penalty going from sequence length 2048 to 4096, and only about a 15% penalty at 8192; for a 7B model the penalty escalates much more quickly (see the FLOPs sketch after this list).
- The data needs to justify using a longer sequence length.
- Many standardized datasets do not contain many long sequences.
- Will the model be able to take advantage of long-sequence training?
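To see why the penalty shrinks with model size, compare per-layer FLOPs: the attention score computation scales as O(s² · d), while the linear projections and feed-forward block scale as O(s · d²), so attention's share falls as the hidden size d grows. A back-of-the-envelope sketch (the hidden sizes and simple FLOP counts are rough approximations, not exact Llama configs):

```python
# Back-of-the-envelope share of per-layer FLOPs spent on the attention score
# computation (QK^T and attn @ V) versus the linear projections + feed-forward
# block. Hidden sizes and FLOP counts are approximations.
def attention_share(seq_len: int, d_model: int, ffn_mult: float = 4.0) -> float:
    score = 2 * seq_len**2 * d_model                    # QK^T and attn @ V
    linear = seq_len * (4 + 2 * ffn_mult) * d_model**2  # Q/K/V/O projections + FFN matmuls
    return score / (score + linear)

for name, d_model in [("7B", 4096), ("30B", 6656)]:
    for seq_len in (2048, 4096, 8192):
        pct = attention_share(seq_len, d_model)
        print(f"{name} d={d_model} seq={seq_len}: ~{pct:.0%} of FLOPs in attention scores")
```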
Positional encodings determine where each token sits in the sequence.
- Llama uses RoPE; another choice is ALiBi (a minimal sketch of RoPE follows).
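For intuition, a minimal sketch of RoPE: each pair of channels in a query/key vector is rotated by an angle that grows with the token's position. This is simplified; real implementations cache the cos/sin tables and apply the rotation per attention head.

```python
# Minimal sketch of rotary position embeddings (RoPE): each pair of channels
# is rotated by an angle that depends on the token position. Simplified; real
# implementations cache the cos/sin tables and apply this per attention head.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    seq_len, dim = x.shape                 # (positions, head_dim), head_dim even
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)   # toy queries: 16 positions, head dim 64
q_rot = rope(q)
```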
Tokenizers
Tokenizers such as BPE and SentencePiece convert text into numerical representations. In certain scenarios, it can be beneficial to train a custom tokenizer and establish a unique vocabulary. For instance, tokens in programming languages can differ significantly from those in natural language, and tokens in non-English languages may vary greatly from those in English.
The size of the vocabulary influences the efficiency of the model. A larger vocabulary may decrease computational efficiency because of the larger embedding and output layers, but it also improves token efficiency, since the same text can be represented with fewer tokens, so some of the computational cost is recouped. The optimal vocabulary size also depends on the domain: a common choice is around 50,000, but sizes from 25,000 to 100,000 have been used successfully in production.
However, training a custom tokenizer is a complex task, and it is strongly recommended to compare the results against existing generic tokenizers. When we modify the tokenizer, or even just adjust the vocabulary size, the loss values of the two models may not be directly comparable.
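If you do train a custom tokenizer, transformers' `train_new_from_iterator` retrains an existing fast tokenizer's algorithm (gpt2's BPE in the sketch below) on your own corpus. The toy corpus, base tokenizer, and vocabulary size here are illustrative placeholders.

```python
# Sketch: retrain an existing fast tokenizer's algorithm (gpt2's BPE here) on a
# domain corpus so the vocabulary reflects domain-specific tokens. The toy
# corpus, base tokenizer, and vocab_size are illustrative.
from transformers import AutoTokenizer

corpus = [
    "def add(a, b):\n    return a + b",
    "class Stack:\n    def __init__(self):\n        self.items = []",
] * 1000   # stand-in for a real code / foreign-language corpus

base = AutoTokenizer.from_pretrained("gpt2")

def batch_iterator(batch_size: int = 256):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=25_000)
new_tokenizer.save_pretrained("domain-tokenizer")

# Token-efficiency check: fewer tokens for the same text is better.
sample = corpus[0]
print(len(base(sample)["input_ids"]), "->", len(new_tokenizer(sample)["input_ids"]))
```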
References:
- Weights & Biases course on LLM Fine-tuning Techniques with Jonathan Frankle of MosaicML
- Hugging Face page on Performance and Scalability, which covers the latest techniques for both training and inference
- Karpathy's excellent Tokenizer tutorial