Techniques for Fine-tuning LLMs
Different ways to fine-tune
- Native PyTorch: example: fine-tune Llama 2 on Alpaca
- PyTorch Lightning: example: fine-tune PaliGemma for image tasks
- Transformers' Trainer: example: FIM fine-tune StarCoder and CodeLlama for a code copilot
- trl's SFTTrainer: example: fine-tune Llama 2 on Alpaca; example: chat fine-tune for a code copilot (see the sketch after this list)
- axolotl: example: fine-tune TinyLlama
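To give a flavor of the higher-level options, here is a minimal sketch of supervised fine-tuning with trl's SFTTrainer on Alpaca-style data. The model id, dataset id (tatsu-lab/alpaca), prompt template, and hyper-parameters are illustrative, and SFTConfig argument names can differ between trl versions.

```python
# Minimal sketch of supervised fine-tuning with trl's SFTTrainer on Alpaca-style
# data. Model id, dataset id, prompt template, and hyper-parameters are
# illustrative; SFTConfig argument names can differ between trl versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def to_text(example):
    # Collapse instruction / optional input / output into one training string.
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # gated model; any causal LM id works here
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="llama2-alpaca-sft",
        dataset_text_field="text",     # column created by to_text above
        max_seq_length=2048,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```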
Memory optimization for model training
Memory constraints: increasing the batch size increases training throughput (samples/sec) and reduces training time. To fit more samples into memory, we can apply memory optimization techniques; however, those techniques themselves slow down training. There is a sweet spot that balances batch size against memory optimization techniques, and finding it is part of hyper-parameter tuning.
Method/tool | Improves training speed | Optimizes memory utilization | Note |
---|---|---|---|
Batch size choice | Yes | Yes | |
Gradient accumulation | No | Yes | |
Gradient checkpointing | No | Yes | ~20% slower |
Mixed precision training | Yes | (No) | |
Optimizer choice | Yes | Yes | |
Data preloading | Yes | No | |
DeepSpeed Zero | No | Yes | |
torch.compile | Yes | No | |
Parameter-Efficient Fine Tuning (PEFT) | No | Yes | |
Flash Attention 2 | Yes | ? | |
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
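Several rows of the table map directly onto transformers' TrainingArguments flags. A minimal sketch with illustrative, untuned values (the 8-bit optimizer additionally requires the bitsandbytes package):

```python
# How several rows of the table map onto transformers' TrainingArguments.
# Values are illustrative, not tuned.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,    # batch size choice
    gradient_accumulation_steps=8,    # effective batch of 32 without extra memory
    gradient_checkpointing=True,      # recompute activations: ~20% slower, far less memory
    bf16=True,                        # mixed precision training
    optim="adamw_bnb_8bit",           # optimizer choice: 8-bit optimizer states (bitsandbytes)
    dataloader_num_workers=4,         # data preloading in background workers
    dataloader_pin_memory=True,
    torch_compile=True,               # compile the model's forward pass
    # deepspeed="ds_zero2.json",      # DeepSpeed ZeRO via a config file
)
```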
Parameter tuning
Sequence length
- Longer sequences increase the cost of attention, which scales quadratically with sequence length.
- With newer mechanisms such as FlashAttention, the attention cost is relatively small, so we can afford longer sequences. For very large networks, the cost of attention is largely drowned out by the cost of the rest of the network.
- For a 30B model there is no penalty going from sequence length 2048 to 4096, and only about a 15% penalty at 8192; for a 7B model the penalty escalates much more quickly (see the FLOPs sketch after this list).
- The data needs to justify using a longer sequence length.
- Many standardized datasets do not contain many long sequences.
- Will the model be able to take advantage of long-sequence training?
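To see why the penalty shrinks with model size, compare per-layer FLOPs: the attention score computation scales as O(s² · d), while the linear projections and feed-forward block scale as O(s · d²), so attention's share falls as the hidden size d grows. A back-of-the-envelope sketch (the hidden sizes and simple FLOP counts are rough approximations, not exact Llama configs):

```python
# Back-of-the-envelope share of per-layer FLOPs spent on the attention score
# computation (QK^T and attn @ V) versus the linear projections + feed-forward
# block. Hidden sizes and FLOP counts are approximations.
def attention_share(seq_len: int, d_model: int, ffn_mult: float = 4.0) -> float:
    score = 2 * seq_len**2 * d_model                    # QK^T and attn @ V
    linear = seq_len * (4 + 2 * ffn_mult) * d_model**2  # Q/K/V/O projections + FFN matmuls
    return score / (score + linear)

for name, d_model in [("7B", 4096), ("30B", 6656)]:
    for seq_len in (2048, 4096, 8192):
        pct = attention_share(seq_len, d_model)
        print(f"{name} d={d_model} seq={seq_len}: ~{pct:.0%} of FLOPs in attention scores")
```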
Positional encodings determine where each token sits in the sequence.
- Llama uses RoPE; another choice is ALiBi (a minimal sketch of RoPE follows).
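For intuition, a minimal sketch of RoPE: each pair of channels in a query/key vector is rotated by an angle that grows with the token's position. This is simplified; real implementations cache the cos/sin tables and apply the rotation per attention head.

```python
# Minimal sketch of rotary position embeddings (RoPE): each pair of channels
# is rotated by an angle that depends on the token position. Simplified; real
# implementations cache the cos/sin tables and apply this per attention head.
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    seq_len, dim = x.shape                 # (positions, head_dim), head_dim even
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)   # toy queries: 16 positions, head dim 64
q_rot = rope(q)
```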
Tokenizers
Tokenizers such as BPE and SentencePiece convert text into numerical representations. In certain scenarios, it can be beneficial to train a custom tokenizer and establish a unique vocabulary. For instance, tokens in programming languages can differ significantly from those in natural language, and tokens in non-English languages may vary greatly from those in English.
The size of the vocabulary influences the efficiency of the model. A larger vocabulary may decrease computational efficiency because of the larger embedding and output layers, but it also improves token efficiency, since the same text can be represented with fewer tokens, so some of the computational cost is recouped. The optimal vocabulary size also depends on the domain: a common choice is around 50,000, but sizes from 25,000 to 100,000 have been used successfully in production.
However, training a custom tokenizer is a complex task, and it is strongly recommended to compare the results against existing generic tokenizers. When we modify the tokenizer, or even just adjust the vocabulary size, the loss values of the two models may not be directly comparable.
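If you do train a custom tokenizer, transformers' `train_new_from_iterator` retrains an existing fast tokenizer's algorithm (gpt2's BPE in the sketch below) on your own corpus. The toy corpus, base tokenizer, and vocabulary size here are illustrative placeholders.

```python
# Sketch: retrain an existing fast tokenizer's algorithm (gpt2's BPE here) on a
# domain corpus so the vocabulary reflects domain-specific tokens. The toy
# corpus, base tokenizer, and vocab_size are illustrative.
from transformers import AutoTokenizer

corpus = [
    "def add(a, b):\n    return a + b",
    "class Stack:\n    def __init__(self):\n        self.items = []",
] * 1000   # stand-in for a real code / foreign-language corpus

base = AutoTokenizer.from_pretrained("gpt2")

def batch_iterator(batch_size: int = 256):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = base.train_new_from_iterator(batch_iterator(), vocab_size=25_000)
new_tokenizer.save_pretrained("domain-tokenizer")

# Token-efficiency check: fewer tokens for the same text is better.
sample = corpus[0]
print(len(base(sample)["input_ids"]), "->", len(new_tokenizer(sample)["input_ids"]))
```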
References:
- Weights & Biases course on LLM Fine-tuning Techniques with Jonathan Frankle of MosaicML
- Hugging Face page on Performance and Scalability, which covers the latest techniques for both training and inference
- Karpathy's excellent Tokenizer tutorial