Ubuntu GPU Server Setup
Install CUDA
sudo ubuntu-drivers devices # Check driver information
lspci | grep -i nvidia # Check for the NVIDIA GPU
sudo lshw -C display # Check display hardware details
nvcc -V # Check the CUDA compiler version
watch -n 1 nvidia-smi # Monitor GPU utilization
Install Torch
Upgrade pip first
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio
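After installation, a quick sanity check (a minimal sketch; output depends on your setup) confirms that PyTorch can see the GPU:
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if CUDA is usable
print(torch.cuda.device_count())  # number of visible GPUs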
If using Poetry instead of pip, try poetry env use python3.11 in case we encounter torch installation issues. Also try pip install to see which versions are available to install.
Sometimes we encounter torch-related installation issues with Poetry on Mac, and sometimes the order of installation matters. In the latest test, poetry add sentence-transformers does work, which will install torch, transformers and sentence-transformers.
Install Unsloth
We will use Unsloth to fine tune Llama 3 on local GPUs.
Create the environment with torch and huggingface
python3.11 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio
pip install ipython ipywidgets huggingface_hub
Install unsloth. Check the installation page for exact commands, e.g.:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
Check the installation
nvcc -V
python -m xformers.info
python -m bitsandbytes
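As a quick smoke test, here is a minimal sketch that loads a 4-bit model with Unsloth (the model name and max_seq_length are illustrative choices, not from the original notes):
from unsloth import FastLanguageModel
# Load an example 4-bit quantized Llama 3 base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)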
Install Wandb
wandb can be used to track progress:
pip install wandb
wandb login
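A minimal logging sketch (the project and metric names are hypothetical):
import wandb
wandb.init(project="llama3-finetune")         # hypothetical project name
wandb.log({"train/loss": 0.42, "step": 100})  # log metrics during training
wandb.finish()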
Download gated models from Huggingface
In order to download gated models (e.g. Llama) from Huggingface, we can:
- Set the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxxxx
echo $HF_TOKEN
- Or assign it directly in Python:
import os
os.environ["HF_TOKEN"] = "hf_xxxxx"
Then we can use it:
token = os.getenv("HF_TOKEN")
model = AutoPeftModelForCausalLM.from_pretrained(model_id, use_auth_token=True)
We can also use huggingface-cli login at the terminal.
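A hedged sketch of downloading a gated repo with huggingface_hub, assuming HF_TOKEN is set (the model id is an example gated repo):
import os
from huggingface_hub import snapshot_download
# Downloads the gated repo using the token from the environment
snapshot_download("meta-llama/Meta-Llama-3-8B", token=os.environ["HF_TOKEN"])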
Deal with ModuleNotFoundError: No module named 'packaging' when installing flash-attn
Do:
pip3 install --upgrade pip setuptools wheel
pip3 install packaging
pip3 install flash-attn
The error will change to ModuleNotFoundError: No module named 'torch'. Now go to the PyTorch website to install torch!
For Linux with CUDA or Mac without CUDA:
pip3 install torch torchvision torchaudio
After that, if it complains:
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root
we need to install the CUDA toolkit. Here is a related thread, but it doesn't seem to solve the problem in our case (and the instructions above worked).
Training from a remote server
monitoring
CPU
sudo apt install htop # For Debian/Ubuntu
sudo apt install glances # For Debian/Ubuntu
GPU
watch -n 1 nvidia-smi
fg and bg
- To start a job in the background, append & to the command.
- To put an already running job into the background: press Ctrl + Z to suspend it, then run bg to resume it in the background.
job persistence
Starts script.py in the background, makes it ignore the hangup signal, and redirects its output to output.log.
nohup python script.py > output.log &
- Persistence: Processes run with nohup will not terminate when the user logs out or when the terminal is closed. They ignore the HUP (hangup) signal.
- Output Management: By default, nohup redirects the standard output and standard error to a file called nohup.out if no output file is explicitly specified. This helps in saving the output for later review.
Get back the jobs
If we have not logged out, use jobs to list the background jobs and fg %job_number to bring one back.
If we already logged out and are running a nohup job, use ps aux | grep cmd_name to find it.
Note that we can also use tmux to keep jobs running in sessions.
SSH forwarding
In ~/.ssh/config
:
Host server
HostName host_ip or dns_name
Port xxx
User xxx
IdentityFile ~/.ssh/xxxx
IdentitiesOnly yes
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
LocalForward 6006 localhost:6006 # port forwarding for Tensorboard
ServerAliveInterval 120
Then ssh server.
Sometimes the prior sessions might not have released the local ports properly:
bind [127.0.0.1]:xxxx: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: xxxx
We can check for the port usage:
lsof -i :xxxx
and stop the process
kill -9 PID
Sometimes we might need to restart ssh
sudo service ssh restart
Tensorboard
First make sure we installed Tensorboard!
pip install tensorboard
By default, our run will create a runs directory under the output_dir. We can then go to the output_dir and start TensorBoard on the server side:
tensorboard --logdir=runs --bind_all
Assuming TensorBoard port forwarding has been set up as above, we can visit http://localhost:6006/
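For reference, a minimal sketch of how a Huggingface Trainer run writes TensorBoard logs under output_dir (the values are illustrative):
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="outputs",        # the runs directory is created under here
    logging_dir="outputs/runs",  # where the TensorBoard event files go
    report_to="tensorboard",
    logging_steps=10,
)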
Set up Jupyter lab server for remote access
Server side:
jupyter lab --no-browser --ip=0.0.0.0 --port=8888 &
Client side
ssh -L 8888:localhost:8888 your_username@remote_server_address
To make SSH forwarding permanent, edit the ~/.ssh/config file.
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
Then connect to http://localhost:8888
When asked for token, check out the server log to look for a line like: http://0.0.0.0:8888/?token=some_long_token_string
To copy files from the server:
scp 'user@server:/home/user/files/*.txt' .
Model playground
E.g.
https://replicate.com/meta/meta-llama-3-8b
Techniques on Fine-tuning LLMs
Different ways for fine-tuning
- Native PyTorch: example: fine-tune Llama 2 with Alpaca
- PyTorch Lightning: example: fine-tune PaliGemma for images
- Transformers' Trainer: example: FIM fine-tune StarCoder and Codellama for a code co-pilot
- trl's SFTTrainer: example: fine-tune Llama 2 with Alpaca; example: chat fine-tune for a code copilot (see the sketch after this list)
- axolotl: example: fine-tune TinyLlama
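To illustrate the trl route, here is a minimal SFTTrainer sketch (the model and dataset names are examples, and some argument names differ between trl versions):
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # example Alpaca dataset
trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # example base model
    train_dataset=dataset,
    dataset_text_field="text",         # column holding the formatted prompt
    max_seq_length=2048,
    args=TrainingArguments(output_dir="outputs", per_device_train_batch_size=2),
)
trainer.train()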
memory optimization for model training
Memory constraints: increasing the batch size increases training throughput (samples/sec) and reduces training time. To fit more samples into memory, we can use memory optimization techniques. However, memory optimization techniques themselves slow down the training process. There is a sweet spot between batch size and memory optimization techniques; finding it is part of hyper-parameter tuning.
Method/tool | Improves training speed | Optimizes memory utilization | Note |
---|---|---|---|
Batch size choice | Yes | Yes | |
Gradient accumulation | No | Yes | |
Gradient checkpointing | No (-20%) | Yes | |
Mixed precision training | Yes | (No) | |
Optimizer choice | Yes | Yes | |
Data preloading | Yes | No | |
DeepSpeed Zero | No | Yes | |
torch.compile | Yes | No | |
Parameter-Efficient Fine Tuning (PEFT) | No | Yes | |
Flash Attention 2 | Yes | ? | |
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
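A hedged sketch of combining several of these techniques through Huggingface TrainingArguments (the values are illustrative, not recommendations):
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,  # batch size choice
    gradient_accumulation_steps=8,  # gradient accumulation
    gradient_checkpointing=True,    # saves memory, roughly -20% speed
    bf16=True,                      # mixed precision training
    optim="adamw_bnb_8bit",         # memory-efficient optimizer choice
)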
Parameter tuning
Sequence length
- Longer sequences increase the attention cost.
- With new mechanisms such as flash attention, the attention cost is relatively small, so we can afford longer sequences. For really big networks, the cost of attention is completely drowned out by the cost of the rest of the network.
- For a 30B model there is no penalty going from sequence length 2048 to 4096, and only a ~15% penalty at 8192; for a 7B model the penalty escalates quickly.
- The data needs to justify using a longer sequence length:
- Many standardized datasets do not have many long sequences.
- Will the model be able to take advantage of long-sequence training?
Positional encodings determine which token is in which position.
- Llama uses RoPE; another choice could be ALiBi.
Tokenizers
Tokenizers such as BPE and SentencePiece convert text into numerical representations. In certain scenarios, it can be beneficial to train a custom tokenizer and establish a unique vocabulary. For instance, tokens in programming languages can differ significantly from those in natural languages, and tokens in foreign languages may vary greatly from those in English.
The size of the vocabulary can influence the efficiency of the model. A larger vocabulary may decrease computational efficiency due to the increased complexity. However, it can also enhance token efficiency, as a larger vocabulary can represent a piece of text with fewer tokens. Therefore, some of the computational efficiency can be regained. The optimal vocabulary size can also depend on the specific domain. A standard size often used is around 50,000, but sizes ranging from 25,000 to 100,000 have also been successfully utilized in production environments.
However, training a custom tokenizer is a complex task and it’s strongly recommended to compare the results with those from existing generic tokenizers. When we modify our tokenizer or even just adjust the vocabulary size, the loss values between two models may not be directly comparable.
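A minimal sketch of training a custom BPE tokenizer with the Huggingface tokenizers library (the corpus file, vocabulary size and special tokens are illustrative):
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=50000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["domain_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("custom_tokenizer.json")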
References:
- Weights & Biases course on LLM Fine-tuning Techniques with Jonathan Frankle of MosaicML
- Huggingface page on Performance and Scalability covers the latest techniques for both training and inference.
- Karpathy has an excellent Tokenizer tutorial.
Evals
Evaluate each step! Eval should be part of the product specification, along with objectives.
Evaluation must serve a clear purpose and have a clear metric, even for (seemingly) less rigid tasks such as summarization. In a summarization task, ask the question what impact do we want the summary to create? If it is action items, then use them as the evaluation metric.
Product metrics and eval metrics are related but not necessarily equal.
Eval frameworks/libraries are helpful, but in the end real customer evaluations are what really matter.
Eval/assertions:
- Code-based deterministic unit tests (e.g., pytest)
- Human judge
- LLM as a judge -> side-by-side, multiple judges, periodic random checks for human-judge alignment (e.g., model response, model critique, model decision, human critique, human decision, human revised response)
Break into different scenarios.
Use LLM to generate synthetic test data for eval.
Log testing results and analyze / visualize them.
Model comparison: A/B testing: randomly select models to serve and measures human ratings
Hamel’s blog on evals for fine-tuning and his step-by-step workflow example.
- Run the result through an existing model; e.g., if it is a picture, we can run it through GPT-4o.
- L1 eval (assertions that remove invalid data samples). For example, if we are generating a coding language, we can check the validity of the code syntax (see the sketch after this list).
- synthetic sample data generation (e.g., with a highly capable model)
- data preprocessing
- training
- inference sanity check
- L2 eval (remove bad samples)
- iterative curation to improve the dataset quality
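A minimal sketch of an L1 code-validity assertion, assuming the generated samples are Python snippets (the sample data is made up):
import ast

def is_valid_python(sample: str) -> bool:
    # L1 assertion: keep only samples that parse as valid Python
    try:
        ast.parse(sample)
        return True
    except SyntaxError:
        return False

samples = ["def add(a, b):\n    return a + b", "def broken(:"]
clean = [s for s in samples if is_valid_python(s)]  # drops the invalid sample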
Inspect_ai
Allaire’s talk Inspect, An OSS framework for LLM evals, and slides.
$ git clone https://github.com/UKGovernmentBEIS/inspect_ai.git
$ cd inspect_ai
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e ".[dev]"
We need to set the eval model as an environment variable: export INSPECT_EVAL_MODEL="openai/gpt-4o"
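A hedged sketch of a tiny Inspect task (the sample content is made up, and some parameter names differ between inspect_ai versions):
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def capital_city():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        plan=[generate()],  # newer versions call this parameter "solver"
        scorer=includes(),
    )
# Run it with e.g.: inspect eval capital_city.py --model openai/gpt-4o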
RAG is search based generation (SBG)
Evaluating RAG
Use real queries and check the search results. Does it retrieve the desired documents? Get this right before worrying about specific RAG techniques such as chunking, re-ranking and different types of indexing.
Frameworks: semantic or lexical retrieval scores are not necessarily calibrated. Don't be overconfident about the scores.
Agent
Planning: state-machine type of planning may be treated as a classifier for next step’s choices. Also evaluate quality of prompts generated during each stage if applicable.
Tasks: structured outputs
Don’t put too much context/everything to the final stage agent workflow. Use minimum context needed.
Step-by-step agent evaluation, e.g. for a meeting notes summarizer:
- Extract key decisions, action items and owners, and verify (a classification problem with precision/recall; see the sketch after this list)
- Check factual consistency (classification)
- Rewrite into bullet-point summaries (writing, info density)
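A minimal sketch of scoring the extraction step as a classification problem, assuming gold-labeled action items exist for each meeting (the data here is made up):
gold = {"send budget to finance", "schedule follow-up with vendor"}
predicted = {"send budget to finance", "book conference room"}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.50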
Production workflow
Expose endpoints at the production workflow stages directly for the eval! (Avoid drifts on a replicated system that doesn’t keep everything in sync)
Logging, trace and debugging
Event trace: a sequence of events, commonly saved in JSONL format. We may create some UI to visualize and process the trace logs (e.g., Shiny, Streamlit and Gradio), but could also use existing tools:
- Commercial tools: Langsmith, W&B Weave, BrainTrust, Pydantic LogFire
- OSS: Instruct, Open LLMetry
Run scheduled notebooks for eval tests!? Target evals that succeed 60-70% of the time.
Prompt engineering
Few shot, chain of thought, transcripts/play scripts
Flow engineering:
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
ChainForge > ChainForge is an open-source visual programming environment for prompt engineering, LLM evaluation and experimentation.
Datasets
GSM8K (Grade School Math 8K) contains 8.5K human-created grade school math problems (7.5k training problems and 1k test problems).
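For reference, a minimal sketch of loading GSM8K with the datasets library:
from datasets import load_dataset
gsm8k = load_dataset("gsm8k", "main")           # the "socratic" config also exists
print(len(gsm8k["train"]), len(gsm8k["test"]))  # 7473 train / 1319 test problems
print(gsm8k["train"][0]["question"])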
Structured LLM Output format
direct prompting
avoid manually copying and pasting code into separate files
Use the following prompt to avoid having to manually copy and paste LLM-generated code into separate files:
Please create a single code block containing cat << EOF statements that I can copy/paste to create all those files
(Credit: Jeremy Howard)
get clean markdown format
Use the following to get clean markdown output:
Put all your output in a markdown block
libraries
Dataset
Clever way of generating dataset annotation
Misc issues during LLM workflow
Bearer error
When working with LLM frameworks such as LlamaIndex along with Streamlit, we might encounter the following error:
LocalProtocolError: Illegal header value b'Bearer '
This is often seen when the OpenAI API key was not found. If the key is in the .env file, we can do:
import os
from dotenv import load_dotenv
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, "Please set the OPENAI_API_KEY environment variable"
Note that this only works on local environment. If a remote Github CI workflow is involved, we will need to use Github secrets in actions. We should also make sure that no local dependencies (e.g., test data) need to be accessed during the workflow unless they are made available.
OpenAI API
GPT building: Could not find a valid URL in servers
Sometimes when we add an Action to the Config GPT window and use the "import URL" tool to import e.g. "https://…url/openapi.json" (for instance https://chatweb3.up.railway.app/openapi.json), we may get this error. Examine the .json file; it is likely missing a line with the "server" url. We can compare it with the reference files generated by OpenAI's ActionGPT. After adding this server line, it should work.
Assistant API: Error on uploading files for retrieval
400 - {'error': {'message': 'Files with extensions [none] are not supported for retrieval.
See discussion
Migrate OpenAI API to v1
poetry update openai
# or
pip install --upgrade openai
openai migrate
from openai.error import InvalidRequestError
-> from openai import BadRequestError
https://stackoverflow.com/questions/77820916/openai-api-error-modulenotfounderror-no-module-named-openai-error
https://github.com/openai/openai-python/discussions/742
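After migrating, a minimal v1-style call looks like this (the model name is an example):
from openai import OpenAI, BadRequestError
client = OpenAI()  # reads OPENAI_API_KEY from the environment
try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
except BadRequestError as e:
    print(f"Bad request: {e}")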
Large Context Window LLMs
Resources
Hands-on Deep Learning and LLM fundamentals
The BEST & FREE courses:
- Neural Networks: Zero to Hero by Andrej Karpathy
- Practical Deep Learning for Coders Part 1 and Part 2 by Jeremy Howard
- Build a Large Language Model from scratch by Sebastian Raschka (Book)
Data processing tools
- URL to LLM-friendly input by Jina.ai
- Webpage to Markdown