Ubuntu GPU Server Setup
Install CUDA
sudo ubuntu-drivers devices # Check driver information
lspci | grep -i nvidia # Check for the NVIDIA GPU
sudo lshw -C display # Check display hardware details
nvcc -V # Check the CUDA compiler version
watch -n 1 nvidia-smi # Monitor GPU utilization
Install Torch
Upgrade pip first
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio
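After installation, a quick sanity check (a minimal sketch; output depends on your setup) confirms that PyTorch can see the GPU:
import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if CUDA is usable
print(torch.cuda.device_count())  # number of visible GPUs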
If using Poetry instead of pip, try poetry env use python3.11 in case we encounter torch installation issues. Also try pip install to see which versions are available to install.
Sometimes we encounter torch-related installation issues with Poetry on Mac, and sometimes the order of installation matters. In the latest test, poetry add sentence-transformers does work, which will install torch, transformers and sentence-transformers.
Install Unsloth
We will use Unsloth to fine tune Llama 3 on local GPUs.
Create the environment with torch and huggingface
python3.11 -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio
pip install ipython ipywidgets huggingface_hub
Install unsloth. Check the installation page for exact commands, e.g.:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
Check the installation
nvcc -V
python -m xformers.info
python -m bitsandbytes
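As a quick smoke test, here is a minimal sketch that loads a 4-bit model with Unsloth (the model name and max_seq_length are illustrative choices, not from the original notes):
from unsloth import FastLanguageModel
# Load an example 4-bit quantized Llama 3 base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)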
Install Wandb
wandb can be used to track progress:
pip install wandb
wandb login
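A minimal logging sketch (the project and metric names are hypothetical):
import wandb
wandb.init(project="llama3-finetune")         # hypothetical project name
wandb.log({"train/loss": 0.42, "step": 100})  # log metrics during training
wandb.finish()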
Download gated models from Huggingface
In order to download gated models (e.g. Llama) from Huggingface, we can:
- Set the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxxxx
echo $HF_TOKEN
- Or assign it directly in Python:
import os
os.environ["HF_TOKEN"] = "hf_xxxxx"
Then we can use it:
token = os.getenv("HF_TOKEN")
model = AutoPeftModelForCausalLM.from_pretrained(model_id, use_auth_token=True)
We can also use huggingface-cli login at the terminal.
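A hedged sketch of downloading a gated repo with huggingface_hub, assuming HF_TOKEN is set (the model id is an example gated repo):
import os
from huggingface_hub import snapshot_download
# Downloads the gated repo using the token from the environment
snapshot_download("meta-llama/Meta-Llama-3-8B", token=os.environ["HF_TOKEN"])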
Deal with ModuleNotFoundError: No module named 'packaging' when installing flash-attn
Do:
pip3 install --upgrade pip setuptools wheel
pip3 install packaging
pip3 install flash-attn
The error will change to ModuleNotFoundError: No module named 'torch'. Now go to the PyTorch website to install torch!
For Linux with CUDA or Mac without CUDA:
pip3 install torch torchvision torchaudio
After that, if it complains:
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root
we need to install the CUDA toolkit. Here is a related thread, but it doesn't seem to solve the problem in our case (and the instructions above worked).
Training from a remote server
monitoring
CPU
sudo apt install htop # For Debian/Ubuntu
sudo apt install glances # For Debian/Ubuntu
GPU
watch -n 1 nvidia-smi
fg and bg
- To start a job in the background, append & to the command.
- To put an already running job into the background: press Ctrl + Z to suspend it, then run bg to resume it in the background.
job persistence
Starts script.py in the background, makes it ignore the hangup signal, and redirects its output to output.log.
nohup python script.py > output.log &
- Persistence: Processes run with nohup will not terminate when the user logs out or when the terminal is closed. They ignore the HUP (hangup) signal.
- Output Management: By default, nohup redirects the standard output and standard error to a file called nohup.out if no output file is explicitly specified. This helps in saving the output for later review.
Get back the jobs
If we have not logged out, use jobs to list the background jobs and fg %job_number to bring one back.
If we already logged out and are running a nohup job, use ps aux | grep cmd_name to find it.
Note that we can also use tmux to keep jobs running in sessions.
SSH forwarding
In ~/.ssh/config
:
Host server
HostName host_ip or dns_name
Port xxx
User xxx
IdentityFile ~/.ssh/xxxx
IdentitiesOnly yes
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
LocalForward 6006 localhost:6006 # port forwarding for Tensorboard
ServerAliveInterval 120
Then ssh server.
Sometimes the prior sessions might not have released the local ports properly:
bind [127.0.0.1]:xxxx: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: xxxx
We can check for the port usage:
lsof -i :xxxx
and stop the process
kill -9 PID
Sometimes we might need to restart ssh
sudo service ssh restart
Tensorboard
First make sure we installed Tensorboard!
pip install tensorboard
By default, our run will create a runs directory under the output_dir. We can then go to the output_dir and start TensorBoard on the server side:
tensorboard --logdir=runs --bind_all
Assuming TensorBoard port forwarding has been set up as above, we can visit http://localhost:6006/
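For reference, a minimal sketch of how a Huggingface Trainer run writes TensorBoard logs under output_dir (the values are illustrative):
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="outputs",        # the runs directory is created under here
    logging_dir="outputs/runs",  # where the TensorBoard event files go
    report_to="tensorboard",
    logging_steps=10,
)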
Set up Jupyter lab server for remote access
Server side:
jupyter lab --no-browser --ip=0.0.0.0 --port=8888 &
Client side
ssh -L 8888:localhost:8888 your_username@remote_server_address
To make SSH forwarding permanent, edit the ~/.ssh/config file.
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
Then connect to http://localhost:8888
When asked for token, check out the server log to look for a line like: http://0.0.0.0:8888/?token=some_long_token_string
To copy files from the server:
scp 'user@server:/home/user/files/*.txt' .
Model playground
E.g.
https://replicate.com/meta/meta-llama-3-8b
Techniques on Fine-tuning LLMs
Different ways for fine-tuning
- Native PyTorch: example: fine-tune Llama 2 with Alpaca
- PyTorch Lightning: example: fine-tune PaliGemma for images
- Transformers' Trainer: example: FIM fine-tune StarCoder and Codellama for a code co-pilot
- trl's SFTTrainer: example: fine-tune Llama 2 with Alpaca; example: chat fine-tune for a code copilot (see the sketch after this list)
- axolotl: example: fine-tune TinyLlama
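To illustrate the trl route, here is a minimal SFTTrainer sketch (the model and dataset names are examples, and some argument names differ between trl versions):
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("tatsu-lab/alpaca", split="train")  # example Alpaca dataset
trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # example base model
    train_dataset=dataset,
    dataset_text_field="text",         # column holding the formatted prompt
    max_seq_length=2048,
    args=TrainingArguments(output_dir="outputs", per_device_train_batch_size=2),
)
trainer.train()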
memory optimization for model training
Memory constraints: increasing the batch size increases training throughput (samples/sec) and reduces training time. To fit more samples into memory, we can use memory optimization techniques. However, memory optimization techniques themselves slow down the training process. There is a sweet spot between batch size and memory optimization techniques; finding it is part of hyper-parameter tuning.
Method/tool | Improves training speed | Optimizes memory utilization | Note |
---|---|---|---|
Batch size choice | Yes | Yes | |
Gradient accumulation | No | Yes | |
Gradient checkpointing | No (-20%) | Yes | |
Mixed precision training | Yes | (No) | |
Optimizer choice | Yes | Yes | |
Data preloading | Yes | No | |
DeepSpeed Zero | No | Yes | |
torch.compile | Yes | No | |
Parameter-Efficient Fine Tuning (PEFT) | No | Yes | |
Flash Attention 2 | Yes | ? | |
Source: https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
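A hedged sketch of combining several of these techniques through Huggingface TrainingArguments (the values are illustrative, not recommendations):
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,  # batch size choice
    gradient_accumulation_steps=8,  # gradient accumulation
    gradient_checkpointing=True,    # saves memory, roughly -20% speed
    bf16=True,                      # mixed precision training
    optim="adamw_bnb_8bit",         # memory-efficient optimizer choice
)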
Parameter tuning
Sequence length
- Longer sequences increase the attention cost.
- With new mechanisms such as flash attention, the attention cost is relatively small, so we can afford longer sequences. For really big networks, the cost of attention is completely drowned out by the cost of the rest of the network.
- For a 30B model there is no penalty going from sequence length 2048 to 4096, and only a ~15% penalty at 8192; for a 7B model the penalty escalates quickly.
- The data needs to justify using a longer sequence length:
- Many standardized datasets do not have many long sequences.
- Will the model be able to take advantage of long-sequence training?
Positional encodings determine which token is in which position.
- Llama uses RoPE; another choice could be ALiBi.
Tokenizers
Tokenizers such as BPE and SentencePiece convert text into numerical representations. In certain scenarios, it can be beneficial to train a custom tokenizer and establish a unique vocabulary. For instance, tokens in programming languages can differ significantly from those in natural languages, and tokens in foreign languages may vary greatly from those in English.
The size of the vocabulary can influence the efficiency of the model. A larger vocabulary may decrease computational efficiency due to the increased complexity. However, it can also enhance token efficiency, as a larger vocabulary can represent a piece of text with fewer tokens. Therefore, some of the computational efficiency can be regained. The optimal vocabulary size can also depend on the specific domain. A standard size often used is around 50,000, but sizes ranging from 25,000 to 100,000 have also been successfully utilized in production environments.
However, training a custom tokenizer is a complex task and it’s strongly recommended to compare the results with those from existing generic tokenizers. When we modify our tokenizer or even just adjust the vocabulary size, the loss values between two models may not be directly comparable.
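A minimal sketch of training a custom BPE tokenizer with the Huggingface tokenizers library (the corpus file, vocabulary size and special tokens are illustrative):
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=50000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["domain_corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("custom_tokenizer.json")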
References:
- Weights & Biases course on LLM Fine-tuning Techniques with Jonathan Frankle of MosaicML
- Huggingface page on Performance and Scalability covers the latest techniques for both training and inference.
- Karpathy has an excellent Tokenizer tutorial.
Evals
Evaluate each step! Eval should be part of the product specification, along with objectives.
Evaluation must serve a clear purpose and have a clear metric, even for (seemingly) less rigid tasks such as summarization. In a summarization task, ask the question what impact do we want the summary to create? If it is action items, then use them as the evaluation metric.
Product metrics and eval metrics are related but not necessarily equal.
Eval frameworks/libraries are helpful, but in the end real customer evaluations are what really matter.
Eval/assertions:
- Code-based deterministic unit tests (e.g., pytest)
- Human judge
- LLM as a judge -> side-by-side, multiple judges, periodic random checks for human-judge alignment (e.g., model response, model critique, model decision, human critique, human decision, human revised response)
Break into different scenarios.
Use LLM to generate synthetic test data for eval.
Log testing results and analyze / visualize them.
Model comparison: A/B testing: randomly select models to serve and measures human ratings
Hamel’s blog on evals for fine-tuning and his step-by-step workflow example.
- Run the result through an existing model; e.g., if it is a picture, we can run it through GPT-4o.
- L1 eval (assertions that remove invalid data samples). For example, if we are generating a coding language, we can check the validity of the code syntax (see the sketch after this list).
- synthetic sample data generation (e.g., with a highly capable model)
- data preprocessing
- training
- inference sanity check
- L2 eval (remove bad samples)
- iterative curation to improve the dataset quality
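A minimal sketch of an L1 code-validity assertion, assuming the generated samples are Python snippets (the sample data is made up):
import ast

def is_valid_python(sample: str) -> bool:
    # L1 assertion: keep only samples that parse as valid Python
    try:
        ast.parse(sample)
        return True
    except SyntaxError:
        return False

samples = ["def add(a, b):\n    return a + b", "def broken(:"]
clean = [s for s in samples if is_valid_python(s)]  # drops the invalid sample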
Inspect_ai
Allaire’s talk Inspect, An OSS framework for LLM evals, and slides.
$ git clone https://github.com/UKGovernmentBEIS/inspect_ai.git
$ cd inspect_ai
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e ".[dev]"
We need to set the eval model as an environment variable: export INSPECT_EVAL_MODEL="openai/gpt-4o"
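A hedged sketch of a tiny Inspect task (the sample content is made up, and some parameter names differ between inspect_ai versions):
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def capital_city():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        plan=[generate()],  # newer versions call this parameter "solver"
        scorer=includes(),
    )
# Run it with e.g.: inspect eval capital_city.py --model openai/gpt-4o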
RAG is search based generation (SBG)
Evaluating RAG
Use real queries and check the search results. Does it retrieve the desired documents? Get this right before worrying about specific RAG techniques such as chunking, re-ranking and different types of indexing.
Frameworks: semantic or lexical retrieval scores are not necessarily calibrated. Don't be overconfident about the scores.
Agent
Planning: state-machine type of planning may be treated as a classifier for next step’s choices. Also evaluate quality of prompts generated during each stage if applicable.
Tasks: structured outputs
Don’t put too much context/everything to the final stage agent workflow. Use minimum context needed.
Step-by-step agent evaluation, e.g. for a meeting notes summarizer:
- Extract key decisions, action items and owners, and verify (a classification problem with precision/recall; see the sketch after this list)
- Check factual consistency (classification)
- Rewrite into bullet-point summaries (writing, info density)
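A minimal sketch of scoring the extraction step as a classification problem, assuming gold-labeled action items exist for each meeting (the data here is made up):
gold = {"send budget to finance", "schedule follow-up with vendor"}
predicted = {"send budget to finance", "book conference room"}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.50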
Production workflow
Expose endpoints at the production workflow stages directly for the eval! (Avoid drifts on a replicated system that doesn’t keep everything in sync)
Logging, trace and debugging
Event trace: a sequence of events, commonly saved in JSONL format. We may create some UI to visualize and process the trace logs (e.g., Shiny, Streamlit and Gradio), but could also use existing tools:
- Commercial tools: Langsmith, W&B Weave, BrainTrust, Pydantic LogFire
- OSS: Instruct, Open LLMetry
Run scheduled notebooks for eval tests!? Target evals that succeed 60-70% of the time.
Prompt engineering
Few shot, chain of thought, transcripts/play scripts
Flow engineering:
Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering
ChainForge > ChainForge is an open-source visual programming environment for prompt engineering, LLM evaluation and experimentation.
Datasets
GSM8K (Grade School Math 8K) contains 8.5K human-created grade school math problems (7.5k training problems and 1k test problems).
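For reference, a minimal sketch of loading GSM8K with the datasets library:
from datasets import load_dataset
gsm8k = load_dataset("gsm8k", "main")           # the "socratic" config also exists
print(len(gsm8k["train"]), len(gsm8k["test"]))  # 7473 train / 1319 test problems
print(gsm8k["train"][0]["question"])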
Structured LLM Output format
direct prompting
avoid manually copying and pasting code into separate files
Use the following prompt to avoid having to manually copy and paste LLM-generated code into separate files:
Please create a single code block containing cat << EOF statements that I can copy/paste to create all those files
(Credit: Jeremy Howard)
get clean markdown format
Use the following to get clean markdown output:
Put all your output in a markdown block
libraries
Dataset
Clever way of generating dataset annotation
Misc issues during LLM workflow
Bearer error
When working with LLM frameworks such as LlamaIndex along with Streamlit, we might encounter the following error:
LocalProtocolError: Illegal header value b'Bearer '
This is often seen when the OpenAI API key was not found. If the key is in the .env file, we can do:
import os
from dotenv import load_dotenv
load_dotenv()
assert os.getenv("OPENAI_API_KEY") is not None, "Please set the OPENAI_API_KEY environment variable"
Note that this only works on local environment. If a remote Github CI workflow is involved, we will need to use Github secrets in actions. We should also make sure that no local dependencies (e.g., test data) need to be accessed during the workflow unless they are made available.
OpenAI API
GPT building: Could not find a valid URL in servers
Sometimes when we add an Action to the Config GPT window and use the "import URL" tool to import e.g. "https://…url/openapi.json" (for instance https://chatweb3.up.railway.app/openapi.json), we may get this error. Examine the .json file; it is likely missing a line with the "server" url. We can compare it with the reference files generated by OpenAI's ActionGPT. After adding this server line, it should work.
Assistant API: Error on uploading files for retrieval
400 - {'error': {'message': 'Files with extensions [none] are not supported for retrieval.
See discussion
Migrate OpenAI API to v1
poetry update openai
# or
pip install --upgrade openai
openai migrate
from openai.error import InvalidRequestError
-> from openai import BadRequestError
https://stackoverflow.com/questions/77820916/openai-api-error-modulenotfounderror-no-module-named-openai-error
https://github.com/openai/openai-python/discussions/742
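After migrating, a minimal v1-style call looks like this (the model name is an example):
from openai import OpenAI, BadRequestError
client = OpenAI()  # reads OPENAI_API_KEY from the environment
try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)
except BadRequestError as e:
    print(f"Bad request: {e}")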
Large Context Window LLMs
Resources
Hands-on Deep Learning and LLM fundamentals
The BEST & FREE courses:
- Neural Networks: Zero to Hero by Andrej Karpathy
- Practical Deep Learning for Coders Part 1 and Part 2 by Jeremy Howard
- Build a Large Language Model from scratch by Sebastian Raschka (Book)
Data processing tools
- URL to LLM-friendly input by Jina.ai
- Webpage to Markdown