Ubuntu GPU Server Setup
Install CUDA
sudo ubuntu-drivers devices # Check driver information
lspci | grep -i nvidia # Confirm the NVIDIA GPU is visible on the PCI bus
sudo lshw -C display # Show detailed display hardware information
nvcc -V # Verify the CUDA compiler version
watch -n 1 nvidia-smi # Monitor GPU status every second
Install Torch
Upgrade pip first
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio
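A quick sanity check that the installed torch actually sees the GPU (run inside the same environment):

import torch

print(torch.__version__)                  # Installed torch build
print(torch.cuda.is_available())          # Should print True on a CUDA machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX model name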
If using Poetry instead of pip, try poetry env use python3.11 in case we encounter torch installation issues. Also try pip install torch== to see which versions are available to install (pip lists them in the error message). Sometimes Poetry has issues with torch-related installation on Mac, and sometimes the order of installation matters. As of the latest test, poetry add sentence-transformers does work, which will install torch, transformers, and sentence-transformers.
Install Unsloth
We will use Unsloth to fine-tune Llama 3 on local GPUs.
Create the environment with torch and huggingface
python3.11 -m venv venv
source venv/bin/activate # Activate the venv so the pip installs below go into it
pip install torch torchvision torchaudio
pip install ipython ipywidgets huggingface_hub
Install unsloth. Check the installation page for exact commands. E.g.:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
Check the installation:
nvcc -V # CUDA compiler is visible
python -m xformers.info # xformers build and torch/CUDA compatibility info
python -m bitsandbytes # bitsandbytes self-diagnostic
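Once these checks pass, a minimal fine-tuning sketch might look like the following. This is a sketch under assumptions, not Unsloth's official recipe: the checkpoint name unsloth/llama-3-8b-bnb-4bit, the yahma/alpaca-cleaned dataset, and the prompt format are all illustrative, and the exact SFTTrainer arguments vary across trl versions.

from unsloth import FastLanguageModel  # import unsloth before transformers/trl
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load a 4-bit quantized Llama 3 base model (checkpoint name is an assumption)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# Example instruction dataset; flatten each record into a single "text" field
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    # Simple illustrative prompt format, not the official Llama 3 template
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()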
Install Wandb
wandb can be used to track training progress:
pip install wandb
wandb login
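With the login done, tracking metrics is a few lines of Python; a minimal sketch (the project name llama3-finetune is just an example):

import wandb

run = wandb.init(project="llama3-finetune")  # example project name
for step in range(3):
    wandb.log({"loss": 1.0 / (step + 1)})    # log any scalar metrics per step
run.finish()

The Hugging Face Trainer will also log to wandb automatically when report_to="wandb" is set in TrainingArguments.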
Download gated models from Huggingface
In order to download gated models (e.g. Llama) from huggingface, we can:
- Set the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxxxx
echo $HF_TOKEN
- Or assign it directly in Python:
os.environ["HF_TOKEN"] = "hf_xxxxx"
Then we can use it (with model_id pointing at the gated repo):
import os
from peft import AutoPeftModelForCausalLM
token = os.getenv("HF_TOKEN")
model = AutoPeftModelForCausalLM.from_pretrained(model_id, use_auth_token=token)
We can also use huggingface-cli login at the terminal.
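Programmatically, the login helper from huggingface_hub works as well; a small sketch assuming HF_TOKEN has already been exported:

import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # registers the token for this Python session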
Deal with ModuleNotFoundError: No module named 'packaging' when installing flash-attn
Do:
pip3 install --upgrade pip setuptools wheel
pip3 install packaging
pip3 install flash-attn
The error will change to ModuleNotFoundError: No module named 'torch'. Now go to the pytorch website to install torch!
For Linux with CUDA or Mac without CUDA:
pip3 install torch torchvision torchaudio
After that, if it complains:
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root
We need to install the CUDA toolkit.
Here is a related thread, but it doesn't seem to solve the problem in our case (and our instructions above worked).
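Since flash-attn builds against the CUDA installation that torch detects, a quick diagnostic is to print what torch resolved:

import os
from torch.utils.cpp_extension import CUDA_HOME  # None if torch found no CUDA toolkit

print("CUDA_HOME env var:", os.environ.get("CUDA_HOME"))
print("CUDA_HOME detected by torch:", CUDA_HOME)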
Training from a remote server
Monitoring
CPU
sudo apt install htop # For Debian/Ubuntu
sudo apt install glances # For Debian/Ubuntu
GPU
watch -n 1 nvidia-smi
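GPU state can also be checked from inside a training script via torch; a minimal sketch:

import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))                              # GPU model
    print(f"{torch.cuda.memory_allocated(0) / 1e9:.2f} GB allocated")
    print(f"{torch.cuda.memory_reserved(0) / 1e9:.2f} GB reserved")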
fg and bg
- To start a job in the background, append & to the command.
- To put an already running job into the background: press Ctrl + Z to suspend it, then run bg to resume it in the background.
Job persistence
The following starts script.py in the background, makes it ignore the hangup signal, and redirects its output to output.log:
nohup python script.py > output.log &
- Persistence: Processes run with nohup will not terminate when the user logs out or when the terminal is closed. They ignore the HUP (hangup) signal.
- Output Management: By default, nohup redirects the standard output and standard error to a file called nohup.out if no output file is explicitly specified. This helps in saving the output for later review.
Get back the jobs
If we have not logged out, use jobs to list the background jobs and fg %job_number to bring one back to the foreground.
If we already logged out and are running a nohup job, use ps aux | grep cmd_name to find it.
Note that we can also use tmux to keep the jobs running in sessions.
SSH forwarding
In ~/.ssh/config:
Host server
HostName host_ip or dns_name
Port xxx
User xxx
IdentityFile ~/.ssh/xxxx
IdentitiesOnly yes
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
LocalForward 6006 localhost:6006 # port forwarding for Tensorboard
ServerAliveInterval 120
Then run ssh server.
Sometimes prior sessions might not have released the local ports properly, and ssh reports errors like:
bind [127.0.0.1]:xxxx: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: xxxx
We can check for the port usage:
lsof -i :xxxx
and stop the process:
kill -9 PID
Sometimes we might need to restart ssh
sudo service ssh restart
Tensorboard
First make sure we installed Tensorboard!
pip install tensorboard
By default, our run will create a runs directory under the output_dir. We can then go to the output_dir and start tensorboard on the server side:
tensorboard --logdir=runs --bind_all
Assuming Tensorboard port forwarding has been set up as above, we can visit http://localhost:6006/
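For reference, the runs directory holds event files such as those written by torch's SummaryWriter; a minimal sketch that produces data Tensorboard can display (the run name exp1 is arbitrary):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/exp1")  # creates runs/exp1 relative to the cwd
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), global_step=step)
writer.close()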
Set up Jupyter lab server for remote access
Server side:
jupyter lab --no-browser --ip=0.0.0.0 --port=8888 &
Client side:
ssh -L 8888:localhost:8888 your_username@remote_server_address
To make SSH forwarding permanent, edit the ~/.ssh/config file.
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
Then connect to http://localhost:8888
When asked for token, check out the server log to look for a line like: http://0.0.0.0:8888/?token=some_long_token_string
To copy files from the server:
scp 'user@server:/home/user/files/*.txt' .
Model playground
E.g., https://replicate.com/meta/meta-llama-3-8b