Ubuntu GPU Server Setup
Install CUDA
sudo ubuntu-drivers devices # Check driver information
lspci | grep -i nvidia # Confirm the NVIDIA GPU is visible on the PCI bus
sudo lshw -C display # Show detailed display hardware information
nvcc -V # Verify the CUDA compiler version
watch -n 1 nvidia-smi # Monitor GPU status every second
Install Torch
Upgrade pip first
pip3 install --upgrade pip setuptools wheel
pip3 install torch torchvision torchaudio
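A quick sanity check that the installed torch actually sees the GPU (run inside the same environment):

import torch

print(torch.__version__)                  # Installed torch build
print(torch.cuda.is_available())          # Should print True on a CUDA machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX model name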
If using Poetry instead of pip, try poetry env use python3.11 in case we encounter torch installation issues. Also try pip install torch== to see which versions are available to install (pip lists them in the error message). Sometimes Poetry has issues with torch-related installation on Mac, and sometimes the order of installation matters. As of the latest test, poetry add sentence-transformers does work, which will install torch, transformers, and sentence-transformers.
Install Unsloth
We will use Unsloth to fine-tune Llama 3 on local GPUs.
Create the environment with torch and huggingface
python3.11 -m venv venv
source venv/bin/activate # Activate the venv so the pip installs below go into it
pip install torch torchvision torchaudio
pip install ipython ipywidgets huggingface_hub
Install unsloth. Check the installation page for exact commands. E.g.:
# RTX 3090, 4090 Ampere GPUs:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
Check the installation:
nvcc -V # CUDA compiler is visible
python -m xformers.info # xformers build and torch/CUDA compatibility info
python -m bitsandbytes # bitsandbytes self-diagnostic
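Once these checks pass, a minimal fine-tuning sketch might look like the following. This is a sketch under assumptions, not Unsloth's official recipe: the checkpoint name unsloth/llama-3-8b-bnb-4bit, the yahma/alpaca-cleaned dataset, and the prompt format are all illustrative, and the exact SFTTrainer arguments vary across trl versions.

from unsloth import FastLanguageModel  # import unsloth before transformers/trl
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Load a 4-bit quantized Llama 3 base model (checkpoint name is an assumption)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)

# Example instruction dataset; flatten each record into a single "text" field
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

def to_text(example):
    # Simple illustrative prompt format, not the official Llama 3 template
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()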
Install Wandb
wandb can be used to track training progress:
pip install wandb
wandb login
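With the login done, tracking metrics is a few lines of Python; a minimal sketch (the project name llama3-finetune is just an example):

import wandb

run = wandb.init(project="llama3-finetune")  # example project name
for step in range(3):
    wandb.log({"loss": 1.0 / (step + 1)})    # log any scalar metrics per step
run.finish()

The Hugging Face Trainer will also log to wandb automatically when report_to="wandb" is set in TrainingArguments.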
Download gated models from Huggingface
In order to download gated models (e.g. Llama) from huggingface, we can:
- Set the HF_TOKEN environment variable:
export HF_TOKEN=hf_xxxxx
echo $HF_TOKEN
- Or assign it directly in Python:
os.environ["HF_TOKEN"] = "hf_xxxxx"
Then we can use it (with model_id pointing at the gated repo):
import os
from peft import AutoPeftModelForCausalLM
token = os.getenv("HF_TOKEN")
model = AutoPeftModelForCausalLM.from_pretrained(model_id, use_auth_token=token)
We can also use huggingface-cli login at the terminal.
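Programmatically, the login helper from huggingface_hub works as well; a small sketch assuming HF_TOKEN has already been exported:

import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # registers the token for this Python session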
Deal with ModuleNotFoundError: No module named 'packaging' when installing flash-attn
Do:
pip3 install --upgrade pip setuptools wheel
pip3 install packaging
pip3 install flash-attn
The error will change to ModuleNotFoundError: No module named 'torch'. Now go to the pytorch website to install torch!
For Linux with CUDA or Mac without CUDA:
pip3 install torch torchvision torchaudio
After that, if it complains:
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root
We need to install the CUDA toolkit.
Here is a related thread, but it doesn't seem to solve the problem in our case (and our instructions above worked).
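Since flash-attn builds against the CUDA installation that torch detects, a quick diagnostic is to print what torch resolved:

import os
from torch.utils.cpp_extension import CUDA_HOME  # None if torch found no CUDA toolkit

print("CUDA_HOME env var:", os.environ.get("CUDA_HOME"))
print("CUDA_HOME detected by torch:", CUDA_HOME)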
Training from a remote server
Monitoring
CPU
sudo apt install htop # For Debian/Ubuntu
sudo apt install glances # For Debian/Ubuntu
GPU
watch -n 1 nvidia-smi
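GPU state can also be checked from inside a training script via torch; a minimal sketch:

import torch

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))                              # GPU model
    print(f"{torch.cuda.memory_allocated(0) / 1e9:.2f} GB allocated")
    print(f"{torch.cuda.memory_reserved(0) / 1e9:.2f} GB reserved")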
fg and bg
- To start a job in the background, append & to the command.
- To put an already running job into the background: press Ctrl + Z to suspend it, then run bg to resume it in the background.
Job persistence
The following starts script.py in the background, makes it ignore the hangup signal, and redirects its output to output.log:
nohup python script.py > output.log &
- Persistence: Processes run with nohup will not terminate when the user logs out or when the terminal is closed. They ignore the HUP (hangup) signal.
- Output Management: By default, nohup redirects the standard output and standard error to a file called nohup.out if no output file is explicitly specified. This helps in saving the output for later review.
Get back the jobs
If we have not logged out, use jobs to list the background jobs and fg %job_number to bring one back to the foreground.
If we already logged out and are running a nohup job, use ps aux | grep cmd_name to find it.
Note that we can also use tmux to keep the jobs running in sessions.
SSH forwarding
In ~/.ssh/config:
Host server
HostName host_ip or dns_name
Port xxx
User xxx
IdentityFile ~/.ssh/xxxx
IdentitiesOnly yes
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
LocalForward 6006 localhost:6006 # port forwarding for Tensorboard
ServerAliveInterval 120
Then run ssh server.
Sometimes prior sessions might not have released the local ports properly, and ssh reports errors like:
bind [127.0.0.1]:xxxx: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: xxxx
We can check for the port usage:
lsof -i :xxxx
and stop the process:
kill -9 PID
Sometimes we might need to restart ssh
sudo service ssh restart
Tensorboard
First make sure we installed Tensorboard!
pip install tensorboard
By default, our run will create a runs directory under the output_dir. We can then go to the output_dir and start tensorboard on the server side:
tensorboard --logdir=runs --bind_all
Assuming Tensorboard port forwarding has been set up as above, we can visit http://localhost:6006/
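For reference, the runs directory holds event files such as those written by torch's SummaryWriter; a minimal sketch that produces data Tensorboard can display (the run name exp1 is arbitrary):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/exp1")  # creates runs/exp1 relative to the cwd
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), global_step=step)
writer.close()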
Set up Jupyter lab server for remote access
Server side:
jupyter lab --no-browser --ip=0.0.0.0 --port=8888 &
Client side:
ssh -L 8888:localhost:8888 your_username@remote_server_address
To make SSH forwarding permanent, edit the ~/.ssh/config file.
LocalForward 8888 localhost:8888 # port forwarding for Jupyter Server
Then connect to http://localhost:8888
When asked for token, check out the server log to look for a line like: http://0.0.0.0:8888/?token=some_long_token_string
To copy files from the server:
scp 'user@server:/home/user/files/*.txt' .
Model playground
E.g., https://replicate.com/meta/meta-llama-3-8b