llama-b7885-bin-ubuntu-x64.tar.gz — basic / most compatible
llama-b7885-bin-ubuntu-vulkan-x64.tar.gz — if you want Vulkan GPU support
mkdir -p ~/llama.cpp && cd ~/llama.cpp
wget https://github.com/ggml-org/llama.cpp/releases/download/b7885/llama-b7885-bin-ubuntu-x64.tar.gz
tar -xzf llama-b7885-bin-ubuntu-x64.tar.gz
./llama-cli --version
sudo apt update
sudo apt install -y git build-essential cmake ninja-build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# CMake is the supported build method in recent llama.cpp
# (the old Makefile build has been deprecated):
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
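A quick sanity check after building (the binary path assumes the default CMake layout, where executables land in build/bin/):

```shell
# Confirm the build produced working binaries
./build/bin/llama-cli --version
./build/bin/llama-server --help | head -n 5
```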
Install CUDA toolkit first (adjust version as needed):
# NVIDIA driver if needed
sudo ubuntu-drivers autoinstall
# CUDA repo setup
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-6
nvcc --version
Then build llama.cpp:
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=all-major
cmake --build build --config Release -j$(nproc)
Run with -ngl 35 or higher to offload layers to the GPU. Note that CMake places the binaries under build/bin/.
Download a GGUF model (examples from Hugging Face):
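For example, a quantized model can be fetched directly from a Hugging Face repo with wget (the repo and filename below are illustrative, matching the model used in the next command; verify they exist on huggingface.co before downloading):

```shell
mkdir -p models
# Example: a community Q5_K_M quantization of Llama 3.1 8B Instruct
# (repo/filename are assumptions — check the repo page first)
wget -P models \
  https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
```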
Interactive example:
./llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
--color --temp 0.7 --repeat-penalty 1.1 -c 8192 -n -1 -ngl 35 \
-p "You are a helpful AI assistant."
API server mode:
./llama-server -m models/....gguf --host 0.0.0.0 --port 8080 -ngl 40 -c 32768
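Once the server is up, it exposes an OpenAI-compatible HTTP API. A minimal request with curl (host/port match the command above; adjust if you changed them):

```shell
# Query llama-server's OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant."},
          {"role": "user", "content": "Hello!"}
        ],
        "temperature": 0.7
      }'
```

The same endpoint works with any OpenAI-compatible client library by pointing its base URL at http://localhost:8080/v1.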