When Alibaba released Qwen 2.5, specifically the 32B variant, the AI community was stunned by its performance: it rivaled models twice its size. But there was a catch. Running a 32-billion-parameter model in FP16 (16-bit precision, i.e. two bytes per weight) requires roughly 64GB of VRAM just for the weights. For context, an RTX 4090 only has 24GB. So how do you run it on an old GPU with just 6GB of VRAM?
The secret lies in quantization, specifically GGUF (GPT-Generated Unified Format). By compressing the model's weights from 16-bit down to 4-bit or even 3-bit precision, we can drastically shrink the memory footprint. A 4-bit Qwen 2.5 32B model takes up about 20GB on disk.
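To make the arithmetic concrete, here is a back-of-envelope sketch. The bits-per-weight figures are approximations: real GGUF quantization types mix precisions across tensors, so a "Q4" file averages slightly above 4 bits per weight.

```python
# Approximate memory footprint of a 32B-parameter model at different precisions.
# The bits-per-weight values are rough averages, not exact GGUF file sizes.
PARAMS = 32e9  # 32 billion weights

def footprint_gb(bits_per_weight: float) -> float:
    """Convert parameter count and precision into gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    print(f"{label:7s} ~{footprint_gb(bits):5.1f} GB")
# FP16 ~64 GB, Q8_0 ~34 GB, Q4_K_M ~19 GB, Q3_K_M ~16 GB
```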
Even at 20GB, the model won't fit into a 6GB GPU. This is where tools like LM Studio or Ollama (both running llama.cpp under the hood) come into play. They allow you to split the model across devices: you load as many transformer layers as will fit into your 6GB of fast GPU VRAM and offload the remaining layers to slower system RAM, where the CPU processes them.
The Setup:
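As a minimal sketch of what that split looks like in code, here is partial offloading with llama-cpp-python (the same llama.cpp engine behind LM Studio and Ollama). The model filename and the n_gpu_layers value are illustrative; in practice you lower n_gpu_layers until the model loads without exhausting your 6GB of VRAM.

```python
# pip install llama-cpp-python (built with GPU support for offloading)
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=14,  # layers kept in the 6GB of GPU VRAM; the rest stay in system RAM
    n_ctx=4096,       # context window; a larger context enlarges the KV cache
)

out = llm("Write a Python function that reverses a linked list.", max_tokens=256)
print(out["choices"][0]["text"])
```

LM Studio exposes the same knob as a GPU-offload slider in its UI, and Ollama typically picks a layer split automatically based on the VRAM it detects.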
Running a model partially on system RAM means your token generation speed drops sharply. Instead of the 50+ tokens per second you would see with the model fully in VRAM, you might get 3 to 5 tokens per second, roughly the pace of a fast typist. That's not ideal for real-time applications, but it is more than enough for offline coding assistance, deep research, and local data analysis without sending sensitive information to the cloud.
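If you want to see the trade-off on your own hardware, a rough (illustrative) measurement with the same llama-cpp-python setup might look like this:

```python
# Rough tokens-per-second measurement; model path, layer count, and prompt are illustrative.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-32b-instruct-q4_k_m.gguf", n_gpu_layers=14, n_ctx=4096)

start = time.perf_counter()
out = llm("Summarize the trade-offs of CPU offloading in one paragraph.", max_tokens=200)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```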
"The fact that a 6GB GPU from 2016 can run a model capable of passing the bar exam is a testament to the open-source AI engineering community."