When Alibaba released Qwen 2.5, specifically the 32B variant, the AI community was stunned by its performance: it rivaled models twice its size. But there was a catch. Running a 32-billion-parameter model in FP16 (16-bit precision, i.e. two bytes per weight) requires roughly 64GB of VRAM just for the weights. For context, an RTX 4090 only has 24GB. So how do you run it on an old GPU with just 6GB of VRAM?
The secret lies in quantization, specifically GGUF (GPT-Generated Unified Format). By compressing the model's weights from 16-bit down to 4-bit or even 3-bit precision, we can drastically shrink the memory footprint. A 4-bit Qwen 2.5 32B model takes up about 20GB on disk.
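To make the arithmetic concrete, here is a back-of-envelope sketch. The bits-per-weight figures are approximations: real GGUF quantization types mix precisions across tensors, so a "Q4" file averages slightly above 4 bits per weight.

```python
# Approximate memory footprint of a 32B-parameter model at different precisions.
# The bits-per-weight values are rough averages, not exact GGUF file sizes.
PARAMS = 32e9  # 32 billion weights

def footprint_gb(bits_per_weight: float) -> float:
    """Convert parameter count and precision into gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9)]:
    print(f"{label:7s} ~{footprint_gb(bits):5.1f} GB")
# FP16 ~64 GB, Q8_0 ~34 GB, Q4_K_M ~19 GB, Q3_K_M ~16 GB
```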
Even at 20GB, the model won't fit into a 6GB GPU. This is where tools like LM Studio or Ollama (both running llama.cpp under the hood) come into play. They allow you to split the model across devices: you load as many transformer layers as will fit into your 6GB of fast GPU VRAM and offload the remaining layers to slower system RAM, where the CPU processes them.
The Setup:
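As a minimal sketch of what that split looks like in code, here is partial offloading with llama-cpp-python (the same llama.cpp engine behind LM Studio and Ollama). The model filename and the n_gpu_layers value are illustrative; in practice you lower n_gpu_layers until the model loads without exhausting your 6GB of VRAM.

```python
# pip install llama-cpp-python (built with GPU support for offloading)
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical local GGUF file
    n_gpu_layers=14,  # layers kept in the 6GB of GPU VRAM; the rest stay in system RAM
    n_ctx=4096,       # context window; a larger context enlarges the KV cache
)

out = llm("Write a Python function that reverses a linked list.", max_tokens=256)
print(out["choices"][0]["text"])
```

LM Studio exposes the same knob as a GPU-offload slider in its UI, and Ollama typically picks a layer split automatically based on the VRAM it detects.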
Running a model partially on system RAM means your token generation speed drops sharply. Instead of the 50+ tokens per second you would see with the model fully in VRAM, you might get 3 to 5 tokens per second, roughly the pace of a fast typist. That's not ideal for real-time applications, but it is more than enough for offline coding assistance, deep research, and local data analysis without sending sensitive information to the cloud.
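If you want to see the trade-off on your own hardware, a rough (illustrative) measurement with the same llama-cpp-python setup might look like this:

```python
# Rough tokens-per-second measurement; model path, layer count, and prompt are illustrative.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-32b-instruct-q4_k_m.gguf", n_gpu_layers=14, n_ctx=4096)

start = time.perf_counter()
out = llm("Summarize the trade-offs of CPU offloading in one paragraph.", max_tokens=200)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually generated
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")
```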
"The fact that a 6GB GPU from 2016 can run a model capable of passing the bar exam is a testament to the open-source AI engineering community."