Docs
Everything you need to understand parameters, memory, speed, and how Will It Run AI grades models for your hardware.
Section 1: Parameters
A model's parameters are the learned weights that encode its knowledge. More parameters generally means better quality, but also more memory to load and run the model.
Dense models activate every parameter on every token. A 70B dense model uses all 70 billion parameters for each prediction.
Mixture-of-Experts (MoE) models have a larger total parameter count but activate only a subset per token. For example, Mixtral 8x7B is about 47B parameters total but only ~13B active per token, and Qwen3 Coder 30B-A3B has 30B total but only 3B active. MoE models therefore need memory for all the weights, while compute scales only with the active parameters.
Why it matters for VRAM: you need enough memory to hold all parameters (total, not active), plus KV cache and runtime overhead. The active parameter count mostly affects speed, not memory.
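The total-vs-active distinction can be sketched in a few lines. This is a minimal illustration, not the site's actual code; the function name is ours, the 47B/13B shape is an example MoE (Mixtral-8x7B-like), and FP16 weights are assumed:

```python
# Memory is driven by TOTAL parameters; per-token compute by ACTIVE ones.
def moe_summary(total_b: float, active_b: float, bytes_per_param: float = 2.0):
    """bytes_per_param=2.0 assumes unquantized FP16/BF16 weights."""
    return {
        "weights_gb": total_b * bytes_per_param,   # all weights must be resident
        "compute_fraction": active_b / total_b,    # share of params touched per token
    }

moe_summary(47, 13)  # ~94 GB of FP16 weights, but only ~28% of params per token
```

Quantization (next section) shrinks `bytes_per_param`, but the total-count rule for memory stays the same.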
Section 2: Quantization
Quantization reduces the precision of model weights to save memory. Instead of storing each weight as a 16-bit float (~2 bytes), you can compress them to 4, 3, or even 2 bits. The tradeoff is a small loss in output quality.
| Scheme | Bits | GB per Billion Params | Quality |
|---|---|---|---|
| F16 / BF16 | 16 | 2.00 | Reference (lossless) |
| Q8_0 | 8 | ~1.02 | Near-lossless |
| Q6_K | 6 | ~0.78 | Excellent |
| Q5_K_M | 5 | ~0.68 | Very good |
| Q4_K_M | 4 | ~0.58 | Good (sweet spot) |
| Q3_K_M | 3 | ~0.46 | Acceptable |
| Q2_K | 2 | ~0.34 | Noticeable degradation |
Sweet spot: Q4_K_M offers the best balance of quality and memory for most use cases. Use Q5_K_M or Q6_K when you have headroom and need higher fidelity. Q3_K_M is acceptable for very large models that otherwise would not fit.
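The table translates directly into a size estimate. A small sketch using the densities above (the dictionary and function names are ours, and the GB-per-billion figures are the approximate values from the table):

```python
GB_PER_BILLION = {  # approximate densities from the table above
    "F16": 2.00, "Q8_0": 1.02, "Q6_K": 0.78,
    "Q5_K_M": 0.68, "Q4_K_M": 0.58, "Q3_K_M": 0.46, "Q2_K": 0.34,
}

def weights_gb(params_b: float, scheme: str) -> float:
    """Approximate in-memory size of the quantized weights."""
    return params_b * GB_PER_BILLION[scheme]

weights_gb(70, "Q4_K_M")  # 70B at 4-bit: ~40.6 GB
weights_gb(70, "F16")     # same model unquantized: ~140 GB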
Section 3: Memory
The core formula for memory needed to run a model:
VRAM needed ≈ params × gbPerBillion + KV cache + runtime overhead + headroom

GPU VRAM is dedicated video memory. An RTX 4090 has 24 GB, an RTX 5090 has 32 GB, and an A100 has 80 GB. What you see is what you get.
Apple Silicon unified memory is shared between the CPU, GPU, and system. A Mac with 128 GB unified memory does not have 128 GB available for model weights. The OS, apps, and GPU driver take a cut. We estimate roughly 72% of unified memory is available for inference.
Why headroom matters: beyond raw model weights, you need memory for the KV cache (which grows with context window length), runtime overhead (Ollama, llama.cpp, etc.), and a safety buffer to avoid OOM crashes. Longer context windows require significantly more KV cache memory.
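The formula and the unified-memory caveat can be combined into a quick fit check. This is an illustrative sketch: the function names, the 1 GB runtime default, and the 0.5 GB headroom default are our assumptions; the 72% unified-memory factor is the estimate from the text above:

```python
def vram_needed_gb(params_b: float, gb_per_billion: float,
                   kv_cache_gb: float, runtime_gb: float = 1.0,
                   headroom_gb: float = 0.5) -> float:
    """VRAM needed ≈ params × gbPerBillion + KV cache + runtime + headroom.
    runtime_gb and headroom_gb defaults are illustrative assumptions."""
    return params_b * gb_per_billion + kv_cache_gb + runtime_gb + headroom_gb

def usable_memory_gb(total_gb: float, unified: bool = False) -> float:
    """Dedicated VRAM is fully usable; unified memory is ~72% usable."""
    return total_gb * 0.72 if unified else total_gb

# 7B @ Q4_K_M with a modest context window:
need = vram_needed_gb(7, 0.58, kv_cache_gb=0.8)   # ~6.4 GB
need <= usable_memory_gb(24)                       # fits an RTX 4090
need <= usable_memory_gb(8, unified=True)          # too tight for 8 GB unified
```

Note how an 8 GB unified-memory Mac effectively offers only ~5.8 GB, which is why the same model can fit an 8 GB discrete GPU but not an 8 GB Mac.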
Interactive tool
Select a model size, quantization level, and GPU to see if it fits — and which hardware options work.
7B @ Q4_K_M → 6.1 GB needed
4.3 GB weights + 0.8 GB KV cache + 1 GB runtime
Section 4: Memory Bandwidth and Speed
Memory bandwidth, more than raw compute power, determines how fast you can generate tokens. LLM inference is memory-bound: generating each token requires reading the entire set of model weights from memory.
decode tok/s ≈ (bandwidth / weightsGB) × efficiency

GPU bandwidth examples: the RTX 4090 has 1,008 GB/s, the RTX 5090 has 1,792 GB/s, and the A100 80GB has 2,039 GB/s. Higher bandwidth means faster token generation for the same model size.
Apple Silicon's advantage: Mac chips offer high memory bandwidth relative to their total memory. An M4 Max provides 546 GB/s with up to 128 GB of unified memory, making it competitive for large models that would not fit in a discrete GPU's VRAM.
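The decode formula in code form. A sketch under stated assumptions: the function name is ours, and the 0.6 efficiency default is an assumed placeholder (real efficiency varies by runtime and hardware):

```python
def decode_tps(bandwidth_gbs: float, weights_gb: float,
               efficiency: float = 0.6) -> float:
    """decode tok/s ≈ (bandwidth / weightsGB) × efficiency.
    efficiency (0..1) captures real-world losses; 0.6 is an assumed default."""
    return bandwidth_gbs / weights_gb * efficiency

decode_tps(1008, 4.3)   # RTX 4090, 7B @ Q4_K_M: ~140 tok/s
decode_tps(546, 40.6)   # M4 Max, 70B @ Q4_K_M: ~8 tok/s
```

The second call shows the Apple Silicon tradeoff: the 70B model fits in unified memory at all, but bandwidth caps it at single-digit tok/s.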
Section 5: Runtimes
Ollama: The easiest way to get started. One-command install, automatic model management, built-in API server. Uses llama.cpp under the hood.
Best for: Getting started, simple local use
llama.cpp: The underlying inference engine for most local AI tools. The most flexible option, supporting advanced features like grammar-constrained generation and speculative decoding.
Best for: Power users, custom setups, maximum control
LM Studio: GUI-based model manager and chat interface. Good for exploring models, comparing outputs, and non-technical users.
Best for: Exploration, visual interface, quick testing
vLLM: Production-grade serving engine with batched inference, continuous batching, and PagedAttention. Optimized for throughput over latency.
Best for: Production serving, multi-user, high throughput
MLX: Apple's native ML framework for Apple Silicon. Best performance on Macs with unified memory, leveraging Metal GPU acceleration.
Best for: Apple Silicon Macs, best Mac performance
ExLlamaV2: High-performance GPTQ/EXL2 inference. Excellent quantization quality with fast inference on NVIDIA GPUs.
Best for: NVIDIA GPUs with EXL2/GPTQ quants
Section 6: Grading
Will It Run AI assigns letter grades based on a 0-100 score that combines fit status (does it fit in memory?) with a speed bonus (how fast will it generate tokens?).
Excellent
Runs fast with comfortable headroom. No compromises.
Great
Runs well with minor tradeoffs. Slightly less headroom or speed.
Good
Usable but with some limitations. May be tight on memory or slower.
Usable
Significant tradeoffs. Works but expect slower speeds or limited context.
Poor
Barely functional. Heavy offloading or very slow generation.
Won't run
Does not fit. The model is too large for this hardware.
Speed bonus formula
speedBonus = min(25, sqrt(decodeTps) × 3.5)

The speed bonus rewards faster hardware and smaller models, adding up to 25 points on top of the base fit score. A model decoding at 50 tok/s gets ~24.7 bonus points; one at 10 tok/s gets ~11.1 points.
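The bonus formula, verified against the two examples above (only the function name is ours):

```python
import math

def speed_bonus(decode_tps: float) -> float:
    """speedBonus = min(25, sqrt(decodeTps) × 3.5), capped at 25 points."""
    return min(25.0, math.sqrt(decode_tps) * 3.5)

round(speed_bonus(50), 1)   # 24.7 -- just under the cap
round(speed_bonus(10), 1)   # 11.1
speed_bonus(100)            # 25.0 -- the cap kicks in above ~51 tok/s
```

The square root means diminishing returns: quadrupling decode speed only doubles the bonus, and anything past ~51 tok/s earns the full 25 points.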
Section 7: How the Engine Works
Fit estimation: Estimates whether a model fits on hardware by computing total memory needed (weights + KV cache + runtime overhead + headroom) against available VRAM or unified memory. Classifies into native_fit, tight_fit, hybrid_fit, unsafe_fit, or no_fit based on the memory utilization ratio.
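The classification step might look like this. The status names come from the text; the threshold values below are hypothetical illustrations, not the engine's real cutoffs:

```python
def classify_fit(needed_gb: float, available_gb: float) -> str:
    """Bucket a model/hardware pair by memory utilization ratio.
    Threshold values are hypothetical; only the status names are from the doc."""
    ratio = needed_gb / available_gb
    if ratio <= 0.80:
        return "native_fit"    # comfortable headroom
    if ratio <= 0.95:
        return "tight_fit"     # fits, little room for long context
    if ratio <= 1.20:
        return "hybrid_fit"    # partial CPU offload plausible
    if ratio <= 1.50:
        return "unsafe_fit"    # heavy offload, likely very slow
    return "no_fit"

classify_fit(6.1, 24)   # 7B @ Q4_K_M on an RTX 4090: "native_fit"
```

A ratio-based scheme like this keeps the classifier hardware-agnostic: the same cutoffs apply to a 12 GB discrete GPU and a 128 GB unified-memory Mac.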
Model ranking: Ranks models for a given hardware and workload combination using multi-factor scoring: quality tier, workload specialty match, model freshness, fit status, parameter fit for the workload, memory utilization efficiency, decode speed, latency, and context alignment.
Recommendations are resolved against known artifacts (GGUF, SafeTensors, EXL2, etc.) and runtime capabilities. The system knows which runtimes support which formats and quantization schemes for each hardware backend.
The Rust API includes an evidence confidence endpoint that scores claims based on source reliability, recency, and corroboration. This powers future features like community-reported benchmarks and verified compatibility reports.