Docs
Everything you need to understand parameters, memory, speed, and how Will It Run AI grades models for your hardware.
Section 1: Parameters
A model's parameters are the learned weights that encode its knowledge. More parameters generally means better quality, but also more memory to load and run the model.
Dense models activate every parameter on every token. A 70B dense model uses all 70 billion parameters for each prediction.
Mixture-of-Experts (MoE) models have a larger total parameter count but activate only a subset per token. For example, Mixtral 8x7B is about 47B parameters total but only ~13B active per token, and Qwen3 Coder 30B-A3B has 30B total but only 3B active. MoE models therefore need memory for all the weights, while compute scales only with the active parameters.
Why it matters for VRAM: you need enough memory to hold all parameters (total, not active), plus KV cache and runtime overhead. The active parameter count mostly affects speed, not memory.
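The total-vs-active distinction can be sketched in a few lines. This is a minimal illustration, not the site's actual code; the function name is ours, the 47B/13B shape is an example MoE (Mixtral-8x7B-like), and FP16 weights are assumed:

```python
# Memory is driven by TOTAL parameters; per-token compute by ACTIVE ones.
def moe_summary(total_b: float, active_b: float, bytes_per_param: float = 2.0):
    """bytes_per_param=2.0 assumes unquantized FP16/BF16 weights."""
    return {
        "weights_gb": total_b * bytes_per_param,   # all weights must be resident
        "compute_fraction": active_b / total_b,    # share of params touched per token
    }

moe_summary(47, 13)  # ~94 GB of FP16 weights, but only ~28% of params per token
```

Quantization (next section) shrinks `bytes_per_param`, but the total-count rule for memory stays the same.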
Section 2: Quantization
Quantization reduces the precision of model weights to save memory. Instead of storing each weight as a 16-bit float (~2 bytes), you can compress them to 4, 3, or even 2 bits. The tradeoff is a small loss in output quality.
| Scheme | Bits | GB per Billion Params | Quality |
|---|---|---|---|
| F16 / BF16 | 16 | 2.00 | Reference (lossless) |
| Q8_0 | 8 | ~1.02 | Near-lossless |
| Q6_K | 6 | ~0.78 | Excellent |
| Q5_K_M | 5 | ~0.68 | Very good |
| Q4_K_M | 4 | ~0.58 | Good (sweet spot) |
| Q3_K_M | 3 | ~0.46 | Acceptable |
| Q2_K | 2 | ~0.34 | Noticeable degradation |
Sweet spot: Q4_K_M offers the best balance of quality and memory for most use cases. Use Q5_K_M or Q6_K when you have headroom and need higher fidelity. Q3_K_M is acceptable for very large models that otherwise would not fit.
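The table translates directly into a size estimate. A small sketch using the densities above (the dictionary and function names are ours, and the GB-per-billion figures are the approximate values from the table):

```python
GB_PER_BILLION = {  # approximate densities from the table above
    "F16": 2.00, "Q8_0": 1.02, "Q6_K": 0.78,
    "Q5_K_M": 0.68, "Q4_K_M": 0.58, "Q3_K_M": 0.46, "Q2_K": 0.34,
}

def weights_gb(params_b: float, scheme: str) -> float:
    """Approximate in-memory size of the quantized weights."""
    return params_b * GB_PER_BILLION[scheme]

weights_gb(70, "Q4_K_M")  # 70B at 4-bit: ~40.6 GB
weights_gb(70, "F16")     # same model unquantized: ~140 GB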
Section 3: Memory
The core formula for memory needed to run a model:
VRAM needed ≈ params × gbPerBillion + KV cache + runtime overhead + headroom

GPU VRAM is dedicated video memory. An RTX 4090 has 24 GB, an RTX 5090 has 32 GB, and an A100 has 80 GB. What you see is what you get.
Apple Silicon unified memory is shared between the CPU, GPU, and system. A Mac with 128 GB unified memory does not have 128 GB available for model weights. The OS, apps, and GPU driver take a cut. We estimate roughly 72% of unified memory is available for inference.
Why headroom matters: beyond raw model weights, you need memory for the KV cache (which grows with context window length), runtime overhead (Ollama, llama.cpp, etc.), and a safety buffer to avoid OOM crashes. Longer context windows require significantly more KV cache memory.
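The formula and the unified-memory caveat can be combined into a quick fit check. This is an illustrative sketch: the function names, the 1 GB runtime default, and the 0.5 GB headroom default are our assumptions; the 72% unified-memory factor is the estimate from the text above:

```python
def vram_needed_gb(params_b: float, gb_per_billion: float,
                   kv_cache_gb: float, runtime_gb: float = 1.0,
                   headroom_gb: float = 0.5) -> float:
    """VRAM needed ≈ params × gbPerBillion + KV cache + runtime + headroom.
    runtime_gb and headroom_gb defaults are illustrative assumptions."""
    return params_b * gb_per_billion + kv_cache_gb + runtime_gb + headroom_gb

def usable_memory_gb(total_gb: float, unified: bool = False) -> float:
    """Dedicated VRAM is fully usable; unified memory is ~72% usable."""
    return total_gb * 0.72 if unified else total_gb

# 7B @ Q4_K_M with a modest context window:
need = vram_needed_gb(7, 0.58, kv_cache_gb=0.8)   # ~6.4 GB
need <= usable_memory_gb(24)                       # fits an RTX 4090
need <= usable_memory_gb(8, unified=True)          # too tight for 8 GB unified
```

Note how an 8 GB unified-memory Mac effectively offers only ~5.8 GB, which is why the same model can fit an 8 GB discrete GPU but not an 8 GB Mac.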
Interactive tool
Select a model size, quantization level, and GPU to see if it fits — and which hardware options work.
7B @ Q4_K_M → 6.1 GB needed
4.3 GB weights + 0.8 GB KV cache + 1 GB runtime
Section 4: Memory Bandwidth and Speed
Memory bandwidth, more than raw compute power, determines how fast you can generate tokens. LLM inference is memory-bound: generating each token requires reading the entire set of model weights from memory.
decode tok/s ≈ (bandwidth / weightsGB) × efficiency

GPU bandwidth examples: the RTX 4090 has 1,008 GB/s, the RTX 5090 has 1,792 GB/s, and the A100 80GB has 2,039 GB/s. Higher bandwidth means faster token generation for the same model size.
Apple Silicon's advantage: Mac chips offer high memory bandwidth relative to their total memory. An M4 Max provides 546 GB/s with up to 128 GB of unified memory, making it competitive for large models that would not fit in a discrete GPU's VRAM.
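The decode formula in code form. A sketch under stated assumptions: the function name is ours, and the 0.6 efficiency default is an assumed placeholder (real efficiency varies by runtime and hardware):

```python
def decode_tps(bandwidth_gbs: float, weights_gb: float,
               efficiency: float = 0.6) -> float:
    """decode tok/s ≈ (bandwidth / weightsGB) × efficiency.
    efficiency (0..1) captures real-world losses; 0.6 is an assumed default."""
    return bandwidth_gbs / weights_gb * efficiency

decode_tps(1008, 4.3)   # RTX 4090, 7B @ Q4_K_M: ~140 tok/s
decode_tps(546, 40.6)   # M4 Max, 70B @ Q4_K_M: ~8 tok/s
```

The second call shows the Apple Silicon tradeoff: the 70B model fits in unified memory at all, but bandwidth caps it at single-digit tok/s.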
Section 5: Runtimes
Ollama: The easiest way to get started. One-command install, automatic model management, built-in API server. Uses llama.cpp under the hood.
Best for: Getting started, simple local use
llama.cpp: The underlying inference engine for most local AI tools. The most flexible option, supporting advanced features like grammar-constrained generation and speculative decoding.
Best for: Power users, custom setups, maximum control
LM Studio: GUI-based model manager and chat interface. Good for exploring models, comparing outputs, and non-technical users.
Best for: Exploration, visual interface, quick testing
vLLM: Production-grade serving engine with batched inference, continuous batching, and PagedAttention. Optimized for throughput over latency.
Best for: Production serving, multi-user, high throughput
MLX: Apple's native ML framework for Apple Silicon. Best performance on Macs with unified memory, leveraging Metal GPU acceleration.
Best for: Apple Silicon Macs, best Mac performance
ExLlamaV2: High-performance GPTQ/EXL2 inference. Excellent quantization quality with fast inference on NVIDIA GPUs.
Best for: NVIDIA GPUs with EXL2/GPTQ quants
Section 6: Grading
Will It Run AI assigns letter grades based on a 0-100 score that combines fit status (does it fit in memory?) with a speed bonus (how fast will it generate tokens?).
Excellent
Runs fast with comfortable headroom. No compromises.
Great
Runs well with minor tradeoffs. Slightly less headroom or speed.
Good
Usable but with some limitations. May be tight on memory or slower.
Usable
Significant tradeoffs. Works but expect slower speeds or limited context.
Poor
Barely functional. Heavy offloading or very slow generation.
Won't run
Does not fit. The model is too large for this hardware.
Speed bonus formula
speedBonus = min(25, sqrt(decodeTps) × 3.5)

The speed bonus rewards faster hardware and smaller models, adding up to 25 points on top of the base fit score. A model decoding at 50 tok/s gets ~24.7 bonus points; one at 10 tok/s gets ~11.1 points.
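The bonus formula, verified against the two examples above (only the function name is ours):

```python
import math

def speed_bonus(decode_tps: float) -> float:
    """speedBonus = min(25, sqrt(decodeTps) × 3.5), capped at 25 points."""
    return min(25.0, math.sqrt(decode_tps) * 3.5)

round(speed_bonus(50), 1)   # 24.7 -- just under the cap
round(speed_bonus(10), 1)   # 11.1
speed_bonus(100)            # 25.0 -- the cap kicks in above ~51 tok/s
```

The square root means diminishing returns: quadrupling decode speed only doubles the bonus, and anything past ~51 tok/s earns the full 25 points.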
Section 7: How the Engine Works
Fit estimation: Estimates whether a model fits on hardware by computing total memory needed (weights + KV cache + runtime overhead + headroom) against available VRAM or unified memory. Classifies into native_fit, tight_fit, hybrid_fit, unsafe_fit, or no_fit based on the memory utilization ratio.
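The classification step might look like this. The status names come from the text; the threshold values below are hypothetical illustrations, not the engine's real cutoffs:

```python
def classify_fit(needed_gb: float, available_gb: float) -> str:
    """Bucket a model/hardware pair by memory utilization ratio.
    Threshold values are hypothetical; only the status names are from the doc."""
    ratio = needed_gb / available_gb
    if ratio <= 0.80:
        return "native_fit"    # comfortable headroom
    if ratio <= 0.95:
        return "tight_fit"     # fits, little room for long context
    if ratio <= 1.20:
        return "hybrid_fit"    # partial CPU offload plausible
    if ratio <= 1.50:
        return "unsafe_fit"    # heavy offload, likely very slow
    return "no_fit"

classify_fit(6.1, 24)   # 7B @ Q4_K_M on an RTX 4090: "native_fit"
```

A ratio-based scheme like this keeps the classifier hardware-agnostic: the same cutoffs apply to a 12 GB discrete GPU and a 128 GB unified-memory Mac.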
Model ranking: Ranks models for a given hardware and workload combination using multi-factor scoring: quality tier, workload specialty match, model freshness, fit status, parameter fit for the workload, memory utilization efficiency, decode speed, latency, and context alignment.
Recommendations are resolved against known artifacts (GGUF, SafeTensors, EXL2, etc.) and runtime capabilities. The system knows which runtimes support which formats and quantization schemes for each hardware backend.
The Rust API includes an evidence confidence endpoint that scores claims based on source reliability, recency, and corroboration. This powers future features like community-reported benchmarks and verified compatibility reports.