How much VRAM does Falcon 40B Instruct need?

Falcon 40B Instruct (40B parameters) requires approximately 44.3 GB of memory with Q5_K_M quantization: 28.8 GB for weights, plus KV cache, runtime overhead, and headroom.

What is the best quantization for Falcon 40B Instruct?

The recommended quantization for Falcon 40B Instruct is Q5_K_M, which balances quality and memory efficiency.

Can it run?

Can NVIDIA H100 80GB run Falcon 40B Instruct?

Yes, NVIDIA H100 80GB can run Falcon 40B Instruct with a C grade (Runs well). Expected decode speed: 99.7 tok/s.

Grade: C (Usable)

Runs well

Using Q5_K_M in Ollama

Fit status: Runs well

Decode: 99.7 tok/s

TTFT: 1943 ms

Safe context

Memory: 44.3 GB / 80.0 GB

Memory breakdown

Weights: 28.8 GB

KV Cache: 6.3 GB

Runtime: 1.2 GB

Headroom: 8.0 GB
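
The 44.3 GB total is the sum of the four components listed in the memory breakdown. A minimal sketch of that arithmetic follows; the variable names are illustrative, not part of any tool.

```python
# Memory components for Q5_K_M, in GB, copied from the breakdown on this page.
components_gb = {
    "weights": 28.8,
    "kv_cache": 6.3,
    "runtime": 1.2,
    "headroom": 8.0,
}
usable_gb = 80.0  # NVIDIA H100 80GB, as reported on this page

# Round to one decimal to match the page's precision and avoid float noise.
total_gb = round(sum(components_gb.values()), 1)
free_gb = round(usable_gb - total_gb, 1)
utilization = total_gb / usable_gb

print(f"Total: {total_gb} GB, free: {free_gb} GB, utilization: {utilization:.0%}")
```

The components add up to 44.3 GB, leaving 35.7 GB of the card unallocated.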

Performance by workload

Workload	Grade	Fit	Decode	TTFT	Context
Agentic Coding	C	Runs well	99.7 tok/s	2826 ms	8K
Chat	C	Runs well	99.7 tok/s	1060 ms	8K
Coding	C	Runs well	99.7 tok/s	1943 ms	8K
RAG	C	Runs well	99.7 tok/s	3532 ms	8K
Reasoning	C	Runs well	99.7 tok/s	2296 ms	8K
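
To turn these figures into a rough end-to-end response time, one simple approximation is TTFT plus output length divided by decode speed. The sketch below assumes that model; the function name and the 500-token output length are hypothetical, and real latency also depends on prompt length and batching.

```python
# Figures copied from the workload table on this page.
DECODE_TOK_PER_S = 99.7
TTFT_MS = {
    "Agentic Coding": 2826,
    "Chat": 1060,
    "Coding": 1943,
    "RAG": 3532,
    "Reasoning": 2296,
}


def estimated_latency_s(workload: str, output_tokens: int) -> float:
    """Approximate response time: time to first token plus decode time.

    Assumption: decode speed stays constant for the whole response.
    """
    return TTFT_MS[workload] / 1000 + output_tokens / DECODE_TOK_PER_S


# Hypothetical 500-token reply for each workload.
for name in TTFT_MS:
    print(f"{name}: {estimated_latency_s(name, 500):.1f} s")
```

Under this approximation, a 500-token Chat reply takes roughly 6.1 seconds, most of it spent decoding rather than waiting for the first token.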

Quantization options

How Falcon 40B Instruct (40B params) fits at each quantization level on NVIDIA H100 80GB (80.0 GB usable).

Quant	Bits	VRAM	Quality	Fit (grade, score)
Q2_K	2	15.6 GB	Low	D (33)
Q3_K_S	3	19.6 GB	Low	D (34)
NVFP4	4	22.4 GB	Medium	D (35)
Q4_K_M	4	24.4 GB	Medium	D (35)
Q5_K_M	5	28.8 GB	High	D (36)
Q6_K	6	32.8 GB	High	D (37)
Q8_0 (Best for your GPU)	8	42.8 GB	Very High	C (40)
F16	16	82.0 GB	Maximum	F (0)
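
One way to read this table is to pick the highest-bit quantization whose total footprint still fits in usable memory. The sketch below does that under a loud assumption: the KV cache, runtime, and headroom figures from the Q5_K_M memory breakdown (6.3, 1.2, and 8.0 GB) are reused unchanged for every quantization level. The page does not state how its own recommendation is computed, and the helper names here are hypothetical.

```python
# (quant name, weights in GB) from the table, ordered by increasing bit width.
QUANTS = [
    ("Q2_K", 15.6),
    ("Q3_K_S", 19.6),
    ("NVFP4", 22.4),
    ("Q4_K_M", 24.4),
    ("Q5_K_M", 28.8),
    ("Q6_K", 32.8),
    ("Q8_0", 42.8),
    ("F16", 82.0),
]

# Assumption: these overheads are the same at every quantization level.
KV_CACHE_GB = 6.3
RUNTIME_GB = 1.2
HEADROOM_GB = 8.0


def total_gb(weights_gb: float) -> float:
    """Weights plus the fixed overheads assumed above."""
    return weights_gb + KV_CACHE_GB + RUNTIME_GB + HEADROOM_GB


def largest_fitting(quants, usable_gb):
    """Return the last (highest-bit) quant that fits, or None if none do."""
    fitting = [name for name, weights in quants if total_gb(weights) <= usable_gb]
    return fitting[-1] if fitting else None


print(largest_fitting(QUANTS, 80.0))
```

With 80.0 GB usable this selects Q8_0, which agrees with the "Best for your GPU" label, while F16 is excluded because its weights alone exceed the card's memory.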

Get started

Ollama

ollama run falcon:40b-instruct

HuggingFace

huggingface-cli download tiiuae/falcon-40b-instruct
