How much VRAM does Llama 3.1 70B need?

Llama 3.1 70B (70B parameters) requires approximately 62.8 GB of memory with Q4_K_M quantization: 42.7 GB for weights, 10.9 GB for KV cache, 1.2 GB for runtime overhead, and 8.0 GB of headroom.

What is the best quantization for Llama 3.1 70B?

The recommended quantization for Llama 3.1 70B is Q4_K_M, which balances quality and memory efficiency.

Can it run?

Can NVIDIA A100 80GB run Llama 3.1 70B?

Yes, the NVIDIA A100 80GB can run Llama 3.1 70B with a C grade (Runs well). Expected decode speed: 40.1 tok/s.

Grade: C (Usable)

Runs well

Using Q4_K_M in Ollama

Capabilities:

Fit status: Runs well
Decode: 40.1 tok/s
TTFT: 4827 ms
Safe context: 20K
Memory: 62.8 GB / 80.0 GB

Memory breakdown

Weights: 42.7 GB
KV Cache: 10.9 GB
Runtime: 1.2 GB
Headroom: 8.0 GB
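
The four components of the breakdown add up to the 62.8 GB total reported for this configuration. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and not part of any tool.

```python
# Memory components from the breakdown, in GB (values taken from this page).
components_gb = {
    "weights": 42.7,
    "kv_cache": 10.9,
    "runtime": 1.2,
    "headroom": 8.0,
}
gpu_capacity_gb = 80.0  # NVIDIA A100 80GB

# Sum the components and compare against the GPU's capacity.
total_gb = round(sum(components_gb.values()), 1)
free_gb = round(gpu_capacity_gb - total_gb, 1)
utilization = total_gb / gpu_capacity_gb

print(f"Total: {total_gb} GB of {gpu_capacity_gb} GB "
      f"({utilization:.1%} used, {free_gb} GB free)")
```

Note that the total already includes the 8.0 GB headroom, so the remaining free memory is on top of that safety margin.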

Performance by workload

Workload	Grade	Fit	Decode	TTFT	Context
Agentic Coding	C	Tight fit	40.1 tok/s	7020 ms	35K
Chat	C	Runs well	40.1 tok/s	2633 ms	11K
Coding	C	Runs well	40.1 tok/s	4827 ms	20K
RAG	C	Tight fit	40.1 tok/s	8776 ms	35K
Reasoning	C	Runs well	40.1 tok/s	5704 ms	20K
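
To see what these figures mean for a single request, total response time can be approximated as TTFT plus the output length divided by the decode speed. The sketch below assumes a 500-token response, which is an illustrative figure and not a value from this page, and it ignores batching and any other serving overhead.

```python
def response_time_s(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: time to first token plus decode time."""
    return ttft_ms / 1000.0 + output_tokens / decode_tok_s

DECODE_TOK_S = 40.1   # decode speed from the table (same for every workload)
OUTPUT_TOKENS = 500   # assumption: illustrative response length

# TTFT in milliseconds per workload, copied from the table.
ttft_ms_by_workload = {
    "Agentic Coding": 7020,
    "Chat": 2633,
    "Coding": 4827,
    "RAG": 8776,
    "Reasoning": 5704,
}

estimates = {
    name: round(response_time_s(ttft, DECODE_TOK_S, OUTPUT_TOKENS), 1)
    for name, ttft in ttft_ms_by_workload.items()
}

for name, seconds in estimates.items():
    print(f"{name}: ~{seconds} s for {OUTPUT_TOKENS} output tokens")
```

Because decode speed is identical across workloads, the differences between rows come entirely from TTFT, which grows with the context each workload uses.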

Quantization options

How Llama 3.1 70B (70B params) fits at each quantization level on NVIDIA A100 80GB (80.0 GB usable).

Quant	Bits	VRAM	Quality	Fit
Q2_K	2	27.3 GB	Low	D (37)
Q3_K_S	3	34.3 GB	Low	D (39)
NVFP4	4	39.2 GB	Medium	D (40)
Q4_K_M	4	42.7 GB	Medium	C (41)
Q5_K_M	5	50.4 GB	High	C (43)
Q6_K (best for your GPU)	6	57.4 GB	High	C (44)
Q8_0	8	74.9 GB	Very High	C (44)
F16	16	143.5 GB	Maximum	F (0)
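
One way to read this table is to ask which quantization is the largest that still fits once non-weight memory is added. The sketch below assumes the KV cache, runtime, and headroom figures from the Q4_K_M breakdown (10.9 + 1.2 + 8.0 GB) apply unchanged at every quantization level. That is a simplification and not necessarily how the page computes its recommendation; the function name is hypothetical.

```python
# (quant name, weights VRAM in GB), copied from the table.
QUANTS = [
    ("Q2_K", 27.3),
    ("Q3_K_S", 34.3),
    ("NVFP4", 39.2),
    ("Q4_K_M", 42.7),
    ("Q5_K_M", 50.4),
    ("Q6_K", 57.4),
    ("Q8_0", 74.9),
    ("F16", 143.5),
]

# Assumption: same non-weight memory as the Q4_K_M breakdown (KV cache + runtime + headroom).
OVERHEAD_GB = 10.9 + 1.2 + 8.0
GPU_CAPACITY_GB = 80.0

def largest_fitting_quant(quants, overhead_gb, capacity_gb):
    """Return the name of the largest quant whose weights plus overhead fit, or None."""
    fitting = [(name, gb) for name, gb in quants if gb + overhead_gb <= capacity_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda q: q[1])[0]

best = largest_fitting_quant(QUANTS, OVERHEAD_GB, GPU_CAPACITY_GB)
print(f"Largest quant that fits with {OVERHEAD_GB:.1f} GB overhead: {best}")
```

Under this assumption the sketch selects Q6_K, matching the row marked as best for this GPU. Q8_0's weights alone fit in 80 GB, but only with a smaller KV cache and headroom budget than assumed here.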

Get started

Ollama

ollama run llama3.1:70b
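
Once the model is pulled, it can also be queried programmatically through Ollama's local REST API. The sketch below assumes Ollama is serving on its default port (11434), that the model tag is `llama3.1:70b`, and that the 20K safe context corresponds to `num_ctx=20000`. The helper names `build_payload` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3.1:70b", num_ctx: int = 20_000) -> dict:
    """Build a non-streaming generate request with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # assumption: 20K safe context from this page
    }

def generate(prompt: str) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

if __name__ == "__main__":
    print(generate("Summarize what Q4_K_M quantization means in one sentence."))
```

Keeping `num_ctx` at or below the safe context for the workload helps the KV cache stay within the memory budget shown in the breakdown.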

HuggingFace

huggingface-cli download meta-llama/Llama-3.1-70B-Instruct

Upgrade options

Hardware that runs Llama 3.1 70B well

NVIDIA GH200 96GB (Budget pick)
Grade B, 75.9 tok/s decode

NVIDIA H20 96GB (Biggest leap)
Grade B, 75.9 tok/s decode

