Will It Run AI

All estimates are approximations based on mathematical models and public specifications. Actual performance may vary. Do not make purchasing decisions based solely on these estimates.

Data sourced from Hugging Face, Ollama, and official model documentation. Model names and logos are trademarks of their respective owners.

© 2026 Will It Run AI — Fase Consulting Ibiza, S.L. (NIF: B57969656)


TII

Falcon 40B Instruct

Legacy
HuggingFace
Downloads: 44.3K
Likes: 1.2K
Released: May 2023
Context: 8K tokens
License: Apache 2.0
Quality: 3 (Entry)

Get started

— copy & paste to run locally
Ollama
ollama run falcon:40b-instruct
HuggingFace
huggingface-cli download tiiuae/falcon-40b-instruct

Quick specs

Parameters: 40B
Architecture: dense
Context: 8K tokens
Modality: text
Min RAM: 15.6 GB
Rec. RAM: 28.8 GB (Q5_K_M)
License: Apache 2.0
Family: Falcon
Capabilities: Chat, Reasoning

About this model

Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by TII, based on Falcon-40B and finetuned on a mixture of Baize. It is made available under the Apache 2.0 license.

  • You are looking for a ready-to-use chat/instruct model based on Falcon-40B.
  • Falcon-40B is the best open-source model available: it outperforms LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
  • It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery (Shazeer et al., 2019).


Quick picks

Best budget (Apple): Mac mini M4 64GB, ~$1,099, ~3 tok/s (grade C)
Best overall (NVIDIA): NVIDIA H100 80GB, ~$40,000, ~100 tok/s (grade C)
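Throughput figures like these are roughly consistent with a memory-bandwidth ceiling: single-stream decoding reads every weight once per generated token, so tokens/sec is bounded by memory bandwidth divided by model size. A minimal sketch of that arithmetic (the ~3,350 GB/s H100-class bandwidth figure is an assumption, and sustained throughput is lower than this ceiling because of KV-cache reads and kernel overhead):

```python
def decode_ceiling_tok_s(mem_bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed: each generated token
    streams the full set of weights from memory once."""
    return mem_bandwidth_gb_s / model_size_gb

# Assumed peak bandwidth (GB/s) against the Q5_K_M weight size above.
print(decode_ceiling_tok_s(3350, 28.8))  # ~116 tok/s ceiling
```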

Best hardware

Top picks for Falcon 40B Instruct

  • NVIDIA H100 80GB (grade C), 80 GB
  • NVIDIA H800 80GB (grade C), 80 GB
  • AMD Instinct MI210 64GB (grade C), 64 GB
  • NVIDIA A100 80GB (grade C), 80 GB
  • NVIDIA H100 PCIe 80GB (grade C), 80 GB

Quantization options

VRAM estimates by quant level


Quant    Bits  VRAM     Quality
Q2_K     2     15.6 GB  Low
Q3_K_S   3     19.6 GB  Low
NVFP4    4     22.4 GB  Medium
Q4_K_M   4     24.4 GB  Medium
Q5_K_M   5     28.8 GB  High
Q6_K     6     32.8 GB  High
Q8_0     8     42.8 GB  Very High
F16      16    82.0 GB  Maximum
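The VRAM column tracks a simple rule: raw weight size is parameters × bits-per-weight / 8. The listed figures run higher than this baseline because k-quants mix bit widths (so effective bits-per-weight exceed the nominal label) and quantization scales and metadata add overhead; treating those extras as a site-specific adjustment we don't model, the baseline alone is a sketch:

```python
def raw_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Raw weight footprint in GB: params (billions) * bits / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

# 40B parameters at common nominal quant widths (weights only)
for bits in (2, 4, 5, 8, 16):
    print(f"{bits:>2}-bit: {raw_weights_gb(40, bits):5.1f} GB")
```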


Memory breakdown

Reference: NVIDIA A10 24GB

Weights: 28.8 GB
KV Cache: 6.3 GB
Runtime: 1.2 GB
Headroom: 2.4 GB
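The KV-cache line grows with context length rather than model size: for each token in context, every layer stores one key and one value vector per KV head. A generic sketch of that arithmetic (the layer and head counts below are illustrative placeholders, not Falcon-40B's published config, and the 6.3 GB figure above may include batch or allocator overhead this does not model):

```python
def kv_cache_bytes(n_layers: int, ctx_tokens: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * tokens * kv_heads * head_dim * element size."""
    return 2 * n_layers * ctx_tokens * n_kv_heads * head_dim * bytes_per_elem

# Illustrative config (assumed values) with an fp16 cache at full 8K context
size = kv_cache_bytes(n_layers=60, ctx_tokens=8192, n_kv_heads=8, head_dim=64)
print(f"{size / 1e9:.2f} GB")  # ~1.01 GB for this assumed config
```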