
NVIDIA A100 40GB

Name: NVIDIA A100 40GB
Brand: NVIDIA
Price: 10000 USD

Ampere Datacenter · Datacenter · Ampere · PCIe 4 · CUDA

40 GB VRAM

1.6 TB/s bandwidth

312 TFLOPS FP16 compute

624 TOPS INT8 inference

$10,000 MSRP

Comparison chart: NVIDIA A100 40GB vs. category average vs. MacBook Pro M1 Max 64GB

Specifications

Compute
FP16: 312 TFLOPS (dense, Tensor Core)
INT8: 624 TOPS (dense, Tensor Core)
Architecture: Ampere

Memory
VRAM: 40 GB
Bandwidth: 1,555 GB/s

General
Family: Ampere Datacenter
Segment: Datacenter
Interconnect: PCIe 4
Compute Platform: CUDA
MSRP: $10,000
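Single-stream decoding is usually memory-bandwidth-bound: each generated token requires streaming the model's active weights from VRAM. Dividing bandwidth by the bytes read per token therefore gives a rough ceiling on tokens per second. The sketch below uses the 1,555 GB/s figure from the table; the model sizes and the ~4.5 bits per weight are hypothetical, and KV-cache reads and kernel efficiency are ignored, so measured rates typically land below this ceiling.

```python
# Rough ceiling on single-stream decode speed for a memory-bandwidth-bound GPU.
# Assumption: every generated token streams all active weights from VRAM once;
# KV-cache reads, activations, and kernel inefficiency are ignored.

A100_40GB_BANDWIDTH_GB_S = 1555.0  # from the spec table


def weights_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights read per token, in GB (1 GB = 1e9 bytes)."""
    return active_params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s: bandwidth divided by bytes streamed per token."""
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Hypothetical dense 24B model at ~4.5 bits/weight (a 4-bit quant plus scales).
    dense = weights_gb(24.0, 4.5)
    # Hypothetical mixture-of-experts model with ~3B active parameters per token.
    moe = weights_gb(3.0, 4.5)
    for name, gb in (("dense 24B", dense), ("MoE, 3B active", moe)):
        ceiling = decode_ceiling_tok_s(A100_40GB_BANDWIDTH_GB_S, gb)
        print(f"{name}: {gb:.2f} GB per token -> at most {ceiling:.0f} tok/s")
```

The gap between the two ceilings illustrates why a mixture-of-experts model such as Qwen 3 30B A3B, which activates roughly 3B of its parameters per token, can decode faster than a dense 24B model despite having more total parameters.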

Architecture

Ampere

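Ampere is the NVIDIA GPU architecture that succeeded Volta in the datacenter and Turing in consumer cards. The A100's GA100 chip is built on TSMC's 7nm (N7) process; Samsung's 8nm process applies to the smaller GA10x Ampere chips, such as those in the GeForce RTX 30 series. Ampere introduced third-generation Tensor Cores, which add TF32 and BF16 support, accelerate 2:4 structured sparsity (including for INT8 operations), and substantially raise FP16 throughput over Volta and Turing.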

AI Relevance

Sparsity-aware Tensor Cores can deliver up to twice the peak math throughput for weights that follow the 2:4 structured-sparsity pattern. However, Ampere lacks FP8 support, so quantized inference is less efficient than on Ada Lovelace, Hopper, or Blackwell, whose Tensor Cores support FP8.

Process: TSMC 7nm (N7)
Platform: CUDA
Tensor Cores: 3rd generation
Precisions: FP64, TF32, FP32, FP16, BF16, INT8, INT4
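Structured sparsity on Ampere uses a 2:4 pattern: in every contiguous group of four weights, at most two are nonzero, which lets the Tensor Cores skip the zeros. A minimal sketch of one-shot magnitude pruning into that pattern, using plain Python lists; it illustrates the pattern only and is not NVIDIA's pruning tooling. In practice, pruned networks are usually fine-tuned afterward to recover accuracy.

```python
def prune_2_of_4(row: list[float]) -> list[float]:
    """Keep the two largest-magnitude values in each group of four; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out: list[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest |w| in this group (stable sort breaks ties by position).
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def is_2_of_4(row: list[float]) -> bool:
    """Check that every group of four holds at most two nonzero values."""
    return all(
        sum(1 for v in row[i:i + 4] if v != 0) <= 2 for i in range(0, len(row), 4)
    )


if __name__ == "__main__":
    weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
    pruned = prune_2_of_4(weights)
    print(pruned, is_2_of_4(pruned))
```

Magnitude is the simplest selection criterion; it keeps the implementation to a sort per group while still guaranteeing the hardware-required pattern.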

Recommendations by Workload

Agentic Coding

Devstral Small 2 24B Instruct

This model is usable for agentic coding, though it is not the most specialized pick. It belongs to a current frontier family of local AI models and fits natively with comfortable headroom. Available via Hugging Face, Ollama, and LM Studio.

Decode 89.2 tok/s · 47K ctx · llama.cpp

27.0 GB / 40.0 GB VRAM

Chat

Qwen 3 30B A3B

This model is a direct match for chat. It belongs to a current frontier family of local AI models and fits natively with comfortable headroom. Available via Hugging Face, Ollama, and LM Studio.

Decode 181.6 tok/s · 13K ctx · llama.cpp

24.3 GB / 40.0 GB VRAM

Coding

Devstral Small 2 24B Instruct

This model is a direct match for coding. It belongs to a current frontier family of local AI models and fits natively with comfortable headroom. Available via Hugging Face, Ollama, and LM Studio.

Decode 89.2 tok/s · 27K ctx · llama.cpp

23.3 GB / 40.0 GB VRAM

RAG

Codestral 21B Pruned i1

This model is a direct match for RAG. It sits mid-pack among current models and fits natively with comfortable headroom.

Decode 102.0 tok/s · 53K ctx · llama.cpp

24.3 GB / 40.0 GB VRAM

Reasoning

Devstral Small 2 24B Instruct

This model is a direct match for reasoning. It belongs to a current frontier family of local AI models and fits natively with comfortable headroom. Available via Hugging Face, Ollama, and LM Studio.

Decode 89.2 tok/s · 27K ctx · llama.cpp

23.3 GB / 40.0 GB VRAM
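The two Devstral Small 2 24B Instruct listings show how context length drives VRAM use: the same model needs 27.0 GB at 47K context but 23.3 GB at 27K. A minimal sketch of that arithmetic, assuming GB means 10^9 bytes, K means 1,000 tokens, and the entire difference is KV cache (other buffers may also grow with context, so this likely overstates the cache alone). The second function is the usual per-token KV-cache formula for attention with grouped KV heads; the architecture numbers passed to it are hypothetical, not Devstral's published configuration.

```python
# Assumptions (hypothetical): 1 GB = 1e9 bytes, "K" ctx = 1,000 tokens, and the
# whole VRAM difference between two listings of one model is KV cache.


def implied_bytes_per_token(vram_hi_gb: float, ctx_hi: int,
                            vram_lo_gb: float, ctx_lo: int) -> float:
    """Marginal VRAM per extra context token implied by two (VRAM, context) listings."""
    return (vram_hi_gb - vram_lo_gb) * 1e9 / (ctx_hi - ctx_lo)


def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV-cache size per token: one K and one V vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Devstral Small 2 24B Instruct as listed: 27.0 GB at 47K ctx, 23.3 GB at 27K ctx.
    implied = implied_bytes_per_token(27.0, 47_000, 23.3, 27_000)
    print(f"implied: ~{implied / 1e3:.0f} KB per token")
    # Hypothetical config, NOT Devstral's: 40 layers, 8 KV heads, head_dim 128, FP16 cache.
    formula = kv_bytes_per_token(40, 8, 128, 2)
    print(f"formula: ~{formula / 1e3:.0f} KB per token")
```

Comparing the implied figure with the formula for a candidate configuration is a quick sanity check on whether a listing's extra VRAM is mostly KV cache.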

Full Model Compatibility
