Granite 3.1 8B
NVIDIA
The RTX 3060 12GB is one of the most popular entry points for local AI inference. Its generous 12 GB of GDDR6 VRAM, more than the RTX 3060 Ti offers, allows it to run 7B parameter models at 8-bit precision and 13B models with Q4 quantization (a 7B model at full FP16 precision needs roughly 14 GB of weights alone and would not fit). While its compute throughput is modest, the VRAM capacity makes it a budget-friendly option for getting started with local LLMs.
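A rough back-of-the-envelope check makes the fit claims concrete. The sketch below assumes a simple rule of thumb (weights take params × bits / 8 bytes, plus a hypothetical fixed overhead for KV cache and activations); real usage varies with context length and runtime.

```python
def model_vram_gb(params_billion, bits_per_weight, overhead_gb=1.5):
    """Rough VRAM needed: weight bytes plus a fixed runtime allowance.

    overhead_gb is an assumed placeholder for KV cache and activations;
    actual overhead depends on context length and inference engine.
    """
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb + overhead_gb

# 7B at FP16 (~15.5 GB with overhead) exceeds 12 GB; 8-bit and Q4 variants fit.
for name, params, bits in [("7B FP16", 7, 16), ("7B 8-bit", 7, 8), ("13B Q4", 13, 4)]:
    need = model_vram_gb(params, bits)
    verdict = "fits" if need <= 12 else "does not fit"
    print(f"{name}: ~{need:.1f} GB -> {verdict} in 12 GB")
```

By this estimate, 7B FP16 needs about 15.5 GB, while 7B at 8 bits (~8.5 GB) and 13B at Q4 (~8.0 GB) both land comfortably inside 12 GB.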
Architecture
Ampere is NVIDIA's second-generation RTX architecture, built on Samsung's 8nm process. It introduced 3rd-generation Tensor Cores with support for sparsity-accelerated INT8 operations and improved FP16 throughput over Turing.
AI Relevance
Sparsity-aware Tensor Cores can effectively double throughput for structured sparse workloads. However, the lack of FP8 support means quantized inference is less efficient than on Ada Lovelace or Blackwell GPUs.
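The sparsity acceleration applies only to weights that follow the 2:4 structured pattern: in every contiguous group of four values, at most two are nonzero. A minimal sketch of that constraint check (the function name is illustrative, not an NVIDIA API):

```python
def satisfies_2_4_sparsity(weights):
    """Check the 2:4 structured pattern Ampere Tensor Cores accelerate:
    at most 2 nonzero values in every contiguous group of 4 weights."""
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        if sum(1 for w in group if w != 0) > 2:
            return False
    return True

dense  = [0.5, -0.2, 0.1, 0.3]                    # 4 nonzeros in one group
sparse = [0.5, 0.0, 0.0, 0.3, 0.0, 0.1, 0.2, 0.0]  # 2 nonzeros per group
print(satisfies_2_4_sparsity(dense))   # False
print(satisfies_2_4_sparsity(sparse))  # True
```

In practice, models must be pruned (and usually fine-tuned) to this pattern before the hardware speedup applies; arbitrary dense weights do not qualify.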
The RTX 3060 uses the GA106 GPU die with 28 Streaming Multiprocessors, each containing 128 CUDA cores. While its 112 Tensor Cores are modest, they provide meaningful acceleration for quantized inference workloads.
Notably, the RTX 3060 uses a narrower 192-bit memory bus than the RTX 3060 Ti's 256-bit bus, but compensates with more VRAM chips to reach 12 GB total. For AI workloads, this VRAM advantage is significant: it determines which models can run entirely on-GPU versus requiring slower CPU offloading.
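The bus width and memory speed together set peak bandwidth, which in turn bounds token-generation speed, since decoding one token reads essentially all model weights once. A rough sketch, assuming the RTX 3060's published 192-bit bus and 15 Gbps GDDR6 (and ignoring caches and compute limits, so the result is an upper bound):

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth: bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

def rough_tokens_per_s(model_size_gb, bandwidth_gbs):
    """Decode is roughly bandwidth-bound: each generated token streams all
    weights from VRAM once. A crude upper bound, not a benchmark."""
    return bandwidth_gbs / model_size_gb

bw = memory_bandwidth_gbs(192, 15)  # RTX 3060: 192-bit bus, 15 Gbps GDDR6
print(f"{bw:.0f} GB/s peak bandwidth")
print(f"~{rough_tokens_per_s(7.0, bw):.0f} tok/s upper bound for a ~7 GB model")
```

This yields 360 GB/s, matching the card's official spec, and suggests why a model that spills into system RAM (tens of GB/s over PCIe) slows down so sharply.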
Agentic Coding
This model is still usable for agentic coding, but it is not the most specialized pick. It sits in the middle of the current model mix. It fits natively with comfortable headroom. Known channels: huggingface, ollama.
Chat
This model is a direct match for chat. It belongs to a current frontier family for local AI. It fits natively with comfortable headroom. Known channels: huggingface, ollama, lm-studio.
Coding
This model is still usable for coding, but it is not the most specialized pick. It sits in the middle of the current model mix. It fits natively with comfortable headroom. Known channels: huggingface, ollama.
RAG
This model is a direct match for RAG. It sits in the middle of the current model mix. It fits natively with comfortable headroom.
Reasoning
This model is a direct match for reasoning. It belongs to a current frontier family for local AI. It fits natively with comfortable headroom. Known channels: huggingface, ollama, lm-studio.
Just out of reach
High-quality models that need a bit more memory
Upgrade paths
See what you unlock with more powerful hardware