Model Versions


Total parameters (V3.2): 671B
Active per token (MoE): ~37B
Context window (tokens): 128K

| Model | Parameters | Best For |
|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | Mobile devices, edge computing, quick prototypes |
| R1-Distill-Qwen-7B | 7B | Consumer GPUs, local development, cost-sensitive deployments |
| R1-Distill-Llama-8B | 8B | Consumer GPUs, Llama ecosystem compatibility |
| R1-Distill-Qwen-14B | 14B | Mid-range GPUs, balanced quality and speed |
| R1-Distill-Qwen-32B | 32B | High-quality local inference, professional workstations |
| R1-Distill-Llama-70B | 70B | Server deployment, near-frontier quality at lower cost |

DeepSeek model history

DeepSeek LLM (V1), Nov 2023: 67B parameters; initial competitive Chinese LLM
DeepSeek V2, May 2024: introduced MoE + MLA; 236B total / 21B active parameters
DeepSeek V2.5, Sep 2024: merged chat and coding into a unified model
DeepSeek V3, Jan 2025: 671B total / 37B active MoE; frontier performance at a fraction of the cost
DeepSeek R1, Jan 2025: reasoning model; the "DeepSeek moment"
R1-0528, May 2025: updated reasoning model with improved accuracy
DeepSeek V3.2, Dec 2025: current flagship; refined V3 across all tasks

The Mixture of Experts advantage

The MoE architecture is how DeepSeek delivers frontier quality at dramatically lower cost. The model's 671B parameters are organized into expert subnetworks, but a router activates only ~37B of them for each token. The result is the knowledge capacity of a massive model at the computational cost of a much smaller one.
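The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert routing in general, not DeepSeek's actual architecture; the sizes, the linear "experts," and the router are all simplified stand-ins chosen for readability.

```python
import numpy as np

# Toy top-k Mixture-of-Experts routing (illustrative sizes only;
# real models use FFN experts, not single linear maps).
rng = np.random.default_rng(0)

d_model = 16      # token embedding size (toy)
n_experts = 8     # total expert subnetworks ("knowledge capacity")
top_k = 2         # experts actually run per token ("compute cost")

experts = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                     # router score per expert
    top = np.argsort(logits)[-top_k:]       # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts
    # Only k of n_experts execute, so FLOPs scale with k while
    # total parameters scale with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Here only 2 of 8 experts run per token, so the forward pass costs roughly a quarter of a dense model of the same total size, which is the same lever DeepSeek pulls at 671B total / ~37B active.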