Model Versions | DeepSeek

Total parameters (V4 Pro): 1.6T

Active per token (V4 Pro MoE): ~49B

Context window (tokens): 1M

Model	Parameters	Best For
R1-Distill-Qwen-1.5B	1.5B	Mobile devices, edge computing, quick prototypes
R1-Distill-Qwen-7B	7B	Consumer GPUs, local development, cost-sensitive deployments
R1-Distill-Llama-8B	8B	Consumer GPUs, Llama ecosystem compatibility
R1-Distill-Qwen-14B	14B	Mid-range GPUs, balanced quality and speed
R1-Distill-Qwen-32B	32B	High-quality local inference, professional workstations
R1-Distill-Llama-70B	70B	Server deployment, near-frontier quality at lower cost
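
As a rough guide to matching these models to hardware, the memory needed for the weights can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a sizing tool. It assumes the nominal sizes from the table, 2 bytes per parameter for 16-bit weights and 0.5 bytes for 4-bit quantization, and it counts weights only, ignoring KV cache, activations, and runtime overhead. The name `weight_gb` is hypothetical.

```python
# Back-of-the-envelope weight memory for the distilled models.
# Assumptions: nominal parameter counts from the table, weights only,
# 1 GB = 1e9 bytes. Real usage is higher (KV cache, activations, overhead).
DISTILLED_PARAMS_B = {
    "R1-Distill-Qwen-1.5B": 1.5,
    "R1-Distill-Qwen-7B": 7,
    "R1-Distill-Llama-8B": 8,
    "R1-Distill-Qwen-14B": 14,
    "R1-Distill-Qwen-32B": 32,
    "R1-Distill-Llama-70B": 70,
}

# Assumed storage cost per parameter for two common precisions.
BYTES_PER_PARAM = {"16-bit": 2.0, "4-bit": 0.5}


def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Billions of parameters times bytes per parameter gives GB of weights."""
    return params_b * bytes_per_param


if __name__ == "__main__":
    for name, params_b in DISTILLED_PARAMS_B.items():
        sizes = ", ".join(
            f"{label}: ~{weight_gb(params_b, nbytes):.1f} GB"
            for label, nbytes in BYTES_PER_PARAM.items()
        )
        print(f"{name}: {sizes}")
```

Under these assumptions the 7B model needs about 14 GB of weights at 16-bit precision, which is why quantized variants are the usual choice on consumer GPUs.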

DeepSeek model history

Model	Released	Notes
DeepSeek LLM (V1)	Nov 2023	67B parameters; initial competitive Chinese LLM
DeepSeek V2	May 2024	Introduced MoE + MLA; 236B total / 21B active
DeepSeek V2.5	Sep 2024	Merged chat and coding into a unified model
DeepSeek V3	Dec 2024	671B total / 37B active MoE; frontier performance at a fraction of the cost
DeepSeek R1	Jan 2025	Reasoning model; the 'DeepSeek moment'
R1-0528	May 2025	Updated reasoning with improved accuracy
DeepSeek V3.1	Aug 2025	Hybrid reasoning architecture; a single model supports thinking and non-thinking modes
DeepSeek V3.2	Dec 2025	Refined V3 across all tasks
DeepSeek V4	Apr 2026	V4 Pro (1.6T total / 49B active) and V4 Flash (284B total / 13B active); 1M context; open-sourced

The Mixture of Experts advantage

The MoE architecture is why DeepSeek can deliver frontier quality at dramatically lower cost. V4 Pro has 1.6T parameters organized into expert subnetworks, but only ~49B of them activate for any given token. V4 Flash uses 284B total parameters with just ~13B active. The result is the knowledge capacity of a very large model at roughly the per-token compute cost of a dense model the size of the active subset.
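
The ratio behind this claim can be checked with a little arithmetic. The sketch below is illustrative only: it uses the parameter counts quoted above, and both `active_fraction` and the toy `route_top_k` router are hypothetical names, not DeepSeek's implementation. Real MoE routers are learned layers. Here the expert scores are simply given, to show how selecting a few experts per token keeps per-token compute small.

```python
# Illustrative sketch only; not DeepSeek's actual routing code.
# Parameter counts (in billions) are the figures quoted in this article.
MODELS = {
    "V4 Pro": {"total_b": 1600, "active_b": 49},
    "V4 Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of all parameters that are used for a single token."""
    return active_b / total_b


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Toy router: return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    for name, p in MODELS.items():
        frac = active_fraction(p["total_b"], p["active_b"])
        print(f"{name}: {frac:.1%} of parameters active per token")

    # Hypothetical scores for 8 experts; only 2 of them run for this token.
    print(route_top_k([0.1, 0.7, 0.05, 0.3, 0.9, 0.2, 0.0, 0.4], k=2))
```

With the quoted figures, about 3.1% of V4 Pro's parameters and about 4.6% of V4 Flash's parameters are active for each token.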