Model Versions


Total parameters (V3.2): 671B
Active per token (MoE): ~37B
Context window (tokens): 128K

| Model | Parameters | Best For |
|---|---|---|
| R1-Distill-Qwen-1.5B | 1.5B | Mobile devices, edge computing, quick prototypes |
| R1-Distill-Qwen-7B | 7B | Consumer GPUs, local development, cost-sensitive deployments |
| R1-Distill-Llama-8B | 8B | Consumer GPUs, Llama ecosystem compatibility |
| R1-Distill-Qwen-14B | 14B | Mid-range GPUs, balanced quality and speed |
| R1-Distill-Qwen-32B | 32B | High-quality local inference, professional workstations |
| R1-Distill-Llama-70B | 70B | Server deployment, near-frontier quality at lower cost |

DeepSeek model history

DeepSeek LLM (V1), Nov 2023: 67B parameters; initial competitive Chinese LLM
DeepSeek V2, May 2024: introduced MoE + MLA; 236B total / 21B active parameters
DeepSeek V2.5, Sep 2024: merged chat and coding into a unified model
DeepSeek V3, Jan 2025: 671B total / 37B active MoE; frontier performance at a fraction of the cost
DeepSeek R1, Jan 2025: reasoning model; the "DeepSeek moment"
R1-0528, May 2025: updated reasoning model with improved accuracy
DeepSeek V3.2, Dec 2025: current flagship; refined V3 across all tasks

The Mixture of Experts advantage

The MoE architecture is how DeepSeek delivers frontier quality at dramatically lower cost. The model's 671B parameters are organized into expert subnetworks, but a router activates only ~37B of them for each token. The result is the knowledge capacity of a massive model at the computational cost of a much smaller one.
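The routing idea can be sketched in a few lines. This is a toy illustration of top-k expert routing in general, not DeepSeek's actual architecture; the sizes, the linear "experts," and the router are all simplified stand-ins chosen for readability.

```python
import numpy as np

# Toy top-k Mixture-of-Experts routing (illustrative sizes only;
# real models use FFN experts, not single linear maps).
rng = np.random.default_rng(0)

d_model = 16      # token embedding size (toy)
n_experts = 8     # total expert subnetworks ("knowledge capacity")
top_k = 2         # experts actually run per token ("compute cost")

experts = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                     # router score per expert
    top = np.argsort(logits)[-top_k:]       # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts
    # Only k of n_experts execute, so FLOPs scale with k while
    # total parameters scale with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (16,)
```

Here only 2 of 8 experts run per token, so the forward pass costs roughly a quarter of a dense model of the same total size, which is the same lever DeepSeek pulls at 671B total / ~37B active.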