Model Lineup and Selection Guide

[Chart: approximate cost per 1M output tokens (USD)]

Specialized models beyond text generation

🖼️
Image Generation

Flash Image and 3 Pro Image models generate and edit images natively within the conversation.

🗣️
Text to Speech

Flash TTS and Pro TTS models generate controllable, natural speech from text.

🎙️
Live Audio/Video

Native Audio models power real-time voice agents with natural pacing via the Live API.

🖥️
Computer Use

Browser-control agents automate tasks by viewing the screen and interacting with UI elements.

🤖
Robotics ER

Embodied reasoning model for physical-world interaction and robot control.

🔍
Embeddings

High-quality vector embeddings for semantic search, clustering, and RAG pipelines.
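
To make the semantic search use case concrete, here is a minimal sketch of ranking documents by cosine similarity between embedding vectors. The three-dimensional vectors and the document names are made up for illustration; in practice the vectors would come from an embeddings model and have many more dimensions. The helper names `cosine_similarity` and `rank_documents` are hypothetical, not part of any SDK.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Sort (doc_id, score) pairs from most to least similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (assumed values).
docs = {
    "pricing": [0.9, 0.1, 0.0],
    "robotics": [0.0, 0.2, 0.9],
    "billing": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In a RAG pipeline, the top-ranked documents from a step like this would be passed to a text model as context.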

Model selection rule of thumb

Start with Gemini 2.5 Flash for most tasks. It provides thinking capabilities, a 1M-token context window, and a strong balance of cost and quality. Move to Pro only when you genuinely need stronger reasoning, and move to Flash-Lite when cost is the primary constraint.
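
The rule of thumb can be sketched as a small selection function. This is an illustration, not an official API: the `TaskProfile` type and `pick_model` function are hypothetical, and the model ID strings are assumed, so check them against the current model list. The text does not say what to do when a task needs stronger reasoning and is also cost-constrained; this sketch assumes reasoning takes priority.

```python
from dataclasses import dataclass

# Assumed model IDs; verify against the current model list before use.
FLASH = "gemini-2.5-flash"
PRO = "gemini-2.5-pro"
FLASH_LITE = "gemini-2.5-flash-lite"


@dataclass
class TaskProfile:
    """Hypothetical description of what a task requires."""
    needs_stronger_reasoning: bool = False
    cost_is_primary_constraint: bool = False


def pick_model(task: TaskProfile) -> str:
    """Apply the rule of thumb: default to Flash, deviate only when needed."""
    if task.needs_stronger_reasoning:
        # Assumption: reasoning needs outrank cost when both flags are set.
        return PRO
    if task.cost_is_primary_constraint:
        return FLASH_LITE
    return FLASH


print(pick_model(TaskProfile()))
print(pick_model(TaskProfile(needs_stronger_reasoning=True)))
print(pick_model(TaskProfile(cost_is_primary_constraint=True)))
```

Keeping the default in one place makes it easy to start every workload on Flash and to record an explicit reason whenever a task is moved to another tier.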