Approximate cost per 1M output tokens (USD)
Specialized models beyond text generation
🖼️
Image Generation
The Flash Image and 3 Pro Image models generate and edit images natively within the conversation.
🗣️
Text to Speech
Flash TTS and Pro TTS models generate controllable, natural speech from text.
🎙️
Live Audio/Video
Native Audio models power real-time voice agents with natural pacing via the Live API.
🖥️
Computer Use
Browser-control agents that automate tasks by viewing the screen and interacting with UI elements.
🤖
Robotics ER
Embodied-reasoning model for physical-world interaction and robot control.
🔍
Embeddings
High-quality vector embeddings for semantic search, clustering, and RAG pipelines.
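To make the semantic search use case concrete, here is a minimal sketch of ranking documents by cosine similarity to a query. The three-dimensional vectors and the document IDs are made up for illustration; real vectors would come from an embedding model and have far more dimensions. The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical, not part of any SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumption: illustrative values only).
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(query_vec, doc_vecs)
print(ranking[0][0])  # the document whose vector points most nearly the same way as the query
```

In a RAG pipeline, the top-ranked documents from a step like this would typically be passed to the generation model as context.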
Model selection rule of thumb
Start with Gemini 2.5 Flash for most tasks. It offers thinking capabilities, a 1M-token context window, and a strong balance of cost and quality. Move to Pro only when you genuinely need stronger reasoning, and to Flash-Lite when cost is the primary constraint.
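The rule of thumb can be sketched as a small selection function. This is only an illustration: the function name `pick_model_tier` and its two boolean flags are hypothetical, the returned strings are tier labels rather than API model identifiers, and letting reasoning needs win when both flags are set is an assumption the text does not settle.

```python
def pick_model_tier(needs_strong_reasoning: bool = False,
                    cost_is_primary: bool = False) -> str:
    """Return a tier label following the Flash-first rule of thumb."""
    # Assumption: if both flags are set, reasoning quality takes priority,
    # since a cheaper model that cannot do the task saves nothing.
    if needs_strong_reasoning:
        return "Pro"
    if cost_is_primary:
        return "Flash-Lite"
    # Default: the balanced choice for most tasks.
    return "Flash"

print(pick_model_tier())                             # Flash
print(pick_model_tier(needs_strong_reasoning=True))  # Pro
print(pick_model_tier(cost_is_primary=True))         # Flash-Lite
```

Keeping Flash as the fall-through default mirrors the advice to move away from it only when a specific constraint demands it.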