Multimodal Input: Text, Images, Audio, Video, PDFs | Gemini API

Modality	OpenAI GPT	Gemini
Text	Native	Native
Images	Native	Native
Audio	Native	Native (up to 11 hrs)
Video	Frame extraction only	Native (up to 1 hr)
PDFs	Via file upload	Native (up to 1,000 pages)

📄 Text: Prompts, code, structured data, instructions

🖼️ Images: JPEG, PNG, GIF, WebP for visual QA and OCR

🎵 Audio: Up to 11 hours of MP3, WAV, FLAC for transcription and analysis

🎬 Video: Up to 1 hour of MP4 for scene analysis and content extraction

📑 PDFs: Up to 1,000 pages with full multimodal document understanding
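The formats and limits listed above can be checked before a file is sent. The following is a minimal sketch, not part of any SDK: the names `classify` and `within_limits` are hypothetical, the extension table covers only the formats named here, and the caller is assumed to supply the duration or page count from its own tooling.

```python
from pathlib import Path
from typing import Optional

# Extensions for the formats named in the list above (illustrative, not exhaustive).
MODALITY_BY_EXTENSION = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".gif": "image", ".webp": "image",
    ".mp3": "audio", ".wav": "audio", ".flac": "audio",
    ".mp4": "video",
    ".pdf": "pdf",
}

# Limits quoted in the list above.
MAX_AUDIO_SECONDS = 11 * 60 * 60   # 11 hours
MAX_VIDEO_SECONDS = 1 * 60 * 60    # 1 hour
MAX_PDF_PAGES = 1000


def classify(path: str) -> str:
    """Map a file name to a modality based on its extension."""
    ext = Path(path).suffix.lower()
    if ext not in MODALITY_BY_EXTENSION:
        raise ValueError(f"unsupported file type: {ext!r}")
    return MODALITY_BY_EXTENSION[ext]


def within_limits(modality: str,
                  duration_seconds: Optional[float] = None,
                  pages: Optional[int] = None) -> bool:
    """Return True if the input fits the limit quoted for its modality."""
    if modality == "audio":
        if duration_seconds is None:
            raise ValueError("audio needs duration_seconds")
        return duration_seconds <= MAX_AUDIO_SECONDS
    if modality == "video":
        if duration_seconds is None:
            raise ValueError("video needs duration_seconds")
        return duration_seconds <= MAX_VIDEO_SECONDS
    if modality == "pdf":
        if pages is None:
            raise ValueError("pdf needs pages")
        return pages <= MAX_PDF_PAGES
    # Text and images have no duration or page limit in the list above.
    return True
```

For example, `within_limits(classify("lecture.mp4"), duration_seconds=5400)` returns `False`, because 90 minutes exceeds the 1-hour video limit.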

Audio pricing differs from text pricing

On models such as Gemini 2.5 Flash, audio input costs $1.00 per million tokens, compared with $0.30 per million tokens for text, image, and video input. Plan your multimodal usage accordingly: audio-heavy workloads may be cheaper if you transcribe the audio to text first, provided the use case does not require acoustic analysis.
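The trade-off can be estimated with simple arithmetic. The sketch below uses only the two rates quoted above. The function names and the token counts in the example are hypothetical, and the estimate ignores whatever it costs to produce the transcript in the first place.

```python
# USD per million input tokens, as quoted above for Gemini 2.5 Flash.
AUDIO_RATE_PER_MILLION = 1.00
TEXT_RATE_PER_MILLION = 0.30


def input_cost(tokens: int, rate_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given rate."""
    return tokens / 1_000_000 * rate_per_million


def transcription_saving(audio_tokens: int, transcript_tokens: int) -> float:
    """Difference in input cost between sending the raw audio and sending
    a text transcript of it. Positive means the transcript is cheaper.
    Does not include the cost of creating the transcript."""
    audio_cost = input_cost(audio_tokens, AUDIO_RATE_PER_MILLION)
    text_cost = input_cost(transcript_tokens, TEXT_RATE_PER_MILLION)
    return audio_cost - text_cost


if __name__ == "__main__":
    # Hypothetical workload: token counts are made up for illustration.
    saving = transcription_saving(audio_tokens=1_000_000, transcript_tokens=200_000)
    print(f"Estimated input saving: ${saving:.2f}")
```

In practice, take the token counts from your own measurements rather than guessing them, since the ratio between audio tokens and transcript tokens depends on the content.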