Multimodal Input: Text, Images, Audio, Video, PDFs

| Modality | OpenAI GPT | Gemini API |
|----------|------------|------------|
| Text | Native | Native |
| Images | Native | Native |
| Audio | Native | Native (up to 11 hrs) |
| Video | Frame extraction only | Native (up to 1 hr) |
| PDFs | Via file upload | Native (up to 1,000 pages) |

📄 Text: Prompts, code, structured data, instructions
🖼️ Images: JPEG, PNG, GIF, WebP for visual QA and OCR
🎵 Audio: Up to 11 hours of MP3, WAV, FLAC for transcription and analysis
🎬 Video: Up to 1 hour of MP4 for scene analysis and content extraction
📑 PDFs: Up to 1,000 pages with full multimodal document understanding

Audio pricing differs from text pricing

On models like Gemini 2.5 Flash, audio input costs $1.00 per million tokens, compared with $0.30 per million tokens for text, image, and video input. Plan your multimodal usage accordingly. Audio-heavy workloads may benefit from transcribing audio to text first if the use case does not require acoustic analysis.
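
To make the trade-off concrete, here is a rough cost sketch in Python. The two prices are the ones quoted above. The rate of 32 tokens per second of audio is an assumption based on how the Gemini documentation describes audio tokenization, so verify it against the current docs. The function names and the sample workload are hypothetical, chosen only for illustration.

```python
# Prices in USD per one million input tokens, as quoted in the text above.
AUDIO_PRICE_PER_M = 1.00
TEXT_PRICE_PER_M = 0.30

# Assumption: roughly 32 tokens per second of audio (check the current docs).
AUDIO_TOKENS_PER_SECOND = 32


def input_cost(tokens: int, price_per_million: float) -> float:
    """Return the input cost in USD for a given token count."""
    return tokens / 1_000_000 * price_per_million


def audio_tokens(seconds: float) -> int:
    """Estimate how many tokens a clip of the given length consumes."""
    return round(seconds * AUDIO_TOKENS_PER_SECOND)


def compare_costs(audio_seconds: float, transcript_tokens: int) -> dict:
    """Compare sending raw audio with sending a text transcript instead."""
    native = input_cost(audio_tokens(audio_seconds), AUDIO_PRICE_PER_M)
    as_text = input_cost(transcript_tokens, TEXT_PRICE_PER_M)
    return {"native_audio_usd": native, "transcript_usd": as_text}


if __name__ == "__main__":
    # Hypothetical workload: one hour of audio whose transcript
    # comes to about 10,000 tokens.
    print(compare_costs(3600, 10_000))
```

Under these assumptions, an hour of native audio input costs roughly $0.12, while its transcript costs a fraction of a cent as text input. The comparison leaves out the cost of producing the transcript, and it only applies when tone, speaker, and other acoustic cues do not matter.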