NativeAIHub
Google Gemma logo

Google Gemma

open-source

by google

Google DeepMind's open weight LLM family built from the same research as Gemini, now in its fourth generation with sizes from E2B to 31B parameters.

Key features

Open Weights
Multimodal Input (Text, Image, Video, Audio)
Up to 256K Token Context Window
140+ Language Support
Quantization Aware Training
Function Calling and Agentic Workflows
Pricing

Free tier available

Best For

Developers wanting to run AI models locally on their own hardware without API dependencies or usage fees

Verdict

Completely free to download and use, with no API fees or usage limits when self hosted

What it does

Open Weights

Download the full model weights and run Gemma anywhere: your laptop, a cloud server, a mobile device, or an edge appliance. No API dependency, no usage fees, complete control.

Learn more

Multimodal Input (Text, Image, Video, Audio)

Gemma 4 models natively process text, images, and video with variable resolution support. The E2B and E4B edge models also feature native audio input for speech recognition and understanding.

Up to 256K Token Context Window

Process long documents, codebases, and extended conversations with a 256K token context window on the 26B and 31B models, and 128K on the E2B and E4B edge models.

140+ Language Support

Gemma 4 supports over 140 languages out of the box, making it one of the most multilingual open model families available.

Quantization Aware Training

Gemma checkpoints are trained with quantization awareness, meaning the models maintain high quality even when compressed to lower bit formats (INT4, INT8) for efficient deployment.

Learn more

Function Calling and Agentic Workflows

Gemma 4 features native support for function calling, structured JSON output, and system instructions, enabling you to build autonomous agents that interact with tools and APIs and execute multi step workflows reliably.

Runs on Consumer Hardware

Gemma 4 models are sized to run and fine tune efficiently on hardware ranging from billions of Android devices to laptop GPUs and developer workstations. Quantized versions of the 26B and 31B run natively on consumer GPUs.

Learn more

Gemma 4 E2B and E4B for Mobile and On Device

Edge optimized models that activate an effective 2 billion and 4 billion parameter footprint during inference to preserve RAM and battery life. They support text, image, video, and audio input, run completely offline with near zero latency on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano.

Specialized Variants

A growing ecosystem of purpose built models: MedGemma (medical), CodeGemma (code generation), PaliGemma (vision and language), ShieldGemma (safety classification), FunctionGemma (on device function calling and tool use), and TranslateGemma (translation).

Learn more

Available Everywhere

Download from Hugging Face, run via Ollama, experiment on Kaggle, or use through Google AI Studio and Vertex AI. Gemma integrates into every major ML ecosystem and inference framework.

Fine Tuning Friendly

All Gemma models support LoRA, QLoRA, and full parameter fine tuning. Gemma 4 can be fine tuned on platforms ranging from Google Colab and Vertex AI to consumer gaming GPUs, with day one support from Unsloth, Keras, and Hugging Face TRL.

Learn more

Pricing

Open Weight

Free

Gemma is completely free to download and run. All model weights are available on Hugging Face, Kaggle, and other platforms at no cost. You only pay for the compute hardware you choose to run the models on, whether that is your own laptop, a cloud GPU, or a managed inference service.

  • All model sizes (E2B, E4B, 26B, 31B) free to download
  • Full model weights on Hugging Face, Kaggle, and Ollama
  • Run locally via Ollama, llama.cpp, vLLM, LM Studio, or any compatible framework
  • Apache 2.0 license with full commercial use and open source flexibility
  • Fine tuning and modification allowed
  • No API fees or usage limits when self hosted
  • Community of 100,000+ variants and fine tunes

Pros & Cons

Pros

  • Completely free to download and use, with no API fees or usage limits when self hosted
  • Runs on consumer hardware: quantized Gemma 4 models run natively on consumer GPUs, and E2B/E4B run on smartphones and edge devices
  • Up to 256K token context window on Gemma 4 larger models (128K on edge models), competitive with much larger proprietary models
  • Multimodal input (text, image, video) across all Gemma 4 sizes, plus native audio on E2B/E4B edge models
  • Massive community with 400M+ downloads and 100,000+ fine tuned variants on Hugging Face
  • Specialized variants for medical, safety, code, vision, and translation tasks, ready to use out of the box
  • Quantization aware trained checkpoints maintain high quality even at INT4 precision, maximizing hardware efficiency
  • Built from the same research as Gemini, delivering frontier quality relative to model size

Cons

  • Gemma 4 is Apache 2.0, but older Gemma 3 and earlier models remain under the more restrictive Gemma Terms of Use, so check which generation you are using
  • Smaller than proprietary Gemini models (31B max vs Gemini's much larger architectures), so raw capability has a ceiling
  • As an open weight model you run locally, there is no live data access, so responses are limited to the model's training data cutoff
  • Audio input is only available on the E2B and E4B edge models; the 26B and 31B models do not support native audio
  • Requires technical setup to run locally: downloading models, configuring inference frameworks, and managing hardware resources
  • The E2B edge model is limited in reasoning capability compared to the larger sizes and is best suited for simple on device tasks or fine tuned applications

How to get started

1

Choose your model size

Select the Gemma 4 variant that fits your hardware. The E2B and E4B models run on smartphones, Raspberry Pi, and edge devices. The 26B MoE model (4B active parameters) focuses on inference speed and fits on consumer GPUs. The 31B Dense model maximizes quality and fits on a single 80GB GPU in bfloat16, or on consumer GPUs when quantized.

2

Run with Ollama (easiest path)

Install Ollama, then run 'ollama run gemma4' to start chatting with Gemma 4 locally. Ollama handles downloading, quantization, and serving automatically. Choose the model variant that fits your hardware from the Ollama library.

3

Try in Google AI Studio (no setup)

If you want to experiment without any local setup, use Google AI Studio to try Gemma 4 31B and 26B directly in the browser. This is useful for testing prompts and capabilities before committing to a local deployment.

4

Fine tune for your use case

Once you have selected a base model, fine tune it on your own data using LoRA or QLoRA for maximum customization. Hugging Face TRL, Unsloth, and Keras all support Gemma 4 fine tuning. You can fine tune on platforms from Google Colab and Vertex AI to consumer gaming GPUs.

Deep dive

Detailed guides with comparisons, tips, and visuals for each feature.

Get notified about updates

We'll email you when this tool's pricing or features change.

Last updated: 2026-06-01