Google launches Gemma 4 under Apache 2.0: open, multimodal AI from phones to workstations

Google’s Gemma 4 lands under Apache 2.0 with edge-ready E2B/E4B and 26B/31B models, agentic features, and top Arena ranks—built to run on your hardware.

ASOasis


Google releases Gemma 4 under Apache 2.0

Google DeepMind has launched Gemma 4, a new open-weight model family designed for advanced reasoning and agentic workflows, and licensed under Apache 2.0. Announced on April 2, 2026, the lineup spans phones to workstations and is built on the same research foundations as Gemini 3. Google says the 31B dense model currently ranks third among open models on Arena AI’s text leaderboard, with the 26B Mixture-of-Experts (MoE) model in sixth place. (blog.google)

Four models, one playbook: run anywhere

Gemma 4 arrives in four sizes tailored to different deployment targets:

  • E2B (Effective 2B) and E4B (Effective 4B) for phones, IoT, and other edge devices
  • 26B MoE prioritizing speed by activating ~4B parameters at inference
  • 31B dense maximizing raw quality on consumer GPUs and workstations

Google frames the family as “byte for byte” its most capable open models to date, adding native support for function calling, structured JSON output, and system prompts to power agentic pipelines beyond basic chat. All models are multimodal for text and images, with audio supported on the small edge variants. (blog.google)
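The agentic loop those features enable is straightforward to sketch: the model emits a structured tool call as JSON, and the host application parses and dispatches it. The JSON shape and tool names below are illustrative assumptions, not Gemma 4’s documented format:

```python
import json

# Hypothetical tool registry; in a real pipeline these would be your own
# application functions, described to the model in the system prompt.
TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",
}

def dispatch(raw: str) -> str:
    """Parse a model-emitted call like {"name": ..., "arguments": {...}} and run it.

    The exact schema here is an assumption for illustration only.
    """
    call = json.loads(raw)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# The model's structured output would arrive as a JSON string:
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

In practice you would validate the parsed call against a schema and feed the tool result back into the conversation for the next turn.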

Model-card details confirm long-context windows—128K on E2B/E4B and up to 256K on 26B/31B—plus multilingual coverage trained across 140+ languages. The 26B MoE activates 3.8B parameters during inference, while E-series variants use per-layer embeddings to maximize parameter efficiency within tight memory budgets. (huggingface.co)
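The active-parameter gap is the point of the MoE design: per-token compute scales with activated parameters, not total parameters. A rough back-of-envelope sketch, using the common ~2 FLOPs-per-parameter-per-token rule of thumb (the exact constant varies by implementation and sequence length):

```python
# Rough estimate: decoding compute per token scales with *active* parameters.
def flops_per_token(active_params: float) -> float:
    # ~2 FLOPs per active parameter per generated token (rule of thumb).
    return 2 * active_params

dense_31b = flops_per_token(31e9)   # 31B dense: all parameters active
moe_26b = flops_per_token(3.8e9)    # 26B MoE: 3.8B active per the model card

print(f"MoE per-token compute is ~{dense_31b / moe_26b:.1f}x lower than 31B dense")
```

Memory is the trade-off: all 26B parameters must still be resident, so the MoE buys throughput, not a smaller footprint.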

Why the Apache 2.0 pivot matters

Previous Gemma generations shipped under more restrictive terms; Gemma 4 moves to Apache 2.0, removing ambiguity around commercial use and derivative deployment. That licensing shift could be as consequential as the raw benchmark gains, opening the door for enterprises to standardize on locally deployed, fine-tuned variants without bespoke legal review. (blog.google)

Google is pairing the license with broad, day-one ecosystem support: Hugging Face Transformers (and Transformers.js), vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM, ROCm for AMD GPUs, and more. Weights are available via Hugging Face, Kaggle, and Ollama, with simple paths to customize on Colab or Vertex AI. A Kaggle “Gemma 4 Good Challenge” accompanies the release. (blog.google)

Built for the edge—and Android

A companion rollout focuses squarely on on-device AI. Developers can access Android’s built-in Gemma 4 through the new AICore Developer Preview, or use Google AI Edge and the AI Edge Gallery to prototype fully on-device agent skills. Google’s LiteRT-LM library adds constrained decoding, dynamic context lengths, and aggressive quantization options; Google cites a Raspberry Pi 5 prefill throughput of ~133 tokens/sec on E2B. A new litert-lm CLI enables no-code trials and tool-calling demos. (developers.googleblog.com)

On Android, Gemma 4 underpins the next Gemini Nano generation: code built against Gemma 4 today is intended to run on Gemini Nano 4 devices later this year. Google highlights improved reasoning, math, time understanding, OCR, and energy use, noting the E2B variant runs roughly 3× faster than E4B for latency-sensitive tasks. (android-developers.googleblog.com)

Benchmarks and capabilities at a glance

Early model-card results show strong step-ups versus prior open models of similar size. Selected instruction-tuned scores for Gemma 4 include: MMLU Pro 85.2 (31B) and 82.6 (26B A4B); AIME 2026 (no tools) 89.2 (31B); LiveCodeBench v6 80.0 (31B). Long-context retrieval (MRCR v2, 128K) reaches 66.4 (31B). These are reported across reasoning, coding, long-context, and vision tasks, with audio metrics provided for the E-series. (huggingface.co)

On the community-driven Arena leaderboard (snapshot dated March 31, 2026), gemma-4-31b appears at #3 among open models, and gemma-4-26b-a4b at #6, reflecting rapid community validation alongside Google’s internal evaluations. (arena.ai)

Video and visual token budgeting

In addition to images, the 31B instruction-tuned model can process video as sequences of frames and supports configurable visual token budgets to balance speed versus fidelity—useful for toggling between quick classification and detailed document OCR. NVIDIA’s model card lists support for video inputs up to about 60 seconds at one frame per second and predefined visual token budgets (e.g., 70–1120). (build.nvidia.com)
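Treating video as a frame sequence under those limits reduces to simple index arithmetic before any model call. A minimal sketch, assuming the stated ~1 frame/sec sampling rate and ~60-second cap (function name and defaults are illustrative, not part of any Gemma 4 API):

```python
# Pick which frames of a decoded video to feed the model, assuming
# ~1 sampled frame per second and a ~60-frame (60-second) ceiling.
def sample_frame_indices(duration_s: float, video_fps: float,
                         sample_fps: float = 1.0, max_frames: int = 60) -> list[int]:
    n = min(int(duration_s * sample_fps), max_frames)
    step = video_fps / sample_fps  # source frames between samples
    return [int(i * step) for i in range(n)]

# A 10-second clip recorded at 30 fps yields one frame per second:
print(sample_frame_indices(duration_s=10, video_fps=30))
```

Longer clips simply truncate at the cap; trading `max_frames` against the visual token budget is where the speed/fidelity knob the article describes comes in.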

Safety and the road back from 2025

Gemma’s renewed push into open models follows a bruising episode in November 2025, when Google pulled Gemma access from AI Studio after a high-profile hallucination involving a U.S. senator prompted backlash and renewed scrutiny of model governance. The company repositioned Gemma as strictly developer-facing at the time. (techradar.com)

With Gemma 4, Google emphasizes “trust and safety” process rigor within an open, Apache-licensed release, pitching the models as transparent, sovereign-friendly foundations that still meet enterprise security expectations. As always with open weights, the burden shifts to implementers to add domain-specific guardrails and monitoring in production. (blog.google)

Getting started

  • Try Gemma 4 (31B and 26B) via Google AI Studio; run E2B/E4B locally through Google AI Edge Gallery, AICore, and LiteRT-LM; or pull weights from major hosting hubs. (blog.google)
  • On Android, enroll in the AICore Developer Preview to begin prototyping agentic, on-device flows that will map to Gemini Nano 4. (android-developers.googleblog.com)

Quick start (desktop, text‑only example):

# Minimal text-only example with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-31B-it"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Instruction-tuned ("-it") checkpoints expect the chat template, not raw text.
messages = [{"role": "user", "content": "Explain how you’d schedule three dependent tasks with retries."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=300)
print(tok.decode(outputs[0], skip_special_tokens=True))

The bottom line

Gemma 4 is less a single model than a local-first agent stack: Apache-licensed open weights, phone-class variants with native audio and long context, workstation-class models with competitive reasoning, and an ecosystem built to run them anywhere. For teams balancing latency, privacy, and cost, this is a notable upgrade—especially if you can keep inference on your hardware and reserve cloud capacity for scale-out bursts. (blog.google)
