Google Just Released the Most Capable Open Source AI Agent Model. Here Is What It Means for Your Business.
Google's Gemma 4 scored 86.4% on tau2-bench for agentic tasks, a 13x jump over Gemma 3. Here is what the most capable open source AI agent model means for businesses building AI systems in 2026.

On April 2, 2026, Google released Gemma 4, a family of four open source models built on Gemini 3 research. One number in the release notes stands out from every other benchmark.
On tau2-bench, the leading evaluation for real-world AI agent performance, Gemma 4's flagship 31B model scores 86.4%. Its predecessor, Gemma 3, scored 6.6% on the same test. That is a 13x improvement in a single generation.
I have shipped 109 production AI systems, and I do not throw around numbers like this lightly. A 13x jump in agentic capability in a single model generation is not normal. This one is worth paying attention to.
For anyone deciding right now whether to build AI agents on cloud APIs or self-hosted open source models, Gemma 4 just shifted that calculation.
Key Takeaways
- Google released Gemma 4 on April 2, 2026: four models (E2B, E4B, 26B, 31B) under Apache 2.0 with full commercial freedom
- The 31B model scored 86.4% on tau2-bench for agentic tasks, up from 6.6% in Gemma 3. That is a 13x single-generation improvement.
- Edge models (E2B, E4B) run on smartphones and Raspberry Pi and support text, image, audio, and video in under 8GB
- Apache 2.0 license has no monthly active user caps, unlike Meta's Llama 4 which restricts usage beyond 700 million users
- The 31B model ranks #3 among open models globally on LMArena and #27 overall including closed models like GPT-4o
- For businesses weighing self-hosted agents versus cloud APIs, Gemma 4 meaningfully changes the cost and privacy trade-offs
What Google Released: Four Models With One Clear Mission
Gemma 4 ships as four distinct variants, each targeting a different deployment context. The model family is worth two minutes of study, because the right choice depends heavily on where and how you plan to deploy.
| Model | Active Params | Total Params | Context Window | Modalities |
|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K tokens | Text, Image, Audio, Video |
| E4B | 4.5B | 8B | 128K tokens | Text, Image, Audio, Video |
| 26B A4B | 3.8B | 26B (MoE) | 256K tokens | Text, Image, Video |
| 31B Dense | 30.7B | 30.7B | 256K tokens | Text, Image, Video |
The naming deserves explanation. E2B and E4B are edge models. The "E" stands for effective parameters, not total. Google uses a technique called Per-Layer Embeddings (PLE) that injects a secondary residual signal into every decoder layer. This gives the E2B the representational depth of something much larger, while keeping the footprint under 1.5GB with quantization. It is not a stripped-down toy model. It is a small model that behaves like a bigger one.
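The PLE idea can be illustrated with a toy forward pass. This is a simplified sketch of the concept (a small learned embedding added to every layer as an extra residual signal), not Google's actual implementation; all names, dimensions, and the tanh layer are made up for illustration.

```python
import numpy as np

def ple_decoder_pass(hidden, layer_weights, per_layer_embeds):
    """Toy decoder stack: each layer adds its own small learned
    embedding as an extra residual signal alongside the usual
    residual connection."""
    for W, e in zip(layer_weights, per_layer_embeds):
        hidden = hidden + np.tanh(hidden @ W) + e  # residual + layer + PLE signal
    return hidden

rng = np.random.default_rng(0)
d, n_layers = 8, 4
hidden = rng.normal(size=(1, d))
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]
ple_embeds = [rng.normal(scale=0.01, size=(1, d)) for _ in range(n_layers)]

out = ple_decoder_pass(hidden, weights, ple_embeds)
print(out.shape)  # (1, 8)
```

The point of the sketch: the per-layer embeddings are tiny relative to the layer weights, but they give each layer its own learned signal, which is the intuition behind a small model punching above its parameter count.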
In practical terms: E2B runs at 7.6 tokens per second on a Raspberry Pi 5. On an Android phone with a neural processing unit, it hits 31 tokens per second. On a Qualcomm Dragonwing chip with full NPU acceleration, it reaches 3,700 tokens per second at prefill. That is real inference on consumer hardware, not a marketing benchmark run on a rack of A100s.
The 26B model uses a Mixture of Experts (MoE) architecture with 128 experts, but only 3.8 billion parameters are active per forward pass. Google's own data shows this achieves roughly 97% of the 31B Dense model's performance while running at a fraction of the compute cost. For businesses building agentic systems where inference costs accumulate, that trade-off matters.
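The active-versus-total distinction is easiest to see in code. Here is a toy top-k mixture-of-experts layer, a generic sketch of the routing idea rather than Gemma 4's actual architecture; the dimensions and the k=2 choice are arbitrary.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Toy MoE layer: a router scores all experts, only the top-k run,
    and their outputs are blended by softmax weight. Active compute
    scales with k, not with the total expert count."""
    logits = x @ router_w                      # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the selected experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

rng = np.random.default_rng(1)
d, n_experts = 16, 128
x = rng.normal(size=d)
router = rng.normal(size=(d, n_experts))
experts = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_experts)]

y = moe_forward(x, router, experts, k=2)
print(y.shape)  # (16,)
```

All 128 expert matrices exist in memory (the "total" parameters), but each token only pays for two of them (the "active" parameters), which is why the 26B can run at a fraction of the 31B's compute cost.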
All four models support 140+ languages. All ship under Apache 2.0. And all are available today on Hugging Face, Google AI Studio, Ollama, Kaggle, and LM Studio, with no Google account required for most access points.
The Number That Changed My Mind About Open Source Agents
I want to spend real time on tau2-bench because it is the benchmark that actually matters for the businesses I work with, and most coverage of Gemma 4 buries it underneath math scores and coding competitions.
Most AI benchmarks test knowledge (MMLU), math (AIME), or code generation (LiveCodeBench). These are useful proxies for general reasoning, but they do not tell you whether a model can complete a multi-step business task with tools. tau2-bench does. It simulates real tool calling and decision-making scenarios where an AI agent interacts with external systems, handles ambiguous instructions, and plans sequences of actions to reach a goal. This is what matters when you are deploying an agent to process invoices, route customer tickets, or manage an inventory system.
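To make "multi-step task with tools" concrete, here is a minimal, model-agnostic sketch of the loop this class of benchmark exercises. The scripted model and the lookup tool are stand-ins invented for illustration, not part of tau2-bench itself.

```python
def run_agent(model_step, tools, goal, max_steps=10):
    """Generic agent loop: the model proposes an action, the harness
    executes tools, and results feed back until a final answer."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model_step(history)
        if action["type"] == "final":
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    return None  # ran out of steps without finishing

def scripted_model(history):
    """Stand-in for a real model: look the invoice up, then answer."""
    if len(history) == 1:
        return {"type": "tool", "tool": "lookup", "args": {"key": "inv-42"}}
    return {"type": "final", "answer": "status: " + history[-1]["content"]}

tools = {"lookup": lambda key: "paid"}
answer = run_agent(scripted_model, tools, "Check invoice inv-42")
print(answer)  # status: paid
```

A benchmark like tau2-bench is essentially scoring how often the model in that loop chooses the right tool, with the right arguments, in the right order, across much longer and more ambiguous task chains.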
Gemma 3's score was 6.6%. Effectively: can sometimes string together two tool calls if the path is completely obvious. Gemma 4 31B scores 86.4%. Effectively: reliably completes complex, multi-step agentic tasks.
This is not an incremental improvement. It is a qualitative shift in what the model can do inside an agent system. Here is the full benchmark picture:
| Benchmark | Gemma 4 31B | Gemma 4 26B | Gemma 4 E4B | Gemma 3 27B | What It Measures |
|---|---|---|---|---|---|
| tau2-bench (agentic) | 86.4% | n/a | 57.5% | 6.6% | Real-world agent task performance |
| AIME 2026 (math) | 89.2% | 88.3% | 42.5% | 20.8% | Advanced mathematical reasoning |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | n/a | Code generation quality |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 42.4% | Expert-level knowledge |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 67.6% | General knowledge breadth |
| Codeforces ELO | 2150 | 1718 | 940 | 110 | Competitive programming |
The Codeforces ELO jump from 110 to 2150 in a single generation is almost hard to believe. For context, 2150 is roughly International Master level in competitive programming. For anyone building an AI agent that needs to write, review, or debug code as part of its workflow, this is a meaningful capability unlock.
What matters for production agent systems, though, is the combination of strong reasoning (AIME), reliable tool use (tau2-bench), and coding ability (LiveCodeBench) simultaneously. Most models have a dominant strength and clear weaknesses. Gemma 4 does not show that asymmetry. All three are strong at once, which is unusual and exactly what you need when building a general-purpose agent.
The E4B scoring 57.5% on tau2-bench also deserves attention. That is a model that fits on your laptop, costs nothing per token, and can handle more than half of typical agentic task scenarios. That was not true of any edge-class model before this release.
Under the Hood: Why This Generation Is Different
The benchmark improvements do not come from simply scaling parameters. Google made specific architectural choices that explain the behavior change.
Alternating Attention Layers: Standard transformers apply full global attention across the entire context at every layer. Gemma 4 alternates: most layers use local sliding window attention (512 token windows on edge models, 1024 on larger ones), while selected layers apply full global attention. This keeps the model computationally efficient on long inputs while still building cross-document understanding where it matters. It is also why the 256K context window on the 26B and 31B models actually performs well at long ranges rather than degrading as documents approach the limit.
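The difference between local and global attention comes down to the mask each layer applies. A minimal sketch, with the window size and sequence length chosen arbitrarily for illustration:

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask; with `window` set, each token also sees only
    the `window` most recent positions (local sliding-window attention)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    mask = j <= i                     # causal: no attending to the future
    if window is not None:
        mask &= (i - j) < window      # local: bounded lookback
    return mask

local = attention_mask(6, window=2)   # attended pairs grow linearly with length
glob = attention_mask(6)              # attended pairs grow quadratically
print(int(local.sum()), int(glob.sum()))  # 11 21
```

The gap between those two counts widens fast as sequences grow, which is why replacing most global layers with windowed ones makes 256K contexts computationally tractable.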
Dual RoPE Positioning: The 26B and 31B models use two rotary position embedding strategies simultaneously. Standard RoPE handles the local attention layers. Proportional RoPE handles the global layers. This prevents the quality degradation that typically hits models near their context limits, a common failure mode in long-document agent tasks like contract review or financial analysis.
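Google has not published the proportional variant in detail, but standard RoPE, which the local layers use, is well documented and easy to sketch: each (even, odd) pair of channels is rotated by an angle that grows with token position, so relative offsets show up in the dot products between rotated vectors.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Standard rotary position embedding: rotate each (even, odd)
    channel pair by an angle proportional to token position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2D rotation applied per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
print(np.allclose(rope(q, 0), q))  # True: position 0 is the identity rotation
```

Because each pair is a pure rotation, vector norms are preserved at every position; the failure mode near the context limit comes from the angles themselves, which is what the second positioning scheme on the global layers is meant to address.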
Configurable Thinking Mode: All four models include thinking mode, Google's implementation of chain-of-thought reasoning at inference time. You activate it by passing enable_thinking=True when building the prompt with the Hugging Face chat template. The model generates internal reasoning tokens before producing its final response, and you can expose those reasoning steps or suppress them in the output. For agent systems handling ambiguous or multi-part tasks, thinking mode materially improves planning quality.
Variable Density Vision Tokens: The vision encoder accepts configurable token budgets per image: 70, 140, 280, 560, or 1,120 tokens. Lower settings are fast and sufficient for captioning or classification. Higher settings enable OCR-quality document parsing. For agent systems processing invoices, product images, or screenshots as part of their workflow, this flexibility is genuinely useful and rare at the open source level.
The License Story Is More Important Than the Benchmarks
Before I get into when Gemma 4 makes sense for your business, the licensing situation needs more attention than it is getting in most coverage.
Gemma 4 ships under Apache 2.0. This is not a restricted license with a friendly-sounding name. Apache 2.0 means: use it for anything, modify it, build products on it, charge money for those products, and never pay Google anything. No usage caps. No field-of-use restrictions. No requirement to make your modifications public.
Now compare this to Llama 4, Meta's latest open model family released around the same time. Llama 4 ships under a community license that requires explicit written permission from Meta if your product reaches 700 million monthly active users. For most small and medium businesses, that threshold feels distant. But for anyone building AI infrastructure, a developer platform, or an agent-as-a-service product that could scale, it is a real commercial risk to bake into your foundation.
Gemma 4 has no such restriction. The business case for building on it is legally clean. You own your deployment.
The HuggingFace team called this out explicitly in their launch post, writing that Gemma 4 is "truly open with Apache 2 licenses" and noting that their pre-release testing left them "struggling to find good fine-tuning examples because they are so good out of the box." When the infrastructure team running the world's largest model hub leads with the license in their announcement, that tells you something about how much it matters to serious builders.
For the businesses I work with through AgenticMode AI, the license is often one of the first filter criteria after performance when selecting a model layer for a production agent system. Gemma 4 passes both filters cleanly.
What This Means If You Are Building AI Agents Right Now
Let me be direct about who this release actually changes things for.
Businesses running AI agents on cloud APIs today: Gemma 4 gives you a credible self-hosted option for the first time. The 31B model requires a single NVIDIA H100 80GB at full precision, or fits on a 24GB GPU with Q4 quantization. If you are spending thousands per month on OpenAI or Anthropic API costs, a one-time hardware purchase or dedicated cloud instance starts to look different.
Businesses planning their first agent deployment: Gemma 4 is now the default open source recommendation for anything requiring serious agentic capability. Before this release, I would typically steer most clients toward cloud APIs unless they had specific data privacy requirements. The open source alternatives simply could not match cloud model performance on complex agent tasks. That has changed.
Where Gemma 4 makes the strongest case:
Workflows with sensitive data where you cannot send documents, customer records, or proprietary business logic to a third-party API. Gemma 4 runs fully offline. Data never leaves your infrastructure. This is the argument I see most often in healthcare, legal, and financial services contexts.
High volume repeatable agent tasks where per-token costs accumulate. An invoice processing agent handling 10,000 documents per month has a very different economics conversation with a self-hosted model versus a cloud API charging by the token.
Edge and on-device applications where you need local AI without round-trip API latency. E2B and E4B are now the best options in their class, supporting text, image, audio, and video in under 8GB with embeddings.
Regulated industries where data residency requirements make cloud AI processing legally complicated. Healthcare organizations under HIPAA, financial firms under various compliance frameworks, and government clients frequently cannot use cloud-hosted AI for certain workflows.
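The high-volume case above is easiest to see with a back-of-envelope calculation. Every figure below is an illustrative assumption, not a quote from any provider or a measured price:

```python
# All figures are illustrative assumptions, not real prices.
docs_per_month = 10_000
tokens_per_doc = 6_000             # assumed prompt + response round-trip
cloud_price_per_1m = 5.00          # assumed blended cloud $/1M tokens
gpu_cost_per_month = 1_200.00      # assumed dedicated 24GB GPU instance

cloud_cost_per_doc = tokens_per_doc / 1_000_000 * cloud_price_per_1m
cloud_monthly = docs_per_month * cloud_cost_per_doc
break_even_docs = gpu_cost_per_month / cloud_cost_per_doc

print(f"cloud at 10k docs: ${cloud_monthly:,.0f}/month")
print(f"break-even volume: {break_even_docs:,.0f} docs/month")
```

Under these made-up numbers, self-hosting breaks even around 40,000 documents per month. The point is the structure of the calculation, not the specific figures: plug in your own volumes, token counts, and hardware quotes, and the crossover point falls out.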
Where you still want cloud APIs: Tasks requiring frontier reasoning capability where GPT-4o and Claude still lead. Irregular workloads where you do not want to manage infrastructure. Multimodal workflows requiring audio on larger models (Gemma 4 audio is limited to E2B and E4B).
I have written a detailed breakdown of this decision in When to Use AI Agents vs Automation if you want the full framework. For the past several years I have been deploying agents across healthcare, ecommerce, legal, and logistics contexts. Take a look at my case studies to see how these trade-off decisions play out in real deployments. The consistent pattern: privacy requirements and cost pressure are the two forces that push businesses toward self-hosted models, and Gemma 4 is the first open source option that does not require a meaningful capability compromise on agentic tasks in exchange.
If you are not sure whether your business needs AI agents at all, or whether simpler automation would get the job done, my AI Agent Readiness Assessment takes about 12 minutes and gives you a scored report across 8 dimensions. It is free.
How to Get Started With Gemma 4 Today
If you want to test Gemma 4 before committing to any infrastructure, Google AI Studio is the fastest path. Gemma 4 31B and 26B are available with no credit card required. You can run agentic tasks, test function calling, and try thinking mode within minutes.
For local deployment, Ollama is the easiest route:
ollama pull gemma4:31b
ollama run gemma4:31b
For the MoE model with lower active-parameter costs:
ollama pull gemma4:26b
ollama run gemma4:26b
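Once a model is pulled, Ollama also exposes a local REST API (by default on port 11434), which is how you would wire it into an agent system rather than using the interactive CLI. A minimal sketch; the model tag assumes the pull commands above succeeded:

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_payload(model, messages):
    """Build the JSON body Ollama's /api/chat endpoint expects."""
    return {"model": model, "messages": messages, "stream": False}

def chat(model, messages):
    """POST one chat turn to a locally running Ollama daemon."""
    data = json.dumps(build_chat_payload(model, messages)).encode()
    req = urllib.request.Request(
        OLLAMA_CHAT_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

payload = build_chat_payload(
    "gemma4:31b", [{"role": "user", "content": "Summarize this invoice."}]
)
print(payload["model"])  # gemma4:31b
# Calling chat(...) itself requires the Ollama daemon to be running locally.
```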
Minimum hardware requirements:
- E2B: 4GB RAM, runs on CPU (Raspberry Pi 5 supported)
- E4B: 8GB RAM, 12 to 16GB VRAM for GPU acceleration
- 26B A4B: 32GB Mac or 16 to 24GB VRAM with Q4 quantization
- 31B Dense: 48GB Mac or single H100 80GB (bfloat16), 24GB VRAM with Q4
For Python integration with thinking mode enabled:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-31b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "Plan the steps to reconcile this invoice batch."}
]

# Enable thinking mode: the model emits internal reasoning tokens
# before producing its final response
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True,
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
For agentic workflows, the 31B model supports native function calling following an OpenAI-compatible tool schema. If you are building the full agent architecture around it, the production AI agents guide covers memory, orchestration, and reliability patterns that apply regardless of which model you choose. Most frameworks that already work with OpenAI function calling (LangChain, LlamaIndex, AutoGen) will integrate with Gemma 4 31B with minimal changes to existing code.
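For reference, an OpenAI-style tool definition looks like the following. The outer schema shape is the standard one those frameworks expect; the function name and its fields are invented here for illustration:

```python
import json

# An OpenAI-compatible tool definition offered to the model at inference
# time. The wrapper format is standard; "get_invoice_status" and its
# parameters are hypothetical examples.
get_invoice_status = {
    "type": "function",
    "function": {
        "name": "get_invoice_status",
        "description": "Look up the payment status of an invoice by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {
                    "type": "string",
                    "description": "Internal invoice identifier",
                }
            },
            "required": ["invoice_id"],
        },
    },
}

print(json.dumps(get_invoice_status, indent=2))
```

When the model decides to call the tool, it emits the function name plus JSON arguments matching this parameter schema; your agent framework executes the call and feeds the result back as the next turn.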
If you want Google-managed infrastructure without local hardware, Vertex AI hosts Gemma 4 inside your GCP project. You get data privacy within your Google Cloud environment while Google handles availability and scaling.
The Real Question for Your Business
Gemma 4 is a technically impressive release and the Apache 2.0 license makes it commercially clean. But the real question is not whether this is a good model. It clearly is. The question is whether it changes what makes sense for your specific situation.
For most small and medium businesses starting fresh, the answer is still probably: begin with cloud APIs and migrate to self-hosted when cost or compliance creates enough pressure to justify the infrastructure work. Gemma 4 makes that future migration easier and the endpoint more capable, but the migration itself still requires real work.
For businesses already running significant AI agent workloads on cloud APIs and feeling the monthly cost, or for companies in regulated industries where cloud AI processing creates compliance risk, Gemma 4 31B is now a production-ready option that genuinely was not available four months ago.
If you want to figure out exactly where your business sits in this picture, my AI Agent Readiness Assessment scores you across 8 dimensions and gives you a personalized report in about 12 minutes.
For businesses that already know they need to build and want a clear implementation plan, get in touch and let us talk through the architecture decisions, including whether Gemma 4 makes sense for your use case.
Citation Capsule: Gemma 4's 31B model scores 86.4% on tau2-bench for agentic tasks, up from 6.6% in Gemma 3. HuggingFace Blog 2026. The 26B A4B achieves approximately 97% of 31B performance with only 3.8B active parameters. Google DeepMind 2026. On LMArena, Gemma 4 31B ranks #3 among open models globally and #27 overall including closed frontier models. LMArena 2026. The E2B edge model achieves 3,700 tokens per second prefill speed on a Qualcomm Dragonwing chip with NPU acceleration. Google Developers Blog 2026.
Frequently Asked Questions
What does E2B and E4B mean in Gemma 4?
The "E" stands for effective parameters, not total. E2B has 2.3 billion effective parameters but 5.1 billion total parameters including embeddings. Google uses a technique called Per-Layer Embeddings (PLE) that injects a residual signal into every decoder layer, giving the small model the representational depth of a much larger one. This allows E2B to run on a Raspberry Pi 5 or an Android phone while performing significantly above its weight class on benchmarks.
Is Gemma 4 truly open source?
Gemma 4 is released under Apache 2.0, one of the most permissive open source licenses available. You can use it commercially, modify it, build products on it, and charge for those products without paying Google anything and without restrictions on user counts. This is notably different from Meta's Llama 4, which uses a community license that requires explicit Meta approval for deployments beyond 700 million monthly active users.
What GPU do I need to run Gemma 4 31B?
For the full bfloat16 version, you need a single NVIDIA H100 80GB. With Q4 quantization, which maintains near-identical benchmark performance for most use cases, you can run it on a 24GB GPU or a 48GB Mac Studio. NVIDIA also offers NVFP4 quantized checkpoints specifically optimized for Blackwell and H100 hardware for even lower memory requirements.
What is tau2-bench and why does it matter for AI agents?
tau2-bench (Tool-Agent-User Interaction benchmark) measures how well an AI model performs on real-world agentic tasks: multi-step planning, tool calling, handling ambiguous instructions, and completing goals in external systems. Most AI benchmarks test knowledge or code generation in isolation. tau2-bench tests the behaviors that matter when you are building AI agents that interact with business systems. Gemma 4 31B's score of 86.4%, up from 6.6% in Gemma 3, represents the difference between a model that occasionally handles agent tasks and one that reliably handles them.
Does Gemma 4 support audio input?
Yes, but only the E2B and E4B edge models. They include a USM-style conformer audio encoder that handles automatic speech recognition and speech-to-translation for up to 30 seconds of audio input. The encoder is trained on speech only, not music. The 26B and 31B models do not include audio at this time, though they support text, images, and video up to 60 seconds.
How does Gemma 4 compare to GPT-4o or Claude?
On LMArena, the community-voted benchmark covering real-world use cases, Gemma 4 31B ranks #3 among open models and #27 overall including closed models like GPT-4o and Claude. Closed frontier models still lead on the absolute top of the reasoning distribution. But Gemma 4 31B is now close enough that for most business agent use cases, the capability difference is smaller than the practical advantages of self-hosting: data privacy, no per-token costs, and no external API dependencies.
Can Gemma 4 run on a phone or mobile device?
Yes. The E2B model runs on Android devices with AICore-enabled NPUs at 31 tokens per second decode speed. With 2-bit or 4-bit quantization, it fits under 1.5GB. Google has released an ML Kit Prompt API for integrating E2B and E4B into Android and iOS apps with tool calling and structured output support. On a Qualcomm Dragonwing chip with full NPU utilization, E2B reaches 3,700 tokens per second prefill speed.
Where can I try Gemma 4 for free?
Google AI Studio offers free access to Gemma 4 31B and 26B with no credit card required. Kaggle provides free notebooks with GPU access. All model weights are free to download from Hugging Face Hub at google/gemma-4-e2b-it, google/gemma-4-e4b-it, google/gemma-4-26b-a4b-it, and google/gemma-4-31b-it. Local deployment via Ollama or LM Studio is also free, limited only by your own hardware.
Related Posts

AI Agents Are Coming for Your SaaS Stack and VCs Are Betting Billions on It

AI Is Now As Good As Humans at Using Computers. Here Is What $297 Billion in Q1 Funding Says About What Comes Next.

n8n 2.0 AI Agents: The Workflow Architecture I Use Across Every Client Deployment

Jahanzaib Ahmed
AI Systems Engineer & Founder
AI Systems Engineer with 109 production systems shipped. I run AgenticMode AI (AI agents, RAG systems, voice AI) and ECOM PANDA (ecommerce agency, 4+ years). I build AI that works in the real world for businesses across home services, healthcare, ecommerce, SaaS, and real estate.