Google released Gemma 4 today. Four model sizes, multimodal across the board, and a license change that matters more than any benchmark number on the page.

The headline spec is a family of open-weight models built from the same research as Gemini 3. There is a 31B dense model, a 26B mixture-of-experts variant that activates only 4 billion parameters at inference time, and two edge-optimised models (E4B and E2B) small enough to run on a Raspberry Pi 5. The context windows stretch to 256K tokens on the larger models and 128K on the smaller ones. All four handle images and text natively. The edge models add audio input. The larger two process video.

None of that is the story.

The story is Apache 2.0.

Gemma 3 shipped under a custom license, Google's own "Gemma Terms of Use," which imposed restrictions that made legal teams nervous and hobbyists indifferent. It was open in the way that a restaurant with a dress code is open. You could walk in, but the terms reminded you this was someone else's house. Gemma 4 drops all of that. Apache 2.0 means no usage caps, no commercial restrictions, no "contact us if you exceed 700 million monthly active users" clause like Meta's Llama license carries. Fork it, ship it, sell it, modify it without asking. The freedom is unconditional.

This is Google choosing to compete on capability rather than control. And the capability argument is strong. The 31B dense model ranks third on the LMArena text leaderboard. The 26B MoE variant, running on just 4 billion active parameters, sits sixth, outperforming models with twenty times its effective compute budget. Google's own framing is "intelligence per parameter," and the numbers back it up. A model that small matching frontier-class open weights running at 100B+ parameters is not incremental progress.

The architecture has some genuinely interesting choices. Alternating attention layers split work between local sliding-window attention (512 or 1024 tokens depending on model size) and global full-context layers. Each attention type gets its own RoPE configuration: standard for local, proportional for global. A feature called Per-Layer Embeddings feeds a secondary signal into every decoder layer, combining token identity with contextual information, which seems to be how they squeeze so much quality out of fewer parameters. The shared KV cache reuses key-value tensors from earlier layers in later ones, cutting memory without obvious quality loss. It is a dense collection of efficiency tricks that compound.

The on-device numbers are where this gets practical. On a Raspberry Pi 5, the E2B model hits 133 tokens per second on prefill and 7.6 tokens per second on decode, using less than 1.5GB of memory with 2-bit quantization. Four thousand input tokens across two distinct skills process in under three seconds on a mobile GPU. These are not synthetic benchmarks designed to flatter a press release. Raspberry Pi inference is the kind of thing people will actually try within hours of a release, and if those numbers hold, this becomes the default local model for a lot of embedded and mobile work.
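The memory claim survives a back-of-envelope check. Assuming the "E2B" name means roughly 2 billion parameters (an assumption, not a published figure), the arithmetic looks like this:

```python
def quantized_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB for a quantized model."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# ~2B parameters at 2 bits each ≈ 0.5 GB of weights, leaving roughly
# 1 GB of headroom inside the reported <1.5 GB budget for the KV cache,
# activations, and runtime overhead.
print(quantized_weight_gb(2.0, 2))  # 0.5
```

Half a gigabyte of weights is what makes a 4GB Raspberry Pi a plausible deployment target at all; at 16-bit precision the same model would need 4GB for weights alone.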

I keep circling back to the agentic framing. Google is not positioning Gemma 4 as a chatbot engine. The marketing language says "purpose-built for advanced reasoning and agentic workflows," and the tooling reflects it: constrained decoding for structured outputs, multimodal function calling, GUI element detection, object detection and pointing. These are the primitives you need for an AI agent that can look at a screen, understand what it sees, decide what to do, and call the right function. The fact that it works offline, on a phone, without phoning home to a cloud endpoint, makes the agentic pitch credible in a way that server-dependent agents never quite were.
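Constrained decoding is the least flashy of those primitives but the one that makes structured agent outputs reliable. The core idea is simple: before sampling each token, mask out everything the output grammar forbids. Here is a toy sketch of that mechanism; the vocabulary, scores, and grammar state are invented for illustration and have nothing to do with Gemma 4's actual tokenizer or API.

```python
import math

def constrain(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the score of every grammar-forbidden token to -inf so it cannot win."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

# Toy state: the grammar says a JSON value must open with '{' or '['.
logits = {"{": 1.2, "[": 0.7, "hello": 2.5, ":": 0.1}
allowed_next = {"{", "["}   # in practice, produced by a grammar state machine

masked = constrain(logits, allowed_next)
best = max(masked, key=masked.get)
print(best)   # "{" — the highest-scoring token the grammar permits
```

Note that the raw model preferred "hello"; the constraint overrides it. That guarantee, valid JSON or a valid function call on every step, is what lets an agent loop on model output without a retry-and-parse dance.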

The ecosystem support at launch is unusually comprehensive. Day-one availability across Hugging Face Transformers, llama.cpp, MLX for Apple Silicon, Ollama, mistral.rs, ONNX, and browser-based inference through WebGPU via transformers.js. Google clearly pre-coordinated with the major frameworks. When I wrote about model discovery and pricing a couple of weeks ago, the friction was still in finding and deploying the right model. Gemma 4 arrives already integrated into every tool people actually use.

What Google is doing here has a clear strategic logic. The Gemini 3.1 Pro updates showed them closing the gap with Claude and GPT on their proprietary side. Now the open side gets a model built from the same research foundations, under the most permissive license in the major open-weight landscape. Meta's Llama has its commercial threshold. Mistral has been ambiguous about which models are truly open. Google just removed every legal obstacle at once.

The 140+ language support is quietly significant. Most open models optimise for English with a handful of other languages bolted on. Google's multilingual training infrastructure, built for Search over two decades, gives Gemma 4 a natural advantage here. For developers building products outside the English-speaking world, this might be the deciding factor regardless of benchmark position.

I'm less certain about the video capabilities in the larger models. Processing video natively is useful, but the context window arithmetic gets expensive fast. A few minutes of video at reasonable frame rates will consume a large fraction of that 256K window, leaving limited room for reasoning about what was seen. The image and audio capabilities feel more immediately practical, especially on the edge models where audio input enables real-time speech understanding directly on device.
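The arithmetic behind that concern is easy to sketch. Frame sampling rate and tokens per frame are assumptions here (1 frame per second and 256 tokens per frame are plausible but not published figures), and 256K is treated as a round 256,000 tokens:

```python
def video_tokens(minutes: float, fps: float, tokens_per_frame: int) -> int:
    """Rough token cost of a video clip under the assumed sampling scheme."""
    return int(minutes * 60 * fps * tokens_per_frame)

CONTEXT_WINDOW = 256_000   # treating "256K" as a round number

used = video_tokens(5, fps=1.0, tokens_per_frame=256)  # five minutes of video
print(used, f"({used / CONTEXT_WINDOW:.0%} of the window)")  # 76800 (30% of the window)
```

Five minutes at a modest one frame per second already burns about 30% of the window before the model has read a single word of instructions, and a higher sampling rate or longer clip scales that linearly.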

The competitive pressure this creates is substantial. Llama 4 from Meta is the obvious comparison, and Meta's response will need to address both the licensing gap and the efficiency gap. A 4B active parameter model matching 100B+ models on quality is the kind of result that forces everyone else to rethink their architecture, not just their marketing. Qwen, Phi, and the rest of the open-weight field now have a new bar to clear, set by a company with functionally unlimited compute and training data.

Whether Gemma 4 becomes the default open model depends on what happens in the next few weeks as developers actually stress-test these claims. Arena scores and launch-day benchmarks are one thing. Sustained performance across real workloads, fine-tuning stability, and the texture of outputs on tasks that benchmarks do not measure will determine if this is the model people reach for by default or just another strong option in an increasingly crowded field.

The Apache 2.0 move, though, is irreversible. Google cannot walk that back without destroying trust. And for every developer who avoided Gemma 3 because of licensing uncertainty, the door is now wide open.
