Google released Gemma 4 today. Four model sizes, multimodal
across the board, and a license change that matters more than
any benchmark number on the page.
The headline spec is a family of open-weight models built from
the same research as Gemini 3: a 31B dense model, a 26B
mixture-of-experts variant that activates only 4 billion
parameters at inference time, and two edge-optimised models
(E4B and E2B) small enough to run on a Raspberry Pi 5. The
context windows stretch to 256K tokens on the larger models
and 128K on the smaller ones. All four handle images and text
natively. The edge models add audio input. The larger two
process video.
None of that is the story.
The story is Apache 2.0.
Gemma 3 shipped under a custom license, Google's own "Gemma
Terms of Use," which imposed restrictions that made legal
teams nervous and hobbyists indifferent. It was open in the
way that a restaurant with a dress code is open. You could
walk in, but the terms reminded you this was someone else's
house. Gemma 4 drops all of that. Apache 2.0 means no usage
caps, no commercial restrictions, no "contact us if you
exceed 700 million monthly active users" clause like Meta's
Llama license carries. Fork it, ship it, sell it, modify it
without asking. The freedom is unconditional.
This is Google choosing to compete on capability rather than
control. And the capability argument is strong. The 31B dense
model ranks third on the Arena AI text leaderboard. The 26B
MoE variant, running on just 4 billion active parameters,
sits sixth, outperforming models with twenty times its
effective compute budget. Google's own framing is
"intelligence per parameter," and the numbers back it up. A
model that small matching frontier-class open weights running
at 100B+ parameters is not incremental progress.
The architecture has some genuinely interesting choices.
Alternating attention layers split work between local
sliding-window attention (512 or 1024 tokens depending on
model size) and global full-context layers. Each attention
type gets its own RoPE configuration: standard for local,
proportional for global. A feature called Per-Layer
Embeddings feeds a secondary signal into every decoder layer,
combining token identity with contextual information, which
seems to be how they squeeze so much quality out of fewer
parameters. The shared KV cache reuses key-value tensors from
earlier layers in later ones, cutting memory without obvious
quality loss. It is a dense collection of efficiency tricks
that compound.
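The alternating pattern is easier to see laid out as a layer map. A
minimal sketch follows; the local-to-global ratio, window size, and the
`build_layer_map` helper are my own illustrative assumptions, not
published Gemma 4 internals.

```python
# Sketch of an alternating-attention layer map: most layers use local
# sliding-window attention with standard RoPE, with a periodic global
# full-context layer using a scaled ("proportional") RoPE. The 5:1
# ratio and 1024-token window are assumptions for illustration.

def build_layer_map(n_layers: int, local_per_global: int = 5,
                    window: int = 1024) -> list[dict]:
    """Assign each decoder layer an attention type and RoPE config."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            # Global layer: attends over the full context
            layers.append({"idx": i, "attn": "global", "window": None,
                           "rope": "proportional"})
        else:
            # Local layer: attention restricted to a sliding window
            layers.append({"idx": i, "attn": "local", "window": window,
                           "rope": "standard"})
    return layers

layer_map = build_layer_map(12)
print([l["idx"] for l in layer_map if l["attn"] == "global"])  # [5, 11]
```

The appeal of the split is that only the sparse global layers pay the
quadratic full-context cost; the local layers keep both compute and KV
cache bounded by the window.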
The on-device numbers are where this gets practical. On a
Raspberry Pi 5, the E2B model hits 133 tokens per second on
prefill and 7.6 tokens per second on decode, using less than
1.5GB of memory with 2-bit quantization. Four thousand input
tokens across two distinct skills are processed in under three
seconds on a mobile GPU. These are not synthetic benchmarks
designed to flatter a press release. Raspberry Pi inference
is the kind of thing people will actually try within hours of
a release, and if those numbers hold, this becomes the
default local model for a lot of embedded and mobile work.
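The memory figure passes a quick sanity check. Reading "E2B" as roughly
2 billion parameters (my assumption, not a published count), the
arithmetic looks like this:

```python
# Back-of-envelope check on the quoted <1.5GB figure for E2B.
# The ~2B parameter count is an assumption based on the model name.
params = 2e9          # assumed effective parameter count
bits_per_weight = 2   # 2-bit quantization, per the quoted setup

weight_gb = params * bits_per_weight / 8 / 1e9
print(f"weights alone: {weight_gb:.2f} GB")  # weights alone: 0.50 GB
```

That leaves roughly a gigabyte of the budget for the KV cache,
activations, and runtime overhead, which makes the sub-1.5GB claim
plausible rather than surprising.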
I keep circling back to the agentic framing. Google is not
positioning Gemma 4 as a chatbot engine. The marketing
language says "purpose-built for advanced reasoning and
agentic workflows," and the tooling reflects it: constrained
decoding for structured outputs, multimodal function calling,
GUI element detection, object detection and pointing. These
are the primitives you need for an AI agent that can look at
a screen, understand what it sees, decide what to do, and
call the right function. The fact that it works offline, on a
phone, without phoning home to a cloud endpoint, makes the
agentic pitch credible in a way that server-dependent agents
never quite were.
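The loop those primitives imply is simple to sketch: the model emits a
structured function call under constrained decoding, and a runtime
validates and dispatches it. Every name below is hypothetical; the real
Gemma 4 tooling will have its own API.

```python
# Toy agent dispatch loop: parse a constrained-decoded JSON function
# call and route it to a registered tool. Tool names and the registry
# scheme are illustrative, not Gemma 4's actual interface.
import json

TOOLS = {}

def tool(fn):
    """Register a callable the model is allowed to invoke."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def click(x: int, y: int) -> str:
    # Stand-in for a GUI action, e.g. after on-screen element detection
    return f"clicked ({x}, {y})"

def dispatch(model_output: str) -> str:
    """Validate a model-emitted call against the registry and run it."""
    call = json.loads(model_output)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["args"])

print(dispatch('{"name": "click", "args": {"x": 120, "y": 48}}'))
# clicked (120, 48)
```

Constrained decoding is what makes this robust: if the model can only
emit tokens that form valid JSON matching a tool schema, the
`json.loads` call never sees malformed output.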
The ecosystem support at launch is unusually comprehensive.
Day-one availability across Hugging Face Transformers,
llama.cpp, MLX for Apple Silicon, Ollama, mistral.rs, ONNX,
and browser-based inference through WebGPU via
transformers.js. Google clearly pre-coordinated with the
major frameworks. When I wrote about model discovery and
pricing a couple of weeks ago, the friction was still in
finding and
deploying the right model. Gemma 4 arrives already integrated
into every tool people actually use.
What Google is doing here has a clear strategic logic. The
Gemini 3.1 Pro updates showed them closing the gap with
Claude and GPT on
their proprietary side. Now the open side gets a model built
from the same research foundations, under the most permissive
license in the major open-weight landscape. Meta's Llama has
its commercial threshold. Mistral has been ambiguous about
which models are truly open. Google just removed every legal
obstacle at once.
The 140+ language support is quietly significant. Most
open models optimise for English with a handful of other
languages bolted on. Google's multilingual training
infrastructure, built for Search over two decades, gives
Gemma 4 a natural advantage here. For developers building
products outside the English-speaking world, this might be
the deciding factor regardless of benchmark position.
I'm less certain about the video capabilities in the larger
models. Processing video natively is useful, but the context
window arithmetic gets expensive fast. A few minutes of video
at reasonable frame rates will consume a large fraction of
that 256K window, leaving limited room for reasoning about
what was seen. The image and audio capabilities feel more
immediately practical, especially on the edge models where
audio input enables real-time speech understanding directly
on device.
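The window arithmetic is worth making concrete. Sampling rate and
tokens-per-frame below are assumptions on my part; the actual video
tokenization is not public.

```python
# Rough cost of a five-minute clip against a 256K context window.
# 2 sampled frames/sec and 256 visual tokens/frame are assumptions.
fps = 2
tokens_per_frame = 256
minutes = 5

video_tokens = minutes * 60 * fps * tokens_per_frame
print(video_tokens)  # 153600
print(f"{video_tokens / 256_000:.0%} of the window")  # 60% of the window
```

Even under these modest assumptions, five minutes of video eats more
than half the window before any text, instructions, or reasoning tokens
enter the picture.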
The competitive pressure this creates is substantial. Llama
4 from Meta is the obvious comparison, and Meta's response
will need to address both the licensing gap and the
efficiency gap. A 4B active parameter model matching 100B+
models on quality is the kind of result that forces everyone
else to rethink their architecture, not just their marketing.
Qwen, Phi, and the rest of the open-weight field now have a
new bar to clear, set by a company with functionally
unlimited compute and training data.
Whether Gemma 4 becomes the default open model depends on
what happens in the next few weeks as developers actually
stress-test these claims. Arena scores and launch-day
benchmarks are one thing. Sustained performance across real
workloads, fine-tuning stability, and the texture of outputs
on tasks that benchmarks do not measure will determine if
this is the model people reach for by default or just
another strong option in an increasingly crowded field.
The Apache 2.0 move, though, is irreversible. Google cannot
walk that back without destroying trust. And for every
developer who avoided Gemma 3 because of licensing
uncertainty, the door is now wide open.