For a decade, the foundational assumption of AI hardware design has been that one chip — well-designed, sufficiently powerful, architecturally versatile — could handle both ends of the AI workload spectrum. Train the model on it. Serve the model on it. Optimize for both. Sell one product.
On April 22, 2026, at Google Cloud Next in Las Vegas, Google formally abandoned that assumption.
Introduced at the event, the eighth-generation TPU 8t ("Sunfish") and TPU 8i ("Zebrafish") are not variants of a common design. They are two architecturally distinct chips, built from the ground up for fundamentally different tasks — the 8t for large-scale model training, the 8i for low-latency inference and agentic workloads. It is the first time in the decade-long history of Google's Tensor Processing Unit program that a single generation has shipped as two truly separate silicon designs.
The decision is not a product marketing choice. It is a technical acknowledgment that training and inference have diverged so dramatically in their computational demands that optimizing for both on a single chip is no longer feasible without compromising performance on each. And it carries implications that extend well beyond Google's hardware roadmap — because it signals a shift in how the entire AI industry will think about silicon design going forward.
Why Training and Inference Are Fundamentally Different Problems
To understand why Google made this split, it helps to understand exactly how different training and inference are as computational workloads — not in kind, but in the specific hardware bottlenecks each one creates.
Training: A Compute-Bound Batch Problem
Training a large AI model is, at its core, a massive, parallelizable matrix multiplication problem. You pass batches of training data through the model, compute the difference between predicted and actual outputs (the loss), propagate that error backwards through the network (backpropagation), and update the model's weights accordingly. Repeat billions of times.
This process is compute-bound: the limiting factor is raw arithmetic throughput — how many floating-point multiply-accumulate operations per second the hardware can execute. Memory access patterns are relatively predictable and amenable to batching. The operation runs for days, weeks, or months on racks of chips in a single enormous parallel job. Latency within individual operations matters less than aggregate throughput across the entire job.
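A rough roofline sketch makes the contrast concrete. The layer shape below is an illustrative assumption, and the chip figures are simply the per-chip TPU 8t numbers quoted later in this article (12.6 PFLOPs peak, 6,528 GB/s of HBM bandwidth):

```python
# Back-of-envelope roofline check: why a batched training matmul is compute-bound.
# All shapes are illustrative assumptions, not Google-published model dimensions.

def arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte of memory traffic for one dense layer (forward pass)."""
    flops = 2 * batch_tokens * d_in * d_out          # multiply-accumulate = 2 ops
    bytes_moved = bytes_per_value * (
        batch_tokens * d_in      # read activations
        + d_in * d_out           # read weights
        + batch_tokens * d_out   # write outputs
    )
    return flops / bytes_moved

# Chip "ridge point" from the TPU 8t figures quoted in this article
# (12.6 FP4 PFLOPs peak, 6,528 GB/s HBM): intensity above this is compute-bound.
ridge_8t = 12.6e15 / 6_528e9    # ≈ 1,930 FLOPs/byte

layer = arithmetic_intensity(batch_tokens=32_768, d_in=8_192, d_out=8_192)
print(f"layer intensity ≈ {layer:,.0f} FLOPs/byte, ridge ≈ {ridge_8t:,.0f}")
# A large batched matmul lands well above the ridge point: the MXUs, not HBM, are the limit.
```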
The ideal training chip is therefore one that maximizes compute density, scales to enormous cluster sizes through high-bandwidth interconnects, and sustains high utilisation rates across long batch-processing jobs. Waste no arithmetic capacity. Move data predictably. Run for a long time without error.
Inference: A Latency-Bound, Memory-Hungry Streaming Problem
Inference — serving a trained model to users — is a different problem entirely.
When a user submits a prompt, the model generates a response token by token, one at a time, in a process called autoregressive decoding. Each new token requires reading the KV cache — the accumulated key-value states from all prior tokens in the conversation — from memory. As conversations grow longer, the KV cache grows larger. And every generated token requires a full pass through this accumulated state.
This process is memory-bandwidth-bound: the limiting factor is not compute throughput but the speed at which the chip can read model weights and KV cache from memory. For every token generated, the chip performs relatively few arithmetic operations but must fetch large quantities of data from memory. Underutilising the arithmetic units while the memory bus saturates is the defining inefficiency of inference workloads.
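A back-of-envelope ceiling shows how hard this limit bites. The model size, KV cache size, and bandwidth below are illustrative assumptions, not published figures for any particular chip or model:

```python
# Back-of-envelope decode ceiling: why token generation is memory-bandwidth-bound.
# Model size, KV cache size, and HBM bandwidth are illustrative assumptions.

def max_tokens_per_sec(active_params_gb, kv_cache_gb, hbm_bw_gbps):
    """Upper bound on decode speed if every token must stream weights + KV cache from HBM."""
    bytes_per_token_gb = active_params_gb + kv_cache_gb
    return hbm_bw_gbps / bytes_per_token_gb

# Assume ~70 GB of active weights (e.g. a quantised large model), a long-context
# KV cache of ~20 GB, and a current-generation HBM stack at ~8,000 GB/s.
ceiling = max_tokens_per_sec(active_params_gb=70, kv_cache_gb=20, hbm_bw_gbps=8_000)
print(f"≈ {ceiling:.0f} tokens/s per chip, regardless of how many PFLOPs sit idle")
```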
Simultaneously, inference runs millions of requests concurrently, each in a different stage of its response, with a different context length, demanding a different amount of KV cache memory. These are not uniform, predictable batch jobs. They are a high-velocity swarm of heterogeneous, latency-sensitive requests. The ideal inference chip maximises memory bandwidth, brings as much KV cache as possible on-chip (where access latency is an order of magnitude lower than in DRAM), and minimises chip-to-chip communication latency.
These two profiles — compute-bound batch vs. memory-bandwidth-bound streaming — are not merely different points on the same spectrum. They are different optimisation problems. And designing for both simultaneously means accepting significant compromises on each.
This is the insight that drove Google's decision. Reasoning agents that plan, execute, and learn within continuous feedback loops cannot operate at peak efficiency on hardware that was originally optimised for traditional training or transactional inference; their operational intensity is fundamentally distinct.
Meet the TPU 8t: Training at Unprecedented Scale
The TPU 8t ("Sunfish") is the logical endpoint of a decade of training-focused TPU optimisation, pushed to a scale that the AI industry has not seen before.
Core Architecture
The 8t is fabricated on TSMC's N3 process family with HBM3E memory — 216 GB per chip running at 6,528 GB/s — and delivers 12.6 FP4 PFLOPs of peak compute per chip.
At the pod level, the numbers grow dramatically. A single TPU 8t superpod scales to 9,600 chips, connected by a 3D Torus topology and holding 2 petabytes of shared HBM. Google claims 121 FP4 ExaFLOPs from a single superpod, approximately 2.8x the compute of an Ironwood superpod at the same price point. That 2.8x figure is the headline training improvement Google is using to characterise the generational step.
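Those pod-level figures are straightforward multiples of the per-chip numbers. A quick consistency check using only the values quoted above:

```python
# Quick arithmetic check of the pod-level figures quoted above (per-chip numbers × chip count).
chips = 9_600
pflops_fp4_per_chip = 12.6        # PFLOPs per chip
hbm_gb_per_chip = 216             # GB per chip

pod_exaflops = chips * pflops_fp4_per_chip / 1_000      # PFLOPs -> ExaFLOPs
pod_hbm_pb = chips * hbm_gb_per_chip / 1_000_000        # GB -> PB

print(f"{pod_exaflops:.0f} FP4 ExaFLOPs, {pod_hbm_pb:.2f} PB of HBM")   # ≈ 121 EF, ≈ 2.07 PB
```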
The 3D Torus connectivity doubles bidirectional scale-up bandwidth to 19.2 Tb/s per chip compared to the previous generation. Beyond the superpod, Google's new Virgo Network fabric connects up to 134,000 TPU 8t chips in a single non-blocking data center fabric delivering 47 PB/s of bisection bandwidth. Training jobs can scale beyond 1 million TPU chips across Virgo-connected deployments — a number that puts the addressable frontier model training problem in a different category than anything previously available.
Key Innovations
Native FP4 compute. The 8t introduces native FP4 arithmetic support, doubling MXU (Matrix Multiplication Unit) throughput at reduced precision compared to FP8. For frontier models trained in low-precision formats, this directly translates to more compute per watt and per dollar.
TPUDirect RDMA Storage. A new direct storage path bypasses the host CPU entirely, pulling data from Google's managed storage tier directly into HBM without CPU-mediated hops. Google reports storage access roughly ten times faster than on the previous generation: critical for long training runs, where data loading can become a wall-clock bottleneck.
SparseCore retention. The 8t retains Google's SparseCore units for handling the irregular memory access patterns typical of embedding lookups during training — particularly relevant for large Mixture-of-Experts (MoE) models where different experts are activated for different inputs.
HBM3E over HBM4. A deliberate choice worth noting: the 8t uses HBM3E rather than the newer HBM4, accepting 11.5% less bandwidth than the previous Ironwood generation in exchange for improved yield and lower cost per chip. This signals that Google is optimising for cost-effective scale-out rather than per-chip peak performance — rational for training workloads where aggregate cluster compute matters more than individual chip bandwidth.
Design Partner: Broadcom
The TPU 8t is designed by Broadcom, continuing the relationship that has defined TPU silicon development since 2015. Broadcom's expertise in high-speed interconnect and large-scale ASIC design makes it a natural fit for a chip whose defining characteristic is how it connects to thousands of peers.
Meet the TPU 8i: Inference Reimagined for the Agentic Era
If the TPU 8t is an evolutionary step, the TPU 8i ("Zebrafish") is the more architecturally radical chip. It carries a fundamentally different memory architecture, a different interconnect topology, and a different on-chip functional block — each chosen to address the specific bottlenecks of modern inference workloads.
Core Architecture
The 8i is fabricated on the same TSMC N3 process family and carries 288 GB of HBM3E at 8,601 GB/s per chip — more memory and higher bandwidth than the 8t. It delivers 10.1 FP4 PFLOPs of peak compute per chip — slightly less than the 8t, reflecting the prioritisation of memory over raw arithmetic.
The inference pod numbers are where the 8i's design philosophy becomes clear. A single 8i pod scales to 1,152 chips — versus 9,600 for the 8t — but delivers 9.8x the FP8 ExaFLOPs per pod (11.6 vs 1.2) compared to Ironwood, and 6.8x the HBM capacity per pod (331.8 TB vs 49.2 TB). The pod-size difference between 8t and 8i is not a limitation — it reflects the fundamentally different scaling logic of serving workloads, where the bottleneck is per-request latency rather than aggregate batch throughput.
The Three Decisive Architectural Changes
1. Tripled on-chip SRAM (Vmem): 384 MB. This is the most consequential change in the 8i's design. Where Ironwood carried approximately 128 MB of on-chip SRAM, the 8i carries 384 MB — triple the amount.
The reason this matters comes back to KV cache. During long-context decoding, every generated token requires reading accumulated key-value states from prior tokens. On most accelerators, that data comes from HBM. On-chip SRAM bandwidth is roughly an order of magnitude higher than HBM. The 8i is sized to hold meaningful KV cache footprints entirely on silicon — meaning that for many practical inference requests, the KV read that would saturate the HBM bus on a standard chip happens at SRAM speed instead. Every KV read served from SRAM rather than HBM means shorter per-token latency and higher tokens-per-second at the same power envelope.
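How much of a request's KV cache actually fits in 384 MB depends on the model. As a rough illustration, assume 80 layers, grouped-query attention with 8 KV heads of dimension 128, and an FP8 KV cache; none of these are published 8i or Gemini figures:

```python
# Rough KV-cache sizing: how many decoded tokens fit in 384 MB of on-chip SRAM.
# Layer count, KV-head count, head size, and precision are illustrative assumptions.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value   # 2 = keys + values

per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=1)
sram_bytes = 384 * 1024**2

print(f"{per_token / 1024:.0f} KiB per token -> "
      f"{sram_bytes // per_token:,} tokens of KV cache resident in SRAM")
# With grouped-query attention and an FP8 KV cache, a few thousand tokens of context
# (or the hot tail of many concurrent requests) can stay on-chip instead of in HBM.
```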
2. Boardfly topology. The 8i replaces the 3D Torus interconnect topology used in the 8t and previous TPU generations with a new topology Google calls Boardfly — designed specifically to reduce network diameter for MoE and reasoning workloads.
The insight behind this choice is fundamental: Google's default way of connecting chips favoured bandwidth over latency. A torus is good at moving large volumes of data, not at minimising round-trip time. For training, that is the right trade: what matters is the bandwidth to move large gradient updates around the cluster. For agentic inference, where a single reasoning step might activate many distributed experts and synchronise intermediate states across chips, the number of hops between any two chips in the pod sets the per-step latency. Boardfly reduces network diameter by roughly 56% compared to the 3D Torus, cutting the worst-case hop count and directly lowering the latency floor for inter-chip collective operations.
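Google has not published Boardfly's internal structure, so the most that can be sketched is what a 56% smaller diameter means in hop counts. The baseline below assumes a hypothetical 8×12×12 torus arrangement of a 1,152-chip pod:

```python
# What a 56% smaller network diameter means in hops.
# The 8×12×12 torus below is a hypothetical baseline for a 1,152-chip pod;
# Boardfly's actual structure is not described in public material.

def torus_diameter(dims):
    """Worst-case hop count between two nodes in a wrap-around torus."""
    return sum(d // 2 for d in dims)

baseline = torus_diameter((8, 12, 12))          # 4 + 6 + 6 = 16 hops
boardfly_claim = round(baseline * (1 - 0.56))   # apply the quoted 56% reduction

print(f"3D torus diameter: {baseline} hops -> claimed Boardfly diameter: ~{boardfly_claim} hops")
# Fewer worst-case hops lowers the latency floor on every cross-chip collective in a decode step.
```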
3. Collective Acceleration Engine (CAE). The 8i introduces a new on-die hardware block, the Collective Acceleration Engine, that offloads reduction and synchronisation operations during autoregressive decoding. During inference, the most latency-sensitive collective operation is the all-reduce that synchronises activations across chips after each transformer layer. On previous architectures, this synchronisation was handled by general-purpose compute units. The CAE performs it in dedicated silicon, reducing on-chip collective latency by up to 5×.
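The reason dedicated hardware helps is that collective latency compounds: one all-reduce per layer, for every generated token. With illustrative (not measured) per-collective latencies, the arithmetic looks like this:

```python
# Why collective latency compounds during decoding: one all-reduce per layer, per token.
# The per-collective latencies below are illustrative assumptions, not measured figures.

def collective_floor_ms(n_layers, allreduce_us, tokens):
    """Latency contributed by layer-wise all-reduces alone, ignoring compute and memory."""
    return n_layers * allreduce_us * tokens / 1_000

layers, tokens = 80, 1_000                       # assume an 80-layer model and a 1,000-token reply
baseline = collective_floor_ms(layers, allreduce_us=10, tokens=tokens)   # 10 µs per all-reduce
with_cae = collective_floor_ms(layers, allreduce_us=2, tokens=tokens)    # 5× faster in dedicated silicon

print(f"collectives alone: {baseline:.0f} ms baseline vs {with_cae:.0f} ms with a 5× reduction")
# 80 layers × 10 µs × 1,000 tokens = 0.8 s of pure synchronisation; a 5× cut recovers ~0.64 s per reply.
```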
Combined with the tripled SRAM and Boardfly topology, Google claims 80% better performance per dollar over Ironwood for large MoE models at low-latency targets.
Design Partner: MediaTek
The TPU 8i is designed by MediaTek — a significant strategic development that the industry has not fully absorbed. MediaTek's core expertise is in mobile and edge silicon, where low power, high volume, and near-zero latency are the defining constraints. Bringing mobile-edge efficiency logic to the data centre is an explicit design philosophy: applying the principles that make smartphone AI chips power-efficient to the challenge of running millions of inference requests cost-effectively at cloud scale.
This choice also represents a diversification of Google's silicon supply chain. Broadcom had held exclusive TPU design responsibility since 2015. MediaTek's entry into the programme ends that exclusivity and introduces a second architectural perspective, one optimised for the latency and efficiency characteristics of serving rather than training.
Head-to-Head: TPU 8t vs. TPU 8i
| Specification | TPU 8t ("Sunfish") | TPU 8i ("Zebrafish") |
|---|---|---|
| Design partner | Broadcom | MediaTek |
| Primary workload | Large-scale model training | Low-latency inference & agentic AI |
| Process node | TSMC N3 | TSMC N3 |
| Memory | 216 GB HBM3E @ 6,528 GB/s | 288 GB HBM3E @ 8,601 GB/s |
| Peak compute | 12.6 FP4 PFLOPs | 10.1 FP4 PFLOPs |
| On-chip SRAM | ~128 MB (Ironwood-class) | 384 MB (3× Ironwood) |
| Pod size | 9,600 chips | 1,152 chips |
| Pod HBM | 2 PB | 331.8 TB |
| Pod compute | 121 FP4 ExaFLOPs | 11.6 FP8 ExaFLOPs |
| Interconnect topology | 3D Torus | Boardfly (56% lower network diameter) |
| Scale-up bandwidth | 19.2 Tb/s bidirectional | 19.2 Tb/s bidirectional |
| Scale-out fabric | Virgo (134,000+ chips) | Virgo (134,000+ chips) |
| Special on-die units | SparseCore (MoE training) | CAE (5× lower collective latency) |
| Storage access | TPUDirect RDMA (10× faster) | — |
| Performance gain vs Ironwood | 2.8× training price-performance | 80% better inference price-performance |
| Max cluster scale | 1 million+ chips | Virgo fabric |
| Availability | Later in 2026 | Later in 2026 |
The Broader Signal: Why This Split Matters for the Entire AI Industry
Google's decision to bifurcate its TPU roadmap is not an isolated product strategy. It reflects a structural reality about how AI workloads are evolving in 2026 — and it sends a signal that the rest of the industry will need to respond to.
General-Purpose AI Silicon Is a Fading Category
Google's bifurcation of the TPU v8 line suggests a strategic bet that the general-purpose AI chip is a fading category, replaced by a world where training is a massive batch process and inference is a high-velocity swarm activity.
The GPU — NVIDIA's H100, H200, and GB200 — succeeded precisely because it was general-purpose enough to serve both training and inference adequately. As AI workloads scale and specialize, "adequate for both" increasingly means "optimal for neither." The 8t vs. 8i split is the clearest institutional endorsement yet of the view that the tradeoffs have become too severe to continue designing a single chip to serve both.
Meta has revealed four new MTIA chips built for AI inference on a six-month cadence. Qualcomm is developing a discrete NPU with custom 3D DRAM specifically optimised for mobile inference. Amazon's Trainium (training) and Inferentia (inference) chips have maintained separate roadmaps for years. The pattern is consistent: every major consumer of AI compute at scale is independently arriving at the same conclusion — the workloads are too different to share a single architecture.
The Agentic Era Is the Forcing Function
The specific timing of Google's split is not accidental. The rise of AI agents — systems that reason through problems, execute multi-step workflows, and learn from their own actions in continuous loops — is the forcing function.
Agentic workloads impose demands on inference infrastructure that transactional inference does not. A single-turn chatbot response is bounded in context and predictable in structure. An agent that decomposes a task into subtasks, delegates them to sub-agents, aggregates results, and revises its plan based on outcomes is an extended, multi-step, high-memory, latency-sensitive operation. At scale — millions of agents running concurrently — the combination of long contexts, irregular activation patterns across MoE expert layers, and tight latency requirements creates a workload profile that no training-optimised chip handles efficiently.
The 8i's design responds to this specifically: the Boardfly topology reduces inter-chip synchronisation latency for the fast collective operations that agentic reasoning requires. The CAE hardware block accelerates the reductions that synchronise distributed agent state. The tripled SRAM holds more of each agent's context on-chip, avoiding the memory bandwidth saturation that makes long-context agentic inference slow on current hardware.
The NVIDIA Competition
Google is not competing with NVIDIA on raw per-chip performance — and notably, Google is not even comparing the performance of its new chips with those from the AI chip leader. The strategy is different: vertical integration from chip to model to serving infrastructure, with cost-per-token as the competition metric rather than peak FLOP/s.
NVIDIA's upcoming Vera Rubin ships in H2 2026 with 50 PFLOPS of NVFP4 inference per package, 288 GB of HBM4 at 22 TB/s, and NVLink 6 at 3.6 TB/s bidirectional. Rubin Ultra in H2 2027 targets 100 PFLOPS FP4 per package. Kyber NVL576 will bind 576 Rubin Ultra GPUs in a single rack at 15 EFLOPS FP4 inference. The per-chip comparison between 8t and NVIDIA's current best is competitive. The scale-up comparison is not close: a GB300 NVL72 rack carries 72 GPUs in a single NVLink domain. A TPU 8t superpod comprises 9,600 chips in a single 3D torus — 133 times more chips in a single collective domain, a gap that separates the platforms definitively for frontier training jobs.
NVIDIA retains an ecosystem advantage that Google cannot quickly close. TPUs remain primarily deployed within Google's infrastructure and offered through Google Cloud. Unlike NVIDIA's broad ecosystem spanning enterprise, cloud, workstation, and edge deployments, TPUs serve a narrower market. That focused approach enables the workload-specific optimisation that makes 8t and 8i architecturally compelling — but limits the addressable market to Google Cloud customers and a handful of strategic partners.
Notable among those partners: Anthropic has committed to using multiple gigawatts' worth of Google TPUs. Meta is reportedly in discussions for a multi-billion-dollar deal to deploy Google TPUs starting in 2027. Citadel Securities and all 17 U.S. Energy Department national laboratories are operational TPU customers. The commercial case is strengthening.
What This Means for Enterprise AI Buyers in 2026
The TPU 8t and 8i are not yet available — both are scheduled for general availability "later in 2026." For enterprise teams evaluating AI infrastructure, the announcement reframes the procurement conversation in concrete ways.
Teams training large proprietary models should track 8t availability windows, Virgo networking access, and goodput SLAs. The 2.8x training price-performance claim, if validated by independent benchmarks, makes the 8t a genuinely competitive option against NVIDIA for teams running frontier training jobs on Google Cloud — particularly at the scale where 3D Torus cluster dynamics favour TPU architecture.
Teams serving agents or reasoning workloads should evaluate 8i availability on Vertex AI and monitor independent latency benchmarks as they emerge from early access customers. The 80% inference price-performance improvement over Ironwood, if it holds for their specific model architectures and context length distributions, is a meaningful operational cost reduction.
Teams consuming Gemini through Gemini Enterprise will inherit the 8i uplift automatically. As Google's own Gemini models migrate to 8i infrastructure, the ceiling on what enterprise Gemini customers can deploy in production — in terms of context length, response speed, and concurrent agent capacity — rises meaningfully through 2026.
The key caveat for all buyers: Google's benchmarks are self-reported. Independent numbers from early cloud customers and third-party evaluators will emerge over the next two quarters. The architectural decisions in both chips are sound and the reasoning behind them is well-documented — but procurement decisions of this scale should wait for validated third-party performance data before committing.
Conclusion: The Chip Is Always a Theory of What AI Will Become
Every chip design is a theory about what the most important workload will be when the chip ships. The TPU 8t is a theory that frontier model training will continue scaling, that the bottleneck will remain at the scale of collective communication across enormous chip clusters, and that the teams who can run larger training jobs more cheaply will build better models.
The TPU 8i is a theory that serving AI at the scale of millions of concurrent agents is the defining operational challenge of the next three years, that the memory bandwidth and collective latency constraints of current inference hardware are the primary bottleneck limiting what AI products can deliver, and that a chip purpose-built for those constraints will unlock a generation of AI applications that current hardware simply cannot run at commercially viable cost.
Both theories are well-grounded. Both will be tested by deployment at scale. And the decision to pursue them as separate chips rather than a single compromised platform is itself a statement: that the age of the general-purpose AI accelerator is over, and the age of purpose-built AI silicon has begun.
For the AI hardware industry, that is the most significant thing Google announced at Cloud Next 2026. Not a benchmark number. Not a price point. A philosophy.
Quick Reference: Google TPU 8th Generation at a Glance
| | TPU 8t | TPU 8i |
|---|---|---|
| Codename | Sunfish | Zebrafish |
| Built for | Training | Inference & agentic AI |
| Design partner | Broadcom | MediaTek |
| SRAM (on-chip) | ~128 MB | 384 MB (3× Ironwood) |
| Pod compute | 121 FP4 ExaFLOPs | 11.6 FP8 ExaFLOPs |
| Key topology | 3D Torus | Boardfly |
| Special unit | SparseCore | CAE (Collective Accel. Engine) |
| Max cluster | 1M+ chips (Virgo) | Virgo fabric |
| Perf. vs Ironwood | 2.8× training price-perf. | 80% better inference price-perf. |
| Availability | Later 2026 | Later 2026 |
| Announced | Google Cloud Next, April 22, 2026 | Google Cloud Next, April 22, 2026 |
