Google's Gemini 3.5 Flash costs 3x the model it replaced, and the era of cheap AI is ending - XDA - News Bunkers

For most of the past couple years, AI pricing has been in flux, seemingly getting cheaper or only slightly increasing. Every new model from a major lab seemed to arrive as something faster, smarter, and sometimes even cheaper than the one before it, and the cheapest tier of all was the one you reached for when you wanted good-enough intelligence at throwaway prices. Google calls its version Flash. OpenAI has its mini and nano models, Anthropic has Haiku, and the assumption underneath all of them was simple. Next year’s budget model would cost about the same as the previous year, or it might even be a little bit cheaper.
Google’s Gemini 3.5 Flash has shattered that assumption; released on May 19, it costs $1.50 per million input tokens and $9 per million output tokens. The model it effectively replaces, Gemini 3 Flash, ran at $0.50 and $3, making for a three-times increase on the price for the tier that’s supposed to be the cheap one, and it’s the clearest sign yet that the era of subsidized AI is coming to an end.
There’s a right reading of all of this and a wrong one; AI isn’t getting more expensive to run, at least not in the way the raw numbers make it look. The cost of a given level of intelligence is still falling, and falling pretty quickly. What has changed is that the major labs have stopped handing those savings back to customers, and they’ve started quietly testing how much people will actually pay. DeepSeek is doing the exact opposite, and it’s winning customers because of it.
Google’s Flash price creep didn’t start with this release, and Gemini 2.5 Flash arrived in June 2025 at $0.30 per million input tokens and $2.50 for output. Gemini 3 Flash bumped that to $0.50 and $3. Now 3.5 Flash sits at $1.50 and $9, which works out to five times the input price of the 2.5 model from less than a year earlier: the line has only been going up.
For Gemini 3.1 Pro, Google charges $2/M input and $12/M output for prompts up to 200K tokens, rising to $4/M and $18/M above that. That still leaves 3.5 Flash unusually close to Pro pricing for shorter prompts, especially given that the gap between the cheap model and the expensive one used to be the entire reason Flash existed. It was the model you pointed at high-volume, low-stakes work; tasks like basic summarization, tagging, or classification are exactly what those Flash models are for. At $1.50 and $9, the gap has narrowed to the point where choosing between Flash and Pro is no longer obvious on price alone.
Rates, though, are only part of the equation. Artificial Analysis, an independent evaluator, ran its full benchmark suite across both models and found that 3.5 Flash cost roughly 5.5 times more to run than the previous Flash, because it both charges more per token and generates more of them on the multi-step agentic work it’s built for. Artificial Analysis reported that running their suite on 3.5 Flash cost about $1,550, while running it on the higher-tier 3.1 Pro came in at around $890. The cheap model was the expensive one to actually use.
The way Flash gets used now makes that jump a lot more painful than the rate card suggests. A modern agent doesn’t send one prompt and read one reply. It loops, calls tools, re-reads its own context, and fires off dozens of requests to finish a single task, so a three-times increase in the base rate compounds across every one of them.
Google hasn’t really hidden any of this, but the company hasn’t gone out of its way to draw attention to it either. Sundar Pichai described 3.5 Flash as delivering frontier-level capability at less than half the price, which is true on the surface. With that framing, he was comparing it to rival frontier models from OpenAI and Anthropic, not to Google’s own previous Flash. In that comparison, it’s a sizable increase, but Google’s framing is doing a lot of work to keep that from being the headline.
If this were a one-off, it would be easy to write off as Google repositioning a single product, but unfortunately, it happened across the industry in the same week. OpenAI’s GPT-5.5 doubled GPT-5.4’s standard per-token pricing: from $1.25/$7.50 to $2.50/$15 for short context, and from $2.50/$11.25 to $5/$22.50 for long context.
Anthropic took a quieter route with Claude Opus 4.7. It kept the headline rate the same as Opus 4.6, then changed the tokenizer, meaning the same input can break into more billable tokens. Anthropic said the increase is roughly 1.0 to 1.35x depending on content, while Simon Willison found a 1.46x increase on the Opus 4.7 system prompt, making the model effectively more expensive for some text workloads despite the unchanged sticker price. GitHub, meanwhile, moved Copilot toward token-based billing. Three companies moved in one direction, though they did it with three different mechanisms.
On the outside, it looks like all three companies are starting to probe the price tolerance of their API customers. It’s unlikely to be collusion in that sense, but each company is reading the same market, seeing pressure from investors, and taking action to extract more revenue streams from customers. Much of AI still isn’t profitable, so it makes sense… to a degree.
What’s especially interesting is that these models are the workhorse models, specifically the ones that people wire into production and run real workloads on. Raising prices on a research preview affects a handful of early adopters who knew they were paying for the bleeding edge. Raising them on Flash, GPT-5.5, and Opus hits everyone.
The reason any of this is happening comes down to economics that were never sustainable, as running these models at the prices customers got used to has been a money-losing exercise for the companies offering them. OpenAI reportedly pulled in around $3.7 billion in 2025 and lost roughly $5 billion doing it, which works out to spending about $1.35 for every dollar it took in. Its cumulative cash burn is projected to reach something like $115 billion by 2029. There’s one apparent exception, though: Anthropic, with caveats.
Anthropic is expected to turn a profit in the second quarter of 2026, with $559 million from $10.9 billion in revenue, after losing money the quarter before. If it holds up to scrutiny, it would make Anthropic the first major lab to show working economics, but the problem is what sits underneath the number. One reason to be skeptical is timing: Ed Zitron has pointed to SpaceX’s filing language indicating Anthropic’s Colossus compute payments ramp in May and June at a reduced fee, before rising to $1.25 billion a month. If that reduced-fee ramp is included in Q2 costs, the quarter could look unusually profitable without proving that Anthropic’s underlying serving economics have changed.
Plus, Anthropic isn’t public and isn’t held to standard accounting rules, and it’s also convenient that it just happened to surface right as it was raising a new round and Nvidia was posting earnings. Anthropic itself said that it might not stay profitable for the full year once spending picks back up, so when you take out the one-off, Anthropic doesn’t appear to be so different from anyone else. It’s a real profit on paper even if it’s just a quirk of timing, but don’t expect it to last.
The choice to lose money on revenue generated has been a deliberate choice by all of the labs, rather than an accident. For years the goal has been growth, and investors rewarded user numbers and adoption over profit, the same as many other start-ups plagued with the same issues over the past decade and longer. Cheap or free access was the price of grabbing market share before a competitor could, and the bet was that the economics could be sorted out later once enough people were locked in and dependent. Every lab made some version of that wager, and Russ Hanneman of Silicon Valley put it best:
“It’s not about how much you earn, it’s about what you’re worth. And who’s worth the most? Companies that lose money.”
Later has, unfortunately, arrived. Wall Street has shifted from rewarding growth to demanding revenue, and the easiest lever any of these companies can pull is the one customers feel most directly, even if it isn’t the case that these models are getting more expensive to run. The cost of a fixed level of capability has kept falling, and Epoch AI-linked work has estimated that the price of reaching a given benchmark capability level has fallen roughly 5 to 10 times per year, depending on the benchmark and methodology. The labs could keep passing those gains along, and for a while they did, but they’re choosing not to anymore.
AI isn’t getting more expensive to run: it’s that the companies running it have stopped sharing the savings and have started charging a premium for their best. The low rates that made a product viable were never really the lab’s price. They were a venture-funded discount designed to acquire you, and now it’s being withdrawn.
It isn’t only developers feeling the squeeze. The same logic applies to the subscriptions most people actually pay for, like the twenty-dollar-a-month plans for ChatGPT Plus and Claude Pro that have been sold below cost since the day they launched. The clearest sign of the turn is at the top of the menu. OpenAI added a $200-a-month ChatGPT Pro tier at the end of 2024, and Anthropic followed in April 2025 with Claude Max at $100 and $200, sold as five and twenty times the usage of the standard plan. A year before that, the flat twenty-dollar plan was the entire consumer offering. Sam Altman admitted within days of launching Pro that OpenAI was losing money on it because subscribers used it far more than expected, calling it an “insane thing,” and floated usage-based pricing as the fix. Quite a surprising thing to concede about the most expensive product you sell.
Anthropic has been the most active at working the other side of the deal, tightening what the flat fee actually buys. In August 2025 it added weekly rate limits to Pro and Max, capping the heaviest users, the ones running Claude Code more or less around the clock, and offering to sell them extra capacity at full API rates once they hit the wall. The all-you-can-eat plan was being put on a meter, and the months since have been a steady run of that meter getting tighter.
Some of it barely showed. At some point, Anthropic gave Claude Code a one-hour prompt cache, which lets a session reuse its context cheaply instead of re-uploading it every turn, then shortened that window to five minutes weeks later without telling anyone. After that, any pause longer than five minutes expired the cached context, so the next message rewrote it at full rate instead of reading it back at a tenth of the price. Subscribers who had never come near their limits watched their quotas drain, with cache costs up 20 to 30%. Anthropic said it shouldn’t cost more, but users logged their usage and could see when it did.
Anthropic got clumsy, too. In April 2026, Anthropic decided that third-party agent harnesses, the open-source OpenClaw and Nous Research’s Hermes among them, could no longer run on Pro and Max quotas and would be routed to pay-as-you-go API billing instead. I get why, as these subscriptions are heavily subsidized and aimed at humans on a keyboard. Still, the methods were… pretty terrible. Claude Code included recent git/repository context in its prompt, and Anthropic’s third-party harness detection appears to have scanned that prompt for strings such as OpenClaw or Hermes. By late April, users found that simply having a file named HERMES.md in their Git history was enough to trip the detector and knock a session off the subscription mid-task onto API rates. One user was charged more than $200 in overages with 86% of their prepaid plan still unused, then told by support it was an unrefundable technical error. Anthropic eventually reversed and paid it back.
All of this came in the same stretch of time that users noticed Claude Code had dropped off the list of features included in the $20 Pro plan. Anthropic called that one a small A/B test on a sliver of new signups and restored it inside a day once the backlash hit, promising any real change would reach existing subscribers from the company rather than a screenshot on X. The company then later announced a restriction on programmatic usage, namely the Agent SDK and claude -p usage, where users can claim extra usage credits matching their subscription cost that will allow for those tools to be used. All of this taking place in just a few months reads like a company desperate to cut its costs as much as possible in a way that angers the fewest users.
OpenAI has been pulling the same levers with a lighter touch. It’s steadily tightened ChatGPT Plus through 2026, trimming message caps, paring back how much of its Codex coding tool the $20 plan includes, and steering anyone whose usage starts to look serious toward a higher tier or metered credits. The difference is mostly one of finesse. OpenAI has avoided the self-inflicted billing fiascos, but the direction is identical, and the old sense that a Plus subscription covered almost anything you cared to throw at it is going away.
It isn’t only the American labs, either. When Claude’s limits got too unpredictable, I looked at the alternatives on the market and tried Z.ai’s GLM coding plan, one of the wave of inexpensive Chinese subscriptions pitched straight at people priced out of Claude Code, but it banned my account. The notice told me my usage pattern did not comply with its fair use policy, which was a surprise, because I had run a single test task and one small tweak, around 7.4 million tokens across a week, nowhere near the quota. Z.ai had just rolled out its own version of the OpenClaw crackdown, restricting its coding plans to officially supported tools and blaming a surge of malicious users for the strain on its servers, and its automated risk control swept me up with no warning and no obvious way back. I was left with a plan I had paid for and barely used, and I was eventually unbanned nearly a week later.
The pattern holds across the entire AI space right now. The cheap plan is the one with the most to lose when people actually lean on it, so it ends up being the one most watched, and it doesn’t seem to matter whether it comes from San Francisco or Beijing. However, that makes the one company spending 2026 doing the exact opposite such an outlier.
While the Western labs nudge prices up, DeepSeek has spent 2026 doing the reverse. On May 22, the Chinese lab made its promotional 75% cut to its flagship V4-Pro permanent, pricing it at roughly $0.44 per million input tokens and $0.87 for output. Its lighter V4-Flash model goes lower still, at $0.14 and $0.28. Those numbers undercut GPT-5, Opus 4.7, and Gemini Flash by a wide margin. The company didn’t state why it did it, but analysts have stated that they primarily believe the company is passing on efficiency gains to consumers.
DeepSeek’s V4 paper gives everyone a deep level of insight into its newest models, and it describes a hybrid sparse-attention design that, at a million tokens of context, runs on roughly 27% of the per-token compute and 10% of the memory of its predecessor. One part of the architecture compresses the model’s memory of the conversation aggressively, another picks out only the most relevant chunks to read in full, and the combination of both is what makes very long context cheap to serve. The price drop is built into how the model works.
A subsidized price can be undercut by a rival with deeper pockets, and it goes away when the funding behind it does. A structural efficiency advantage is a different thing entirely. You can’t match the pricing of a model with a structural advantage forever without building the same architecture, meaning DeepSeek can hold these prices steady while competitors can’t.
It isn’t a cheap-but-dumb tradeoff either. V4-Pro is commonly placed as the second-strongest open-weight reasoning model anywhere, behind only Kimi K2.6. That’s frontier-adjacent quality made available at a fraction of frontier prices, and DeepSeek’s share of developer traffic on routing platforms like OpenRouter has been climbing sharply as a result.
To be clear, DeepSeek made this move with a funding round on the horizon, so the timing wasn’t purely generosity, and a well-judged price war is a useful way to look dominant ahead of a raise. The efficiency is real all the same, and while one set of companies is raising prices because the subsidies it relied on stopped making sense, another is cutting them because it found a cheaper way to run the same workload. They can’t both define where the market settles, and right now the cheaper one is gaining ground.
There’s a split forming, where frontier-grade intelligence is being sold as a premium product aimed at the small amount of work that requires it. Everything else, like summarizing, drafting, extraction, and grunt development work gets served by cheaper models. Many models capable of filling that gap are open-weight models, and many of them are coming out of China. Gemini Flash getting pricier and DeepSeek getting cheaper in the same month are two halves of that same shift.
The cheap AI everyone got comfortable with isn’t going away. It’s just going to come from a different set of companies than the ones that got us all hooked in the first place.
We want to hear from you. Share your perspective in the comments below, and please keep the conversation respectful.
Your comment has not been saved
This space is open for discussion.
Be the first to share your thoughts.

source

Google's Gemini 3.5 Flash costs 3x the model it replaced, and the era of cheap AI is ending – XDA

Leave a Reply Cancel Reply