Your AI Got Worse Last Month

A couple of years ago a group of researchers at Stanford and Berkeley ran a simple experiment. They took the March 2023 version of GPT-4 and the June 2023 version and gave both the same questions. Same model, three months apart. On identifying whether a number was prime, GPT-4’s accuracy fell from 97.6 percent to 2.4 percent. On generating code that actually ran for a set of easy programming problems, the share that executed dropped from 52 percent to 10 percent.
None of it showed up in a changelog. The model just changed underneath the people building on top of it, and most of them found out the way everyone finds out, after something stopped working.
That experiment is old now, and the specific numbers have been argued over, but the mechanism it exposed hasn’t gone away. When you ship a feature on top of somebody else’s model, you don’t control the model. It updates on their schedule, for their reasons, and your outputs move with it. Sometimes for the better, sometimes not.
The gap where the damage happens
Across the teams I work with, the shape of it is consistent. A provider pushes an update. Over the next couple of weeks the output shifts. Users notice before the team does, because users are looking at individual answers while the team is looking at dashboards. Support tickets tick up and internal complaints start. Eventually someone investigates and finds that weeks of degraded output have already shipped to customers.
The whole stretch between the change and the discovery is unmeasured time. It’s preventable, and it usually isn’t prevented.
There are three reasons teams miss it. The first is that they have no baseline. You can’t detect a shift in quality if you never measured what “good” looked like to begin with, and most teams have never put a number on the quality of their AI output. The second is that whatever checking happens is manual and occasional. A person eyeballing ten responses on a Friday will not catch a small, consistent degradation spread across thousands of them. The third is that the things teams do measure are the wrong things. Latency, token count, and error rate tell you the system is running. They tell you nothing about whether the answers are any good.
Watching the pipe, not the water
This is starting to change. New Relic’s 2025 observability survey found that AI monitoring adoption climbed from 42 percent to 54 percent in a single year, the first time it crossed into the majority of organizations.
That’s real progress, but it also means close to half of the organizations surveyed still have nothing watching their AI at all. And “monitoring” in most of these setups means uptime, latency, and cost. That tells you the system is up. It does not tell you the output got worse.
I’d draw a hard line between infrastructure monitoring and output-quality monitoring. Most teams have the first and assume it covers the second. It doesn’t.
If you want to know whether your AI is degrading, you have to measure the output itself, on purpose, over time. Closing the gap takes four steps. Start by defining what “good” means for your specific use case, which is the hardest step and the one most teams skip. Score it automatically: test suites against known answers for structured tasks like classification and extraction, and a rubric another model can apply at scale for subjective tasks like chat and content. Run that scoring on a continuous sample instead of an occasional manual review. And set a threshold that alerts you when the score drops, so you hear about it in hours instead of reading about it in a month of support tickets.
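To make those four steps concrete, here is a minimal sketch in Python for a structured task with known answers. Everything in it is an assumption for illustration: the names `exact_match_score`, `measure_baseline`, and `QualityMonitor` are hypothetical, `call_model` stands in for whatever function wraps your provider's API, and the window size and allowed drop are placeholder numbers you would tune for your own traffic. A subjective task would swap the exact-match scorer for a rubric-based one.

```python
from collections import deque
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Deque, List, Optional, Tuple


def exact_match_score(output: str, expected: str) -> float:
    """Score one response against a known answer (1.0 = match, 0.0 = miss)."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def measure_baseline(
    cases: List[Tuple[str, str]],
    call_model: Callable[[str], str],  # hypothetical wrapper around your provider
) -> float:
    """Run a fixed test suite of (prompt, expected) pairs and return the mean score."""
    return mean([exact_match_score(call_model(p), expected) for p, expected in cases])


@dataclass
class QualityMonitor:
    baseline: float         # mean score recorded when the output was known to be good
    max_drop: float = 0.05  # illustrative: alert if the rolling mean falls this far
    window: int = 200       # illustrative: number of recent sampled scores to average
    scores: Deque[float] = field(default_factory=deque)

    def record(self, score: float) -> Optional[str]:
        """Add one sampled score; return an alert message if quality has dropped."""
        self.scores.append(score)
        if len(self.scores) > self.window:
            self.scores.popleft()
        if len(self.scores) < self.window:
            return None  # not enough samples yet to compare against the baseline
        current = mean(self.scores)
        if current < self.baseline - self.max_drop:
            return (
                f"quality alert: rolling score {current:.3f} "
                f"vs baseline {self.baseline:.3f}"
            )
        return None
```

In use, you would call `measure_baseline` once against a version you have approved, store the number, and then feed `record` a score for each sampled production response, routing any non-`None` return value to whatever alerting channel the team already watches. The design choice worth noting is the rolling window: it trades a little detection delay for resistance to one-off bad answers, which is what lets a small, consistent degradation stand out.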
None of that is exotic. It’s the same discipline any other production system gets. We just haven’t extended it to the part of the stack that changes without telling us.
Most teams genuinely can’t answer whether their AI is getting better or worse. And the providers aren’t going to tell them. The version you approved in the demo isn’t guaranteed to be the version running in production next quarter. If you can’t measure the gap, you’re trusting that it isn’t there.