5 ways your A/B test is lying to you (and the engineering fixes)

A/B test results mislead more often than teams admit, and almost always because of how the experiment was instrumented, not because of the statistics. Here are the five failures I see most, and how to build so they don't happen.

Hey, I’m Andrew. I’ve spent six years building growth features, and for most of that time the difference between a real win and a fake one has come down to instrumentation. This is the first in a series on the engineering side of growth.

A few years ago I watched a team ship a feature, run an A/B test, see a 12% lift, and roll it out to everyone. Three weeks later the metric it was supposed to move hadn’t moved at all. The test said win. Reality said nothing happened.

The feature was fine. The instrumentation was not.

Here’s the uncomfortable part: an A/B test result is only as trustworthy as the measurement underneath it, and measurement is an engineering problem long before it’s a statistics problem. You can run a textbook-clean experiment (proper randomization, a real significance threshold, a pre-registered hypothesis) and still get a confidently wrong answer, because the events feeding that calculation were defined badly, fired in the wrong place, or bolted on after the feature was already built.

Most A/B testing guides talk about statistics. This one is about the engineering: the part that actually decides whether your result means anything. Here are the five failures I see most, and how to build so they don’t happen to you.

1. You instrumented after you built, not before

The most common mistake is treating instrumentation as a final step. The feature gets built, it works, and then someone says “we should track this” and a few analytics calls get sprinkled in near the end.

The problem: by then you’ve already made a dozen decisions that shape what’s measurable, and you made all of them without asking what the experiment needs. Where does the user “enter” the feature? What counts as having used it? When are they far enough along to call it a success? Answer those after the code exists and you answer them based on where it’s convenient to drop a tracking call, not on what the experiment needs to know.

The fix is a sequence change, not a tooling change. Before you write the feature code, write down the experiment’s questions: what’s the exposure event, what’s the success event, what are the guardrail metrics that tell you the feature didn’t quietly break something else. Then build so those events are first-class: emitted from the right place, with the right data, every time. Instrumentation designed up front is precise. Instrumentation added at the end is wherever the cursor happened to be.
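One way to make "write the questions down first" concrete is a small spec object that has to be complete before feature work starts. This is a minimal sketch, not a prescribed format: `ExperimentSpec`, its fields, and every event name in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """What the experiment needs to measure, written before the feature code."""

    name: str
    hypothesis: str
    exposure_event: str   # fires at first contact with the change
    success_event: str    # the outcome the feature is supposed to move
    guardrail_metrics: tuple[str, ...] = ()  # things that must not get worse

    def missing_pieces(self) -> list[str]:
        """Return the names of the questions that are still unanswered."""
        missing = []
        if not self.exposure_event:
            missing.append("exposure_event")
        if not self.success_event:
            missing.append("success_event")
        if not self.guardrail_metrics:
            missing.append("guardrail_metrics")
        return missing


# Hypothetical example: all names below are made up for illustration.
spec = ExperimentSpec(
    name="one_click_reorder",
    hypothesis="A reorder button on the order page increases repeat purchases.",
    exposure_event="reorder_button_rendered",
    success_event="repeat_order_completed",
    guardrail_metrics=("refund_rate", "support_tickets_per_order"),
)
print(spec.missing_pieces())  # an empty list means the spec is complete
```

A check like `missing_pieces()` could run in code review or CI, so a feature branch without an exposure event or guardrails gets flagged before the build rather than after.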

2. Your event definitions drift between flows

“Activated.” “Signed up.” “Completed onboarding.” These sound precise. In most codebases they aren’t. The same conceptual event gets emitted from three places by three people, and each version means something slightly different.

One path fires the event when the user lands on the success screen. Another fires when the API call returns. A third fires on a click, before the action has even succeeded. Now your experiment is comparing a treatment group measured one way against a control group measured another, and the difference you’re reading might be entirely an artifact of which code path each group traveled.

The fix is to treat event definitions as shared contracts, not strings. Define each meaningful event once (its name, exactly when it fires, the properties it carries) in a single place the whole codebase imports from. A typed event schema is ideal: it makes the definition unambiguous, makes drift a compile error instead of a silent data bug, and gives anyone adding a new flow one correct source to copy. “Activated” should mean precisely one thing, no matter who emits it or where.
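Here is a rough sketch of what a single shared event definition could look like. The names (`EventName`, `EVENTS`, `build_event`) and the firing condition for "activated" are assumptions for illustration. In Python the check happens when the event is built, or earlier if you run a type checker, rather than at compile time; a statically typed language would catch the same drift sooner.

```python
from dataclasses import dataclass
from enum import Enum


class EventName(str, Enum):
    """The only place event names are spelled out."""

    ACTIVATED = "activated"
    SIGNED_UP = "signed_up"


@dataclass(frozen=True)
class EventDefinition:
    name: EventName
    fires_when: str                      # documented firing condition
    required_properties: frozenset[str]  # exact property set, no more, no less


# Hypothetical registry: one definition per event, imported by every flow.
EVENTS = {
    EventName.ACTIVATED: EventDefinition(
        name=EventName.ACTIVATED,
        fires_when="the server confirms the user's first project was created",
        required_properties=frozenset({"user_id", "project_id"}),
    ),
    EventName.SIGNED_UP: EventDefinition(
        name=EventName.SIGNED_UP,
        fires_when="the account record is committed",
        required_properties=frozenset({"user_id"}),
    ),
}


def build_event(name: EventName, **properties) -> dict:
    """Build an event payload, rejecting anything that drifts from the contract."""
    definition = EVENTS[name]
    given = set(properties)
    missing = definition.required_properties - given
    extra = given - definition.required_properties
    if missing or extra:
        raise ValueError(
            f"{name.value}: missing={sorted(missing)} unexpected={sorted(extra)}"
        )
    return {"event": name.value, "properties": properties}


payload = build_event(EventName.ACTIVATED, user_id="u1", project_id="p1")
print(payload)
```

The point of the design is that a new flow cannot emit its own private version of "activated": it either matches the one definition or fails loudly.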

3. You peeked, and then you stopped early

This one is half psychology, half engineering. Experimentation dashboards update continuously, so it’s tempting to watch. You check on day two, see significance, ship.

The problem: a result wandering across the significance line on its way to a real sample size is not the same as a result that is actually significant. Check repeatedly and stop the moment you like what you see, and you’re not running an experiment. You’re running a slot machine until it pays out. The false-positive rate of “check every day, stop when green” is far higher than the threshold you think you set, because every extra look is another chance for noise to cross the line.
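If you want to see the slot machine for yourself, a small A/A simulation does it: both arms share the same true conversion rate, so every "significant" result is a false positive. The sketch below is illustrative only. The traffic numbers, the 10% conversion rate, and the plain two-proportion z-test are all assumptions, not a description of any particular platform.

```python
import math
import random


def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for two arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 0.0
    return (conv_b / n - conv_a / n) / se


def simulate(experiments=500, days=10, users_per_day=100,
             rate=0.10, z_crit=1.96, seed=7):
    """Return (false-positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        ever_significant = False
        significant_now = False
        for _day in range(days):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            significant_now = abs(z_score(conv_a, conv_b, n)) > z_crit
            ever_significant = ever_significant or significant_now
        peeking_hits += ever_significant  # "stop when green" would have shipped
        fixed_hits += significant_now     # only the planned final look counts
    return peeking_hits / experiments, fixed_hits / experiments


peeking_rate, fixed_rate = simulate()
print(f"stop when green: {peeking_rate:.1%}, single planned look: {fixed_rate:.1%}")
```

The single planned look should land near the nominal 5%. The peeking rate can never be lower, since the final look is one of the peeks, and in practice it comes out well above it.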

The fix is partly discipline, partly defaults. Decide the sample size or run duration before the experiment starts, and treat it as a commitment. Better, build the guardrail into the tooling: surface a clear “not ready, don’t call this yet” state until the predetermined sample is reached, so reading the result early takes deliberate effort instead of being the default view. If your platform supports always-valid statistics or sequential testing, that’s a genuine fix to the peeking problem rather than a workaround. Use it. Either way, the decision to stop should be made by a rule set in advance, not by your mood on a Tuesday.
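A sketch of what that default could look like: compute the required sample before launch, then refuse to show a result until it is reached. The sample-size function uses the standard normal-approximation formula for comparing two proportions. The function names, the 10% baseline, and the 10% relative lift are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_users_per_arm(baseline: float, relative_lift: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect the given lift with a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def readout(required: int, users_control: int, users_treatment: int,
            observed_lift: float) -> str:
    """Show the result only once both arms have reached the planned sample."""
    collected = min(users_control, users_treatment)
    if collected < required:
        return f"NOT READY: {collected}/{required} users per arm. Don't call this yet."
    return f"READY: observed lift {observed_lift:+.1%}"


# Decided before launch, then treated as a commitment.
needed = required_users_per_arm(baseline=0.10, relative_lift=0.10)
print(needed)
print(readout(needed, users_control=3000, users_treatment=3100, observed_lift=0.12))
```

The early number is deliberately left out of the "not ready" message. If someone wants it, they have to go looking, which is the friction you are trying to build in.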

4. You logged exposure in the wrong place

Exposure (the event that says “this user was actually in the experiment”) is the most important event in the whole setup, and the easiest to get subtly wrong.

Log exposure too early and you count users who were assigned to the experiment but never reached the thing being tested. They dilute both groups with people the change couldn’t possibly have affected, and they push your measured effect toward zero. Log it too late (after some piece of the new experience has already had its chance to act) and you’ve thrown away part of the effect you’re trying to measure.
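The dilution is easy to put numbers on. The sketch below assumes that users who never reach the feature convert at the same rate in both arms, since the change cannot touch them. All the rates are made up for illustration.

```python
def rates_at_assignment(reach_fraction: float, control_rate: float,
                        treatment_rate: float, unreached_rate: float):
    """Conversion rates per arm when exposure is logged at assignment.

    Each arm is a blend of users who reached the feature and users who never
    did; the second group converts at the same rate in both arms.
    """
    unreached = (1 - reach_fraction) * unreached_rate
    control = reach_fraction * control_rate + unreached
    treatment = reach_fraction * treatment_rate + unreached
    return control, treatment


# Hypothetical numbers: only 25% of assigned users ever reach the feature.
control, treatment = rates_at_assignment(
    reach_fraction=0.25, control_rate=0.20, treatment_rate=0.24, unreached_rate=0.05
)
true_diff = 0.24 - 0.20
measured_diff = treatment - control

print(f"true difference among reached users: {true_diff:.3f}")
print(f"measured difference at assignment:   {measured_diff:.3f}")
print(f"true relative lift:     {true_diff / 0.20:.1%}")
print(f"measured relative lift: {measured_diff / control:.1%}")
```

Under these assumptions the absolute difference shrinks by exactly the reach fraction: a 4-point effect reads as a 1-point effect, and the experiment needs far more traffic to detect it.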

The fix is to define exposure as the precise moment the user could first be affected by the treatment, and to log it exactly there. Not at assignment, not at page load, but at genuine first contact with the change. This is also why failure #1 matters so much: getting exposure right is nearly impossible to retrofit. It has to be a deliberate decision made while the feature is being built.
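In code, "log it exactly there" usually means the exposure call sits at the point where control and treatment first diverge, and fires once per user for both arms. A minimal sketch, assuming an in-memory dedupe set and a hypothetical `render_checkout` function standing in for the real render path:

```python
class ExposureLogger:
    """Logs one exposure per (experiment, user), at first contact only."""

    def __init__(self, sink):
        self._sink = sink    # any callable that accepts an event dict
        self._seen = set()   # in-memory dedupe; a real system would persist this

    def log_first_contact(self, experiment: str, user_id: str, variant: str) -> bool:
        key = (experiment, user_id)
        if key in self._seen:
            return False     # already exposed, don't double count
        self._seen.add(key)
        self._sink({
            "event": "experiment_exposure",
            "experiment": experiment,
            "user_id": user_id,
            "variant": variant,
        })
        return True


def render_checkout(user_id: str, variant: str, exposures: ExposureLogger) -> str:
    # Exposure is logged here, where the two experiences first differ,
    # and for BOTH variants, so the groups are measured the same way.
    exposures.log_first_contact("new_checkout_button", user_id, variant)
    return "new button" if variant == "treatment" else "old button"


events = []
logger = ExposureLogger(sink=events.append)
render_checkout("u1", "treatment", logger)
render_checkout("u1", "treatment", logger)  # second render, no second exposure
render_checkout("u2", "control", logger)
print(len(events))
```

Users who were assigned but never hit `render_checkout` produce no exposure event at all, so they stay out of the analysis instead of diluting it.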

5. You ignored the guardrails and celebrated the win

The fifth failure is the quietest. The experiment moved its target metric, the team celebrated, and nobody checked what else moved.

A change that lifts your conversion number can simultaneously raise refund rates, drop a quality metric, or shift load onto a downstream system, and if you only instrumented the win condition, you’ll never see any of it. You’ll ship a “win” that’s actually a trade, and you won’t find out until the cost shows up somewhere nobody connected back to the experiment.

This is a documented pattern, not a hypothetical. Microsoft’s experimentation team has written publicly about cases where a feature improved its target metric while degrading a guardrail, and about treating guardrail metrics as first-class, not optional. The fix is cheap: every experiment defines at least one guardrail metric up front (the things that must not get worse), and reading the result means reading those too. A win that you can’t confirm left the guardrails intact isn’t a win yet.
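Reading the guardrails can be as simple as a check that runs alongside the headline number. The sketch below compares point estimates only and assumes every guardrail is a metric where higher is worse; a real readout would also want confidence intervals. `Guardrail`, the metric names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str
    max_relative_increase: float  # 0.05 tolerates the metric rising by 5%


def violated_guardrails(guardrails, control: dict, treatment: dict) -> list[str]:
    """Return the guardrail metrics that got worse than their tolerance."""
    violations = []
    for guardrail in guardrails:
        before = control[guardrail.metric]
        after = treatment[guardrail.metric]
        relative_change = (after - before) / before
        if relative_change > guardrail.max_relative_increase:
            violations.append(guardrail.metric)
    return violations


# Hypothetical readout: conversion went up, but what else moved?
guardrails = [
    Guardrail("refund_rate", max_relative_increase=0.05),
    Guardrail("p95_latency_ms", max_relative_increase=0.05),
]
control = {"refund_rate": 0.020, "p95_latency_ms": 400}
treatment = {"refund_rate": 0.026, "p95_latency_ms": 404}

broken = violated_guardrails(guardrails, control, treatment)
print(broken or "guardrails intact")
```

With these made-up numbers the refund rate rose 30% against a 5% tolerance, so the "win" gets flagged as a trade before anyone ships it.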

A checklist you can actually use

Before an experiment starts, confirm all of this:

  1. The questions are written down first. Exposure event, success event, and guardrail metrics defined before any feature code is written.
  2. Every event has a single definition. Each event exists once, as a typed contract, with a documented firing condition and property set.
  3. Exposure fires at first genuine contact. Not at assignment, not at page load: the precise moment the user could first be affected.
  4. The stopping rule is set in advance. Sample size or duration decided before launch, with the tooling making “not ready yet” the visible default.
  5. Guardrails are read alongside the win condition. At least one metric confirms the feature didn’t quietly degrade something else.

None of this is statistics. It’s all engineering: decisions about where code lives, how events are defined, what the tooling makes easy. That’s the real lesson: by the time a result reaches a statistician, its trustworthiness has already been decided by the engineers who built the instrumentation. Instrumentation is a design decision you make before the build, not a step you tack on at the end.

Get that right and the statistics can do their job. Get it wrong and no amount of statistical rigor will save you. You’ll just be very precise about a number that was never real.


I write a weekly piece on the engineering side of growth: experimentation, onboarding, and conversion, from someone who builds it. If that’s useful, you can subscribe here. And if you’ve got a war story about an experiment that lied to you, reply and tell me. I collect them, and the best ones end up in a future issue.