Proving the Fix Worked: Hypothesis Testing for People Who Skipped Statistics
An intake team at a regional distributor was convinced they had fixed a chronic delay. Through 2021, pandemic-era supply disruptions had stretched their order-confirmation cycle, and after a workflow change the numbers looked better. The improvement lead wanted to declare victory. Her Black Belt asked one quiet question: how do we know this isn't just a good month? That question is what hypothesis testing exists to answer.
You do not need a statistics degree to use it. In the Improve and Control phases of DMAIC, hypothesis testing simply gives you a disciplined way to decide whether a difference you observe is real or could plausibly be noise. The whole idea rests on one habit of mind that is uncomfortable at first but powerful: you start by assuming your change did nothing.
The two competing claims
Every test sets up two statements. The null hypothesis says there is no real difference — the before and after are effectively the same, and any gap is chance. The alternative hypothesis says there is a genuine difference. You then ask: if the change truly did nothing, how surprising is the result we actually saw?
State the question in plain language first. Ours was: did the new workflow reduce average confirmation time? Average is the key word — it points to comparing two means.
Pick the matching test. Comparing the average of a before group and an after group on a continuous measure like hours is a two-sample t-test. Comparing pass/fail rates would instead call for a proportions test.
Set the threshold before you look. The team chose the conventional significance level of 0.05, which means accepting a 5 percent risk of declaring an improvement when the change actually did nothing. Deciding this in advance stops you from rationalizing whatever result you get.
Read the p-value as a measure of surprise. It estimates the chance of seeing a difference at least this large if the change actually did nothing. It is not the probability that the change did nothing. Below your threshold, you reject the null and treat the improvement as real.
The team pulled 30 confirmations from before the change and 30 after, a common rule-of-thumb sample size for comparing averages and one that was easy to gather. The before-average was about 19 hours; the after-average about 14. The t-test returned a p-value of 0.01. Because that is below 0.05, the team rejected the null hypothesis: a drop of this size would be unlikely to appear if the workflow change had done nothing. They treated the fix as real.
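For readers who want to see the arithmetic behind a result like this, here is a minimal sketch in Python using only the standard library. It runs Welch's version of the two-sample t-test from summary numbers. The means (19 and 14 hours) and group sizes (30 each) come from the story above; the standard deviation of 7 hours per group is an assumption made purely for illustration, because the team's spread was not reported. For that reason the output will not reproduce the team's 0.01 exactly. The function names are also illustrative, not from any particular package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: the area in both tails beyond |t_stat|.

    The central area over [-|t|, |t|] is integrated with Simpson's rule
    (steps must be even) and subtracted from 1.
    """
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # the density is symmetric around 0
    return max(0.0, 1.0 - central_area)

def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t-test from summary statistics."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df, two_sided_p(t_stat, df)

ALPHA = 0.05       # threshold chosen before looking at the data
ASSUMED_SD = 7.0   # hypothetical spread in hours; not reported in the story

t_stat, df, p_value = welch_t_test(19, ASSUMED_SD, 30, 14, ASSUMED_SD, 30)
print(f"t = {t_stat:.2f}, df = {df:.0f}, p = {p_value:.4f}")
print("reject the null" if p_value < ALPHA else "cannot reject the null")
```

Under that assumed spread, the p-value lands below the 0.05 threshold. A larger spread would push it up, which is why the raw data, not just the two averages, decides the outcome. In practice a statistics package does this in one call; the point of the sketch is only to show that nothing mysterious is happening inside it.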
The traps that fool honest people
Confusing 'not significant' with 'no effect.' A high p-value often means your sample was too small to tell, not that nothing changed.
Treating the 0.05 threshold as sacred. A p-value of 0.06 is not failure; it is a signal to plan a fresh, larger sample, not to give up.
Ignoring practical significance. A change can be statistically real yet too small to matter operationally — always ask whether the size of the effect justifies the effort.
Cherry-picking the time window until the numbers cooperate. Set your sample plan before you peek at results.
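Several of these traps come down to planning the sample before collecting it. As a rough sketch, the standard normal-approximation formula below estimates how many observations each group needs to detect a given effect. The 7-hour standard deviation and the 80 percent power target are assumptions for illustration, not figures from the team's project, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def cohens_d(mean_before, mean_after, sd):
    """Standardized effect size: the gap between means in units of spread."""
    return (mean_before - mean_after) / sd

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided comparison of two means.

    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2,
    which slightly understates what an exact t-based calculation gives.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

ASSUMED_SD = 7.0  # hypothetical spread in hours

d = cohens_d(19, 14, ASSUMED_SD)
n = n_per_group(d)
n_small_effect = n_per_group(d / 2)  # same plan, but for an effect half the size

print(f"effect size d = {d:.2f}, needs about {n} per group")
print(f"half that effect needs about {n_small_effect} per group")
```

Because the effect size is squared in the denominator, halving the effect you hope to detect roughly quadruples the sample you need. That is why a non-significant result from a small sample says little about whether a modest improvement exists.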
What made the difference for this team was not the math — software did the calculation in seconds. It was the discipline of framing the question, choosing the test, and committing to a threshold before looking. That sequence turned a hopeful anecdote into evidence they could defend to a skeptical operations director. The change stayed in place, and when a later tweak failed to clear the same bar, they had the confidence to roll it back instead of shipping a non-improvement.
Hypothesis testing is not about being clever with numbers. It is about being honest about uncertainty — refusing to confuse a lucky stretch with a durable gain. For any team trying to prove an improvement is worth keeping, that honesty is the whole point.