The history effect induces a bias in an AB Framework
Unlike lab tests, AB tests are not run in isolation and are therefore prone to being affected by external events. With the orthogonal splits in our new AB framework, we have largely ensured that any external event affects control (A) and the variations (B, C, etc.) equally, so its effect cancels out in the comparison.
Yet, with our high-velocity experimentation approach and limited set availability, we are not completely immune to the history effect. Here the effect is not from any external events but from the preceding experiments themselves. Literally, the ghost of preceding experiments…^_^
In certain scenarios, the effect of an experiment can outlast the experiment itself by a few days, or at times by weeks. If the set on which the experiment was running is immediately assigned to a new experiment (which happens quite often), the new experiment does not really start on a level playing field. The bias from the previous experiment is bound to manifest in the new one.
Bias through the history effect in our experiments can be broadly categorized into two types: the pull forward effect and delayed response.
Example of bias through Pull Forward Effect:
Let's say we ran a telesales experiment on high-propensity leads, in which leads in control (A) were not called, and we saw an x% lift in orders. After confirming that the lift was statistically significant, we went ahead and made it live for all users.
With the sets released from the telesales experiment, we started a new price hike experiment and immediately noticed a dip in orders in the variant (B) of the price hike experiment. The dip is most likely resistance to the price hike, but can we be confident? Since we are reusing the sets from the previous experiment, there is a high probability of fewer sales simply because we have already sold to many of the likely buyers through telesales, thus shrinking the pool (a minor pull forward effect).
- Deep-discounting experiments: One of the variants of the experiment offers deep discounting.
- Bulk Interest experiments: One of the variants doesn't offer to send bulk interest in the registration flow.
Example of bias through delayed response:
- Mailer/Notification experiment (variation on content, not on landing page): CTR on mailers/notifications follows an exponentially decaying curve. The effect, although decaying, persists for a few days beyond the send and would impact the next experiment if the same sets are used.
- Market Dev experiments: Leads stay with the advisors for a certain number of days, and conversion can happen at any point during this period.
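To put a rough number on the mailer case: if the residual effect is assumed to halve every fixed number of days, the wait needed for it to fall below a chosen tolerance follows directly from the decay formula. The sketch below is illustrative only; `cooloff_days`, the half-life and the tolerance are hypothetical names and values, not measurements from our experiments.

```python
import math


def cooloff_days(half_life_days: float, tolerance: float) -> int:
    """Days until an exponentially decaying effect shrinks to `tolerance`.

    `tolerance` is the fraction of the day-0 effect we are willing to carry
    into the next experiment (e.g. 0.05 for 5%). Both inputs are assumptions
    that would have to be estimated from real CTR curves.
    """
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    if not 0 < tolerance < 1:
        raise ValueError("tolerance must be between 0 and 1")
    # effect(t) = 0.5 ** (t / half_life)  =>  t = half_life * log2(1 / tolerance)
    return math.ceil(half_life_days * math.log2(1 / tolerance))


# Example: a mailer whose effect halves every 2 days, tolerated down to 5%.
print(cooloff_days(half_life_days=2, tolerance=0.05))
```

A slower decay (longer half-life) or a stricter tolerance both lengthen the wait, which is why the cool-off has to be judged per experiment type.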
This issue is largely unique to us, as 'pre-defined' sets are hardly used by other companies for AB testing. Most large companies like Facebook, Google and Microsoft, who run experiments on sessions, use a buffer period both before and after the experiment.
During the buffer period, an A/A test is run to check for biases. Also, this effect is only observed in existing users; new users are unaffected by it.
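An A/A check of this kind can be sketched as a comparison of conversion rates between two sets that receive the same experience. The snippet below uses a two-sided two-proportion z-test with a normal approximation; the choice of test, the function name `aa_check` and the numbers are assumptions made for illustration, not a description of what those companies actually run.

```python
import math


def aa_check(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Two-sided two-proportion z-test between two identically treated sets.

    Returns (z, p_value, biased). `biased` is True when the sets differ
    significantly even though nothing was changed, which hints at a
    carry-over from an earlier experiment.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (all or none converted): nothing to detect.
        return 0.0, 1.0, False
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value, p_value < alpha


# Hypothetical buffer-period counts: orders out of users in each set.
z, p_value, biased = aa_check(conv_a=500, n_a=10_000, conv_b=400, n_b=10_000)
print(f"z={z:.2f}, p={p_value:.4f}, biased={biased}")
```

If `biased` comes back True during the buffer period, the set is not yet on a level playing field and should not be handed to the next experiment.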
What’s the Solution?
- The best solution is to quarantine all sets for a specific period of time after an experiment concludes. This would lower the number of experiments we can run concurrently.
- Alternatively, we can decide at the beginning of an experiment whether it would have trailing effects and how long they would last, and pass these as parameters to the experiment. This way, only the sets of experiments with trailing effects would need to cool off after the experiment.
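The second option can be sketched as a small bookkeeping step: each experiment declares its expected trailing-effect duration up front, and a set is only released once that period has elapsed. Everything below (`Experiment`, `trailing_effect_days`, `release_date`, `free_sets`, the set names and dates) is hypothetical and only illustrates the idea; it is not how our framework is implemented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Experiment:
    name: str
    set_id: str
    end_date: date                 # day the experiment concludes
    trailing_effect_days: int = 0  # declared at setup; 0 means no cool-off


def release_date(exp: Experiment) -> date:
    """First date on which the experiment's set may be reused."""
    return exp.end_date + timedelta(days=exp.trailing_effect_days)


def free_sets(experiments: list[Experiment], today: date) -> list[str]:
    """Set IDs whose cool-off has fully elapsed by `today`."""
    latest: dict[str, date] = {}
    for exp in experiments:
        d = release_date(exp)
        # A set stays locked until the latest release date among its experiments.
        if exp.set_id not in latest or d > latest[exp.set_id]:
            latest[exp.set_id] = d
    return sorted(s for s, d in latest.items() if d <= today)


history = [
    Experiment("telesales", "set_1", date(2024, 1, 10), trailing_effect_days=14),
    Experiment("ui_copy", "set_2", date(2024, 1, 10)),
]
print(free_sets(history, date(2024, 1, 11)))  # only the set without trailing effects
print(free_sets(history, date(2024, 1, 24)))  # both sets once the cool-off has passed
```

Compared with a blanket quarantine, only sets coming out of experiments with declared trailing effects are held back, so fewer sets sit idle.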
Also, now that we are aware of this issue, we should go back and check whether any of our critical past experiments were affected by it.