Avoiding the A/B Testing Trap

August 5, 2021

Other

The Importance of Structuring the Correct Control Group Split

Marketers have known for years that testing is key to developing impactful ad messages and product offers before scaling their campaign investments. However, not every marketer structures and conducts experiments correctly — including basic A/B tests. One of the most important decisions in structuring A/B tests is the user split between the control group and the variation group. Risk aversion can lead to incorrectly splitting these groups and can result in misleading results and lost profits. This is especially true in ecommerce retail and the impact on revenue is something that is often overlooked by retail marketers.

Why Control Group Splits are More Important Than You Think

A/B tests are a powerful way to reliably increase the profits of an ecommerce business as they can be used to test a variety of drivers of revenue: different recommendation AI models, product offers, CTAs or a new shopping cart funnel. Thanks to the tools such as Google Optimize and Adobe Target they are also pretty common and fairly easy to orchestrate these tests. Unfortunately, it’s also very easy to make a simple oversight when setting up A/B tests and it can actually cost brands money.

One of the A/B test design decisions is the user split: the proportion of users who will be attributed to the control group vs. to the treatment group. There are some who are considering splits other than 50/50 (an equal split) a viable alternative and they assert that a 80/20 split is a safer alternative for risk averse marketers. Their argument is that in the event the tested alternative performs worse than the control groups, the test will lose less money. That explanation sounds intuitive and therefore is rarely questioned. However, this argument is hiding the ugly truth – instead of saving money, the uneven split (80/20) is bound to miss more revenue than the equal split (50/50).

How is that so?

Beyond Cost: The Impact of Test Duration on Revenue

When running the A/B test we’re hoping to observe a measurable difference between the control group and variant group. However, both groups must accumulate a minimum audience exposure to become statistically significant — and that takes time. The further we deviate from an equal (50/50) exposure split, the more time is needed for the test to be run for the performance difference to become statistically significant. What is often overlooked is that this additional testing time can negatively impact revenue.

Consider an example A/B test where the goal is to design the test to be able to identify at least a 5% improvement conversion rate:

Conversion Rate = 2.6% (testing KPI)
Avg. daily sessions = 100,000
Average Order Value = $100
Tolerable error = 5%
New minimum target for variant group: 2.73% CVR (5% improvement of CVR). This is calculated as follows: 2.6% x (1 + 5%) = 2.73%. Note that the smaller the performance improvement, the longer the test duration.

Using a frequentist approach, both of our groups will need a minimum sample size of 399,000 sessions to draw statistically relevant conclusions.

Where:

Z – inverse cumulative density function
λ – treatment/control ratio
pc – control CVR
pt – treatment CVR

The test duration needed to produce a sufficient sample size would be:

8 days with a 50/50 (equal) split
20 days with a 80/20 split

Now let’s assume that the worst thing happened: our proposed test variant actually lowered conversion rate by 5%.

Let’s take a look at the impact to revenue:

50/50 split: we lose $6,500 daily for 8 days ($51,800* in total)
80/20 split: we lose $2,600 daily for 20 days ($51,800* in total)

* Rounded value

The 80/20 split is not the only unequal testing option, but the phenomenon is the same when compared to a 50/50 equal split.

If our proposed variant group delivers a 5% worse conversion rate, we would need 12 more days to validate our finding. Sure, we lose less money per day, but our revenue declines for a longer timeframe. No matter the split, in case of a failed variant, revenue lost amounts to the same value.

Do we have a tie then? Not exactly: by choosing an 80/20 split, we are unnecessarily prolonging the test duration, thus delaying both full implementation of our variant and the start of the next test in the pipeline. By doing so, we’re delaying harvesting additional revenue into the future — and this is a financial loss that is often overlooked by marketers.

The Bottom Line

By choosing an uneven split, marketers are not necessarily reducing risk, but rather guaranteeing lost revenue. What’s worse, they are putting off the start of the next test, thus delaying another chance to secure another source of revenue.

For a detailed analysis, review the spreadsheet.

Want to know more? Talk to an expert.