A/B Testing with a Data Science Approach
If your A/B testing doesn’t seem to work, you might be making one of the common mistakes, such as the peeking problem, a wrong split, or a wrong interpretation. These mistakes can wipe out the profit from an experiment and can even damage the business.
As a data scientist, I want to describe the design principles of A/B tests based on data science techniques. They will help you ensure that your A/B tests show you statistically significant results and move your business in the right direction.
Define a quantifiable success metric
As you may already know, A/B testing, or split testing, is a randomized experiment in which you want to choose the best variant of two hypotheses. The use cases of starting such testing can be a landing page redesign, headline testing, banner testing, and so on.
Before starting a range of experiments, the main goal is to define a quantifiable success metric for your experiment. It should reflect the changes and play a fundamental role in making the right decision.
The metrics usually reflect the business’s goals. The most popular metrics to measure a hypothesis are:
- conversion rate (page view to signup, page view to button click, etc.)
- economic metrics (revenue per shift, average check, etc.)
- behavioral metrics (depth of pageview, average session duration, user retention rate, feature usage)
For example, we want to add a product video on the Statsbot main landing page and test how it performs compared to the page with a product image. A success metric for us would be the conversion rate from a page view into a signup.
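As a minimal sketch in Python, the success metric is just a ratio of signups to page views per variant (all traffic numbers below are hypothetical, for illustration only):

```python
# Conversion rate as a success metric: signups / page views for each variant.
# The counts here are made-up illustration values, not real Statsbot data.

def conversion_rate(signups, page_views):
    """Fraction of page views that end in a signup."""
    return signups / page_views

control = conversion_rate(signups=120, page_views=4000)    # page with image
treatment = conversion_rate(signups=150, page_views=4100)  # page with video

print(f"control:   {control:.4f}")    # 0.0300
print(f"treatment: {treatment:.4f}")  # 0.0366
```

Whatever metric you pick, make sure it can be computed the same way for both groups over the same period.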
Randomize your traffic
Let’s split our clients into two groups: A and B. The first group will continue to see the old version of a landing page and the second will start interaction with the new page with a video.
The success of the whole A/B test depends on the right split into groups. The details can vary from case to case, but the main requirement is that the two samples be homogeneous.
It’s an extremely sensitive issue, which influences all your further actions. People often come up with pseudo-random splits, which actually correlate with age, gender, nationality, geo, etc. Data science approaches help prevent the cases when split groups are dependent on some factors.
One of the well-known ways to validate a split is the intraclass correlation coefficient (ICC), which can demonstrate the difference in feature distribution between the split groups:
- The ICC value is close to 0, or even slightly below 0 (since Fisher’s formula for the ICC is unbiased). This means the splitting strategy is a good one.
- The ICC value is close to -1 or 1. This means you’d better try another splitting strategy.
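A rough sketch of this check in Python, using the one-way ANOVA form of the ICC on a covariate such as age (the data here is simulated purely for illustration):

```python
import numpy as np

def icc_one_way(groups):
    """One-way ICC(1,1) across equally sized groups.
    Values near 0 mean the grouping explains almost no variance in the
    feature (a homogeneous split); values far from 0 mean the groups
    differ systematically and the split should be redone."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups[0])   # observations per group
    n = len(groups)      # number of groups
    grand_mean = np.mean(np.concatenate(groups))
    # Between-group and within-group mean squares (one-way ANOVA)
    ms_between = k * sum((g.mean() - grand_mean) ** 2 for g in groups) / (n - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(42)
# A genuinely random split: ages in both groups come from the same distribution
group_a = rng.normal(35, 10, 500)
group_b = rng.normal(35, 10, 500)
print(round(icc_one_way([group_a, group_b]), 3))  # close to 0 -> homogeneous
```

Run the same check for each feature you suspect might correlate with the assignment (age, geo, traffic source, and so on).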
Besides, you need to take the split ratio into account. A 50:50 split is the most popular choice for simple A/B testing and leads to the quickest results. Nevertheless, many companies act more cautiously and split traffic 20:80 or even 10:90, with the small fraction given to the experiment group, because an unsuccessful experiment can cause a big loss of conversions or revenue.
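One common way to get a reproducible split with an arbitrary ratio is deterministic hashing of a user id; the sketch below is one possible strategy, not the only valid one:

```python
import hashlib

def assign_group(user_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministic, roughly uniform assignment: hash the user id and
    map the hash to [0, 1]. The same user always lands in the same group,
    and the hash is independent of age, gender, geo, etc."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "B" if bucket < treatment_fraction else "A"

# A cautious 20:80 split: only ~20% of traffic sees the experimental page
groups = [assign_group(f"user-{i}", treatment_fraction=0.2) for i in range(10_000)]
print(groups.count("B") / len(groups))  # close to 0.2
```

Because the assignment is a pure function of the user id, the split survives page reloads and repeat visits without any stored state.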
Achieve statistical significance
Even if you see a breathtaking increase in your quantifiable metric, you’d better wait until the experiment achieves statistical significance.
The main metrics that affect the statistical significance of any experiment are the effect size, the sample size, and the alpha significance level.
In an A/B experiment we have two hypotheses (H0 against the alternative H1) and calculate the appropriate statistic depending on the selected statistical criterion. For our example, we test whether the means of the two samples are equal (H0) or whether they differ (H1).
The p-value (our measure of statistical significance) is the probability of observing a statistic at least as extreme as the one measured, assuming the null hypothesis is true. If the p-value is less than a certain threshold (typically 0.05), we reject the null hypothesis H0; otherwise we fail to reject it.
One of the most popular statistical tests for an A/B experiment is Student’s t-test for two samples, which checks the equality of means. It performs well even for small amounts of data, since it takes the sample size into account when assessing significance. Choose a statistical test that suits the type and distribution of your metric.
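For illustration, here is a two-sample test with SciPy on simulated per-user metrics (the numbers are made up; Welch’s variant is used so the two groups need not have equal variances):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-user metric, e.g. session duration in minutes
control = rng.normal(5.0, 2.0, 200)
treatment = rng.normal(6.0, 2.0, 200)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the means differ significantly")
else:
    print("Fail to reject H0: no significant difference detected")
```

For a conversion-rate metric (a proportion rather than a continuous value), a two-proportion z-test or chi-squared test is usually a better fit than the t-test.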
There are also many services that can help you calculate the number of visitors needed to achieve your goal.
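If you prefer to compute it yourself, the standard normal-approximation formula for comparing two proportions gives the required sample size per group; this sketch assumes a two-sided test at the usual 5% significance and 80% power:

```python
from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    """Visitors needed per variant to detect a change in conversion rate
    from p1 to p2, using the normal approximation for two proportions:
    n = (z_{1-a/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2) + 1

# Detecting a lift from 3% to 4% conversion needs roughly 5,300 visitors
# per group; a larger expected lift needs far fewer visitors.
print(sample_size_per_group(0.03, 0.04))
print(sample_size_per_group(0.03, 0.05))
```

Note how strongly the required sample size depends on the effect size you want to detect: halving the detectable lift roughly quadruples the traffic you need.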
Always rely on statistical tests; don’t just compare the means or medians of the control and experiment groups. That can lead you the wrong way: two samples can have exactly equal means while the underlying effects are totally different. The smaller the overlap between the two distributions, the more confidently we can say that the effect is really significant.
Interpretation problems
Finally, we have finished the A/B test. If the experiment group has shown a statistically significant improvement in our success metric, we can confidently add the product video to the homepage of our website.
If not, it’s important to analyze the obtained result and revise the whole cycle of the experiment. Perhaps the problem is a wrong split, a testing period that affects specific subgroups of people, or some other influence that wasn’t taken into account.
Such external and internal factors can be:
- advertising campaigns
- day of the week
- weather or seasonality
- a spike in market activity
- call-center operations
- employee actions
The most common and tricky mistake that can occur at this stage of A/B testing is the so-called peeking problem. To avoid it, define the sample size before the experiment and calculate the result only once that sample has been collected: neither too early nor too late.
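A quick simulation makes the peeking problem concrete: run A/A tests (both groups drawn from the same distribution, so there is no real effect) and compare the false-positive rate of a fixed-sample design against a design that tests after every peek and stops at the first "significant" p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_samples, n_peeks = 500, 1000, 20
fp_fixed = fp_peeking = 0

for _ in range(n_experiments):
    # A/A test: both groups come from the SAME distribution, so H0 is true
    # and any "significant" result is a false positive.
    a = rng.normal(0, 1, n_samples)
    b = rng.normal(0, 1, n_samples)
    # Peeking: test after every batch and stop at the first significant result
    checkpoints = np.linspace(n_samples // n_peeks, n_samples, n_peeks, dtype=int)
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints):
        fp_peeking += 1
    # Fixed design: a single test on the full, pre-committed sample
    if stats.ttest_ind(a, b).pvalue < 0.05:
        fp_fixed += 1

print(f"fixed design:  {fp_fixed / n_experiments:.1%}")    # close to the nominal 5%
print(f"with peeking:  {fp_peeking / n_experiments:.1%}")  # several times higher
```

With twenty peeks, the chance of at least one spurious "significant" result is far above the nominal 5%, which is exactly why the sample size must be fixed in advance.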
Key advice
A/B testing is nonetheless a subtle and treacherous thing if you do it badly. The following advice, based on a data science approach, can help you benefit from your experiments.
- Define a success metric and effect size before starting the test.
- Never rely on your intuition alone, and don’t stop the experiment before it achieves statistical significance.
- Test for statistical significance only after the experiment finishes, keeping the peeking problem in mind.
- Think about the period of testing. It’s a bad idea to run an A/B test during the weekend or holidays. The right experiment should cover all weekdays, all traffic sources, and so on.
- Try to break a big novelty into smaller “subnovelties,” because the combined effect can show no improvement while the smaller changes can separately increase the metric.
- Don’t expect a dramatic improvement in the metric. The majority of successful A/B tests give a 1–2% increase in the metric.
- Be careful of data noise in your test; statistical criteria can’t catch it. For example, many people may be curious about a new feature or new design when it is first released. This leads to abnormal behavior and should be cleaned up in the final comparison.
- Don’t start A/B testing if you have only a few clients. Achieving statistical significance can take months or even years, and the results will be wrong in most cases.
Don’t be afraid of A/B testing your hypothesis, just do it right using the data science approaches above.
What you need to do is form a hypothesis with a success metric, randomize your traffic correctly, achieve statistical significance on the whole sample, and interpret the results taking into account as many factors as you can. Otherwise, you risk wasting a lot of resources on insights that mess up your business.