Exploiting A/B Testing for Fun and Profit (Part 1)

This post is based on a talk I gave at the Ekoparty security conference in 2016. Given the number of requests I received on the subject, I’m publishing it as a blog post so it can reach a wider audience.

A/B Testing for everyone

From the Obama campaign in 2008 to companies like Facebook or Amazon, A/B testing has been used for over a decade to make decisions based on user-generated data, yet little research has been done on its security implications. The trend has continued, and many products now ship with A/B testing as a built-in feature, including advertising platforms such as Facebook Ads or Google Ads. This brings A/B testing to the masses and makes it an increasingly ubiquitous way to make decisions.

A/B testing can be thought of as a way for companies of all sizes to compare how different versions of their products are received by their customers and to decide which one to use (each such comparison is usually called an experiment). Experiments range from simple algorithms that display whichever image triggers a particular metric the most to complicated multivariate setups. Some experiments have a small impact on the website (for example, which image to show or which text copy to use), but others decide anything from whether a new product should be launched to how harsh a newspaper headline should be¹.

A short explanation of how A/B testing works: when deciding whether to introduce a new feature, product or piece of text, you split your audience into two groups, a control group and a treatment group, and make sure each member of your audience only ever sees one of the two versions. The new feature is shown to the treatment group, and you then use statistics to estimate the effect of that feature on your metrics. If the effect is positive you make the feature permanent; otherwise you discard the experiment as unsuccessful.

An example would be adding a search box for browsing the items in a shopping category. If it increases your metric (the number of items bought per user) by a larger margin than you would expect from random chance, you start showing the search box to all users, which ends the experiment with the search box variant as the “winner”.
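To make the idea of beating random chance concrete, here is a minimal sketch of one common way to check it, a two-proportion z-test, written in Python. It is not necessarily what any particular A/B testing framework does. The function name, the visitor counts and the simplification of the metric to “did the user buy anything” are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in control (no search box).
    conv_b / n_b: conversions and visitors in treatment (search box).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 10,000 visitors per group.
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
if p_value < 0.05 and z > 0:
    print("The search box variant would be declared the winner.")
```

With a conventional 5% significance threshold, the experiment is declared a success only if the p-value falls below 0.05. That threshold is what an attacker would try to push a result across.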

A capture from 2016, taken with the Abnormal tool, of Instagram testing two different captions on its landing page. One could guess that the metric being tracked is the number of registrations per caption, which would decide which caption to keep.

One interesting property of A/B testing is that its results are often not obvious. This gives an attacker leverage: they can push an experiment toward whatever result they want, knowing that the victim site would still accept it even if it seems counterintuitive. This can have dangerous implications, which we examine further below.

Manipulating A/B Testing experiments

The main threat that A/B testing frameworks introduce is that third parties can maliciously influence experiments and steer them for their own benefit. Experiments usually end in either an automatic decision based on particular metrics or a human reviewing those metrics before deciding. In both cases, an attacker who can manipulate the metrics can push the experiment toward the outcome they choose.

One way to do this is to create multiple sessions that fake the behaviour of multiple users, skewing those metrics to change how the website behaves or the business logic behind it. External attackers are not the only ones who might be interested in subverting these decisions: internal employees might also want to prove that their own decisions or implementations are better.

To achieve this an attacker might follow three steps:

Fake users being generated and retrieving different versions of the same site.

In the first step, the attacker creates hundreds or thousands of fake users that pretend to have different devices, operating systems, user agents, language locales and IP addresses, aiming for as much diversity as possible and matching what the attacker believes the victim site’s user distribution to be.

In the second step, the attacker reviews all the versions that were served, identifies which fake users saw each one, and groups the fake users by version.

Fake users viewing version B are mapped to Experiment B.

As mentioned before, experiments have an objective, which is to maximize one metric or a set of metrics. In some cases attackers will need to rely on different sources (source code, blog posts, talks, publications and business knowledge) to identify the metrics the victim site might use to decide whether an experiment is successful. For example, for a news website without any registration capabilities, ‘clicks on a news article’ and ‘amount of time spent on an article’ might be the metrics used to decide whether one headline is better than another.

In the third step, since we want version B to win over version A, we trigger the metrics we think the victim site cares about only for the fake users that were shown version B.

While a greater number of fake users increases the probability that a particular version is accepted by the victim site, it’s important to understand that a large number of bots pretending to be users might not be needed. It can be enough to give a small unfair advantage to the version the attacker is interested in, tipping the balance in their favor.
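The arithmetic behind this can be sketched with a small calculation. It assumes the victim site picks winners with a two-proportion z-test at a 5% significance level, that both versions truly convert at the same rate, and that the only fake users counted are ones that land on version B and always convert. All function names and numbers are hypothetical.

```python
import math


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test (A = control, B = treatment)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def fake_conversions_needed(n_per_arm, real_conversions, alpha=0.05, limit=10_000):
    """Smallest number of always-converting fake users on B that makes B 'win'.

    Real traffic is identical in both arms, so without fake users there is
    no difference at all between A and B.
    """
    for fake in range(limit + 1):
        p_value = two_sided_p_value(
            real_conversions, n_per_arm,
            real_conversions + fake, n_per_arm + fake,
        )
        if p_value < alpha:
            return fake
    return None  # Not reachable within the search limit.


# Hypothetical experiment: 2,000 real users per arm, 5% of them convert.
needed = fake_conversions_needed(n_per_arm=2000, real_conversions=100)
print(f"Fake converting users needed on B: {needed}")
```

In this toy setup the count comes out as a small fraction of the 2,000 real users per arm, which is the point of the paragraph above: the attacker only has to tip the balance, not outnumber the real audience.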

In the talk’s first demo I used a tool I built called Abnormal to create hundreds of fake users through open proxies. These users then targeted certain metrics, showing how the A/B testing framework would let the manipulated variant of the experiment become the ‘winner’.

Another option for the attacker, either on its own or combined with targeting the key metrics of the experiment, is to generate negative metrics for the fake users assigned to the undesired version, for example by stopping all interaction shortly after viewing it. This makes full use of every fake user created: the ones on the version the attacker wants to keep trigger positive metrics, and the ones on the other versions trigger negative metrics.

How this could be exploited at scale

There is one more angle from which an attacker might manipulate experiments, on a much larger scale. Previously I described how an attacker might create fake users to alter an experiment, making the victim site believe its users prefer one version over another. The same kind of manipulation could also come from compromised machines that are part of a botnet, or from MITM attacks that allow the attacker to modify the content of requests even when they are secured with TLS.

As an example, sites like the New York Times would also perform experiments on their headlines:

Image provided by https://freedom-to-tinker.com/2016/05/26/a-peek-at-ab-testing-in-the-wild/

If news websites use A/B testing experiments to decide their headlines or content², a sufficiently motivated attacker could exploit real users’ interactions with the different versions of the website so that the version the attacker prefers is the one the company chooses. For example, it is well known that latency affects how well a website converts on its main metrics. Amazon found that 100ms of extra latency meant 1% lower sales³. An attacker could pick the versions they don’t want to succeed and add enough latency that users click less or spend less time on the articles. Other possibilities go as far as degrading mouse activity, increasing the friction the user faces and, as a consequence, worsening the metrics of the experiment.

In my talk’s second demo I showed how a MITM attack could perform sentiment analysis on the headlines of a newspaper website and disable clicks on the ones expressing a negative sentiment about the topic being manipulated. This would negatively impact the metrics of those headlines on a newspaper that was known to perform A/B testing on its headlines. A sufficiently advanced adversary could thereby influence which headlines are used and, through them, the reader’s perception of the topic involved.

So what can we do about it?

Covering the defensive measures needed to mitigate the risk of a third party exploiting your A/B testing framework for their own benefit would take an article of its own, but some highlights to take into account are:

  • You should remove suspicious bot activity from experiment data when possible.
  • After an experiment ends, you should still follow up on the users that influenced it to detect possible malicious behaviour, such as them being fake users.
  • You could keep a set of users outside of all experiments, follow their behaviour over time, and detect whether what look like short-term improvements in experiments are actually decreasing your main metrics in the long run.
  • Results that seem counterintuitive should still be analysed rather than accepted blindly because a metric improved.
  • The selection of a user’s variant in an experiment needs to be random enough that it cannot be predicted or manipulated by a malicious third party.
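
For the last point, one possible approach is to derive the variant from a keyed hash of the user and experiment identifiers, so that assignments are stable per user but cannot be computed in advance without a server-side secret. The following is a sketch in Python under that assumption. The key, function names and identifiers are illustrative, and the same idea is reused to carve out a holdout group that stays outside every experiment.

```python
import hashlib
import hmac


def _bucket(secret_key: bytes, label: str, user_id: str, buckets: int) -> int:
    """Map (label, user_id) to a stable bucket in [0, buckets) via HMAC-SHA256."""
    message = f"{label}:{user_id}".encode("utf-8")
    digest = hmac.new(secret_key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_holdout(secret_key: bytes, user_id: str, holdout_percent: int = 5) -> bool:
    """Users in the holdout never enter any experiment."""
    return _bucket(secret_key, "holdout", user_id, 100) < holdout_percent


def assign_variant(secret_key, experiment, user_id, variants=("A", "B")):
    """Return the variant for a user, or None if the user is held out."""
    if in_holdout(secret_key, user_id):
        return None
    return variants[_bucket(secret_key, experiment, user_id, len(variants))]


# Hypothetical key; in practice it would come from a secrets store.
SECRET_KEY = b"replace-with-a-random-server-side-secret"
print(assign_variant(SECRET_KEY, "headline-test", "user-42"))
```

A keyed hash only addresses predictability. An attacker can still observe which variant each fake user received after the fact, so this would complement bot filtering and follow-up analysis rather than replace them.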

Slides of the talk
Video of the talk
The code used in the talk

For details on how A/B testing could be used by penetration testers you can check the second part of the post here.
