When there is no there there: going large with A/B testing and MVT

Gertrude Stein was right.   There is no there there.   There is no Facebook.  There is no Google.  There is no Amazon.  There is no such thing as a Zynga game.   There isn’t even a  Bing.

I’m not talking about what it’s like living off-grid, by choice or necessity.

I’m talking about the fact that when we interact online with any of these major services, we interact in one of the local reality zones of their multiverses. The dominant large-scale consumer internet apps and platforms do not exist in a single version. They all deploy multiple variant versions of themselves simultaneously, to different people, and pit the variants against each other to see which ones work better. The results of these tests are used to improve the design. Or so goes the theory. (It should be borne in mind that such a process was probably responsible for the “design” of the human backbone.)

This test-driven approach to design development and refinement has been promoted and democratised as a “must-have” for all software-based startups. Eric Ries, of Lean Startup fame, is probably its most famous proponent. (Andrew Chen is worth checking out, too, for a pragmatic view from the trenches.)

How do the big platform providers do it? Lashings of secret sauce are probably involved.    But there is a lot of published commentary lying around from which the main ingredients of the sauce can be discerned –  even if the exact formulation isn’t printed on the label.   Here are some resources I’ve found useful:

  • Greg Linden, the inventor of Amazon’s first recommendation engine, has a nice collection of fireside tales about his early work at Amazon in his blog, including how he got shopping cart recommendations deployed (spoiler:  by disobeying an order – and testing it in the wild)
  • Josh Wills, ex-Google,  now Director of Data Science at Cloudera,  talks about  Experimenting at Scale at the 2012 Workshop on Algorithms for Modern Massive Data Sets, and provides some analytical and experimental techniques for meeting the challenges involved
  • Ron Kohavi, ex-Amazon, now at Microsoft, has a recent talk and a recent paper about puzzling results from experimentation and how his team has resolved them: his 2012 ACM RecSys keynote speech and his 2012 KDD paper.

There are some commonalities of approach across these talks. Assignment of people to experiments, and to experimental treatments, is done via a system of independent layers, so that an individual user can be in multiple experimental treatments at once. Kohavi talks about how this can go wrong, and some ways of designing around it using a modified, localised layer structure.
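To make the layering idea concrete, here is a minimal sketch of how layered assignment might be done with per-layer salted hashing. This is an illustration only, not how Google or Microsoft actually implement it, and the layer names, experiment names and traffic splits are all invented.

```python
import hashlib

# Hypothetical sketch of layered experiment assignment: each layer hashes the
# user id with its own salt, so a user's bucket in one layer is effectively
# independent of their bucket in every other layer. Layer and experiment
# names below are made up for illustration.

def bucket(user_id: str, layer_salt: str, num_buckets: int = 1000) -> int:
    """Deterministically map a user into one of num_buckets for a given layer."""
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Each layer owns the full bucket range and carves it up among its experiments.
LAYERS = {
    "ranking_layer": {"ranker_v2": range(0, 100)},   # 10% of traffic
    "ui_layer":      {"blue_button": range(0, 50)},  # 5% of traffic
}

def treatments_for(user_id: str) -> list[str]:
    """Return every experimental treatment this user falls into, at most one per layer."""
    assigned = []
    for layer, experiments in LAYERS.items():
        b = bucket(user_id, layer_salt=layer)
        for experiment, bucket_range in experiments.items():
            if b in bucket_range:
                assigned.append(experiment)
    return assigned

print(treatments_for("user-42"))  # e.g. ['ranker_v2'] or ['ranker_v2', 'blue_button'] or []
```

Because each layer hashes with its own salt, the buckets a user lands in are effectively independent across layers; that independence is also what can bite you when treatments in different layers interact, which is the failure mode Kohavi discusses.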

Another efficiency-boosting practice is the use of Bayesian bandit algorithms to decide on the size of experimental groups and the length of the experiment. This practice is most familiar in clinical trials, where adaptive experimentation is used to ensure that as soon as a robust effect has been found, the trial is halted, enabling the ethically desirable outcome that beneficial treatments are not withheld from those who would benefit, and injurious treatments are stopped as soon as they are identified as such. It’s so much flavour of the month that there is now a SaaS provider, Conductrics, which will enable you to use it as a plugin. They also have a great blog, so check it out if you’re interested in this topic. Google Analytics Content Experiments also provides support for this, in a more constrained way.
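For a feel of how the bandit part works, here is a minimal Thompson sampling sketch in the Beta-Bernoulli flavour. The conversion rates and the stopping rule are invented for illustration; this is not a description of how Conductrics or Google Analytics implement it.

```python
import random

# Minimal Thompson sampling sketch for a two-variant test with binary outcomes
# (convert / don't convert). The "true" rates and the stopping rule are
# invented for illustration only.

TRUE_RATES = {"A": 0.10, "B": 0.12}
alpha = {v: 1.0 for v in TRUE_RATES}  # Beta posterior: 1 + successes
beta = {v: 1.0 for v in TRUE_RATES}   # Beta posterior: 1 + failures

for visitor in range(1, 100_001):
    # Draw a plausible conversion rate for each variant and show the best draw.
    draws = {v: random.betavariate(alpha[v], beta[v]) for v in TRUE_RATES}
    chosen = max(draws, key=draws.get)

    converted = random.random() < TRUE_RATES[chosen]
    alpha[chosen] += converted
    beta[chosen] += not converted

    # Crude stopping rule: halt once one variant wins ~99% of posterior draws.
    if visitor % 1000 == 0:
        wins_b = sum(
            random.betavariate(alpha["B"], beta["B"])
            > random.betavariate(alpha["A"], beta["A"])
            for _ in range(2000)
        )
        if wins_b > 1980 or wins_b < 20:
            print(f"Stopped after {visitor} visitors; P(B beats A) ~ {wins_b / 2000:.2f}")
            break
```

The allocation skews towards the better-looking variant as evidence accumulates, which is what buys the efficiency, and the experiment stops itself once the posterior is sufficiently one-sided.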

So there are lots of hints and tips about optimising the mechanics of running a test. But there is much less said about what to test, and how to organise a series of tests. Which is, for most people, the $64 million question. This is something I’ve been thinking about, talking about and advising on. I’m still working it through, though – and if you are too, and you know of any interesting resources I’ve missed, do share them with us.

7 thoughts on “When there is no there there: going large with A/B testing and MVT”

  1. Hi, thanks for the kind words about Conductrics and our blog. We try to make it useful for the general analytics community. I also read your post on recommendation systems. Adaptive tests may be flavor of the month now, but we have been doing them since 2010 ;-). You might want to think about coupling RecSys with bandits. I know Yahoo did some work on this – take a look at http://pages.cs.wisc.edu/~beechung/icml11-tutorial/ – I think this gets at what you are looking for. This way you can seed your system with historical data (use your favorite matrix factorization), but then efficiently sample variations online using some sort of bandit algorithm to try to improve results, and perhaps address any drift in the environment. There is also a video of their KDD talk on videolectures.net.
    Thanks again!
    Matt
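As a rough illustration of the “seed with historical data, then sample online” idea in the comment above: the sketch below initialises per-item estimates from pretend matrix factorization scores and then refines them with a simple epsilon-greedy loop. The item names and scores are invented, and this is not a description of the Yahoo work or of Conductrics.

```python
import random

# Rough sketch of "seed with historical data, then explore online": offline
# model scores (here standing in for matrix factorization predictions)
# initialise each item's estimated click-through rate, and an epsilon-greedy
# loop keeps refining those estimates from live feedback. All numbers invented.

offline_scores = {"item_a": 0.21, "item_b": 0.18, "item_c": 0.15}

estimate = dict(offline_scores)               # running CTR estimate per item
shows = {item: 1 for item in offline_scores}  # pseudo-count so the prior carries some weight
EPSILON = 0.1                                 # fraction of traffic reserved for exploration

def recommend() -> str:
    if random.random() < EPSILON:
        return random.choice(list(estimate))  # explore at random
    return max(estimate, key=estimate.get)    # exploit the current best guess

def record_feedback(item: str, clicked: bool) -> None:
    shows[item] += 1
    estimate[item] += (clicked - estimate[item]) / shows[item]  # incremental mean update
```

A real system would use a proper bandit (e.g. Thompson sampling or an upper-confidence-bound method) over the factorized features, but the seeding-plus-online-sampling shape is the same.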


    • Hey, Matt – Thanks! Your blog really helped me fill in some missing vocab – I think at least some of what I’m looking for will be filed under reinforcement learning – and it’s got loads of great refs in it – already spent some time looking at Ng’s stuff this afternoon….;-) There’s a huge amount of stuff going on in the dynamic ad/content optimisation space, for sure – out of London alone there’s Cognitive Match, Rummble Labs, Peerius, Visual DNA, Ad Totum and those are just the ones I’ve run into by accident – but it’s hard to tell what each one really does, standing outside the black box. So I’ll chase the Yahoo reference up pronto.


  2. “…adaptive experimentation is used to ensure that as soon as a robust effect has been found, the trial is halted, enabling the ethically desirable outcome that beneficial treatments are not withheld from those who would benefit, and injurious treatments are stopped as soon as they are identified as such.”

    The problem with adaptive experimentation is that one cannot determine the *long term* effect of a test, and whether a change is lasting or just a blip. A feature or tweak can reach significance within a week, but unless significance holds steady for X days (how many is up for debate), one often sees an almost inconceivable swing in the other direction, from significance to insignificance; by a month the test turns out to be noise, or the opposing shard ends up victorious.

    Although one can see the long-term effects after a test is concluded, in some cases one might expect the variants to regress towards the mean, and other times one might expect the lift to remain constant.

    In a fast-paced business environment, adaptive experimentation may be a must, or perhaps the optimal way of testing and iterating, but beware of treating such a method as gospel: collapsing to the seemingly significant variants immediately will add up as misfires and may bite a company in the ass in the long run.
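A quick way to see the flip described above is to simulate an A/A test, where the two arms are identical by construction, and peek at a significance test every day. The traffic numbers and threshold below are arbitrary; this is a sketch of the peeking problem in general, not of any particular platform’s procedure.

```python
import random

# A/A simulation: both "variants" share the same true conversion rate, yet
# checking a two-proportion z-test every day often finds a spurious
# "significant" winner at some point before the test ends.
# All parameters are arbitrary.

def z_stat(conv_a, n_a, conv_b, n_b):
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

RATE, DAILY_VISITORS, DAYS, RUNS = 0.10, 1000, 30, 500
spurious = 0
for _ in range(RUNS):
    conv_a = conv_b = n_a = n_b = 0
    for day in range(DAYS):
        conv_a += sum(random.random() < RATE for _ in range(DAILY_VISITORS))
        conv_b += sum(random.random() < RATE for _ in range(DAILY_VISITORS))
        n_a += DAILY_VISITORS
        n_b += DAILY_VISITORS
        if abs(z_stat(conv_a, n_a, conv_b, n_b)) > 1.96:  # "significant" at ~5%
            spurious += 1  # declared a winner even though there is no real difference
            break

print(f"Declared a winner in {spurious / RUNS:.0%} of A/A runs when peeking daily")
```

Run it a few times and the proportion of runs that hit “significance” at some point comes out well above the nominal 5%; that is the misfire risk the comment warns about, and whether a genuine lift then persists or regresses towards the mean is the separate long-term question.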


    • One of the “interesting” complexities that you touch on is whether what you really really really care about is something that can only be measured a long time out. In which case all the measures you take in the interim need to be treated with a healthy pinch of salt (er, or whatever the healthy equivalent of that is….;).


    • Actually, you have this problem regardless of whether you are using hypothesis testing or one of the more explicit solutions to a bandit problem. If you think about it, using hypothesis testing for policy selection is like a step function: you play at random, then after a PREDETERMINED amount of time/samples you either reject or fail to reject – and pick a ‘winner’. The adaptive methods are more like an S-curve: they also play at random at first, but then continuously adjust the probability of selection, kinda like creating a biased roulette wheel, where higher performing actions get larger slices of the wheel.
      Of course, if there is drift (a non-stationary environment), both approaches can fail, because the sample will be structurally different from your production environment. However, the adaptive methods sample throughout the optimization, whereas hypothesis testing samples first and then applies the learning, so adaptive methods can be structured to continuously sample a small percentage of the time to pick up any changes in the environment.
      That said, adaptive methods are not necessarily best for all problems. The adaptive approach is really of value in problems that are: 1) perishable, with a short shelf life; 2) set in a complex environment – say you are also using targeting, so hypothesis testing gets complicated (I guess one would use ANOVA with blocking variables?), and since you don’t know the frequency of seeing the various user features, you might have a good sample size for frequent features but not for rare ones at any given point in time; which leads to 3) automation – say you are spawning many optimization projects; using adaptive methods gives you a framework for automating the optimization so that you don’t need a human to evaluate the hypothesis test (which has its own issues – misinterpretation of p-values, using p-values as a stopping rule, etc.).
      So for low-risk, high-frequency decisions, adaptive is probably a good idea. If you are making a major, discrete change (say, the branding of the site), then by all means make it a more formal test. We let users choose between adaptive and non-adaptive tests, since what makes the most sense really depends on the nature of the problem. If I may, here is a link to a post we have that gets at the opportunity costs of experimentation and why you might want to use an adaptive test: http://conductrics.com/balancing-earning-with-learning-bandits-and-adaptive-optimization/ (please feel free to delete this if you would rather I not link off the blog 🙂).
      Thanks, and feel free to reach out directly with any questions.
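The “biased roulette wheel” picture maps onto code fairly directly. Below is a loose sketch, not Conductrics’ implementation, in which selection probabilities are continuously re-weighted by observed reward while a fixed slice of traffic is always sampled at random, which is the “continuously sample a small percent of the time” idea for coping with drift. All action names and numbers are invented.

```python
import random

# Loose sketch of the "biased roulette wheel": selection probabilities are
# continuously re-weighted by observed average reward, while a fixed
# exploration slice keeps sampling at random so drift can still be noticed.
# Not any vendor's actual implementation; names and numbers are invented.

ACTIONS = ["layout_1", "layout_2", "layout_3"]
reward_sum = {a: 0.0 for a in ACTIONS}
plays = {a: 1.0 for a in ACTIONS}     # one pseudo-play each avoids divide-by-zero
EXPLORE_SLICE = 0.10                  # 10% of traffic always chosen at random

def choose_action() -> str:
    if random.random() < EXPLORE_SLICE:
        return random.choice(ACTIONS)          # the always-on random sample
    means = {a: reward_sum[a] / plays[a] for a in ACTIONS}
    total = sum(means.values())
    if total == 0:
        return random.choice(ACTIONS)          # no evidence yet, stay uniform
    spin = random.uniform(0, total)            # bigger mean => bigger slice of the wheel
    running = 0.0
    for action, mean in means.items():
        running += mean
        if spin <= running:
            return action
    return ACTIONS[-1]

def update(action: str, reward: float) -> None:
    plays[action] += 1
    reward_sum[action] += reward
```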

