When there is no there there: going large with A/B testing and MVT

Gertrude Stein was right. There is no there there. There is no Facebook. There is no Google. There is no Amazon. There is no such thing as a Zynga game. There isn’t even a Bing.

I’m not talking about what it’s like living off-grid, by choice or necessity.

I’m talking about the fact that when we interact online with any of these major services, we interact in one of the local reality zones of their multiverses. The dominant large-scale consumer internet apps and platforms do not exist in a single version. They all deploy multiple variant versions of themselves at once, to different people, and pit them against each other to see which ones work better. The results of these tests are used to improve the design. Or so goes the theory. (It should be borne in mind that such a process was probably responsible for the “design” of the human backbone.)

This test-driven approach to design development and refinement has been promoted and democratised as a “must-have” for all software-based startups. Eric Ries, of Lean Startup, is probably its most famous advocate. (Andrew Chen is worth checking out, too, for a pragmatic view from the trenches.)

How do the big platform providers do it? Lashings of secret sauce are probably involved.    But there is a lot of published commentary lying around from which the main ingredients of the sauce can be discerned –  even if the exact formulation isn’t printed on the label.   Here are some resources I’ve found useful:

  • Greg Linden, the inventor of Amazon’s first recommendation engine, has a nice collection of fireside tales on his blog about his early work at Amazon, including how he got shopping cart recommendations deployed (spoiler: by disobeying an order – and testing it in the wild)
  • Josh Wills, ex-Google,  now Director of Data Science at Cloudera,  talks about  Experimenting at Scale at the 2012 Workshop on Algorithms for Modern Massive Data Sets, and provides some analytical and experimental techniques for meeting the challenges involved
  • Ron Kohavi, ex-Amazon, now at Microsoft, has a recent talk and a recent paper about puzzling results from online experimentation and how his team resolved them: his 2012 ACM RecSys keynote and his 2012 KDD paper.

There are some commonalities of approach across these large-scale experimenters. Assignment of people to experiments, and to experimental treatments, is done via a system of independent layers, so that an individual user can be in multiple experimental treatments at once. Kohavi talks about how this can go wrong, and some ways of designing around it using a modified, localised layer structure.
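
As a rough illustration of the layering idea – my own minimal sketch, not anyone’s actual infrastructure – each layer hashes the user id with its own salt, so a user gets exactly one treatment per layer, and assignments in different layers are effectively independent of each other. The layer names and treatments below are invented for the example.

```python
import hashlib

def bucket(user_id, layer_salt, num_buckets=1000):
    # Hash the user id together with the layer's salt, so that each layer
    # produces its own (approximately) independent bucketing of users.
    digest = hashlib.sha256(f"{layer_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign(user_id, layer_salt, treatments):
    # Map the user's bucket in this layer onto one of the layer's treatments.
    return treatments[bucket(user_id, layer_salt) % len(treatments)]

# The same user sits in one treatment per layer - say a ranking experiment
# and a UI experiment at once - because each layer uses a different salt.
user = "user-42"
print(assign(user, "ranking-layer", ["control", "new_ranker"]))
print(assign(user, "ui-layer", ["control", "blue_button", "green_button"]))
```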

Another efficiency-boosting practice is the use of Bayesian bandit algorithms to decide how traffic is allocated between experimental groups, and how long the experiment should run. This practice is most familiar from clinical trials, where adaptive experimentation is used to halt the trial as soon as a robust effect has been found, enabling the ethically desirable outcome that beneficial treatments are not withheld from those who would benefit, and injurious treatments are stopped as soon as they are identified as such. It’s so much flavour of the month that there is now a SaaS provider, Conductrics, which will enable you to use it as a plugin. They also have a great blog, so check it out if you’re interested in this topic. Google Analytics Content Experiments also provides support for this, in a more constrained way.
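
To make that concrete, here is a minimal sketch of Thompson sampling, one common Bayesian bandit method, for a hypothetical two-arm test with binary (convert / don’t convert) outcomes. It’s a toy illustration of the underlying idea rather than what Conductrics or Google Analytics actually run, and the arm names and conversion rates are invented.

```python
import random

# Hypothetical two-arm experiment with binary outcomes. Each arm keeps a
# Beta(successes + 1, failures + 1) posterior over its conversion rate.
arms = {"A": {"success": 0, "failure": 0},
        "B": {"success": 0, "failure": 0}}

def choose_arm():
    # Thompson sampling: draw a rate from each arm's posterior and play
    # the arm with the highest draw.
    draws = {name: random.betavariate(s["success"] + 1, s["failure"] + 1)
             for name, s in arms.items()}
    return max(draws, key=draws.get)

def record(arm, converted):
    arms[arm]["success" if converted else "failure"] += 1

# Simulated traffic: arm B genuinely converts better, so over time the
# bandit shifts most of the traffic to it, rather than holding fixed,
# pre-sized groups for the whole run.
true_rate = {"A": 0.05, "B": 0.08}
for _ in range(10_000):
    arm = choose_arm()
    record(arm, random.random() < true_rate[arm])

print(arms)
```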

So there are lots of hints and tips about optimising the mechanics of running a test. But there isn’t as much said about what to test, and how to organise a series of tests. Which is, for most people, the $64 million question. This is something I’ve been thinking about, talking about, and advising on. I’m still working it through, though – and if you are too, and you know of any interesting resources I’ve missed, do share them with us.

Facebook says it isn’t an echo chamber. But is it a hall of mirrors?

Putting aside for a moment the actual questions the Facebook Data Team asked and the answers they got, both of which are interesting (so I will try to get to grips with them – in a later post), I think the most surprising thing about this study is the fact that Facebook publicly used itself as an experiment, and nobody blinked. What they did in the study was only a tiny, weensy bypass operation. (The experiment involved withholding some newsfeed items from its users that they would normally have seen as a result of the operation of the service.) But it was surgery none the less.

My own private speculation is that Facebook experiments on itself all the time, and then quietly gets on with applying the lessons it learns by doing so.   But the Data Team’s world-facing work is usually correlational and observational rather than directly experimental.  So this work is different: here, they did tweak around with Facebook, and they did publish their results.  And I am surprised that nobody seems to be interested in that fact.  I don’t have an issue with the fact they did it.  In fact,  I’d be a bit disappointed if they didn’t.  But then I’m an experimentalist by background.

Don’t get the wrong idea: I do care whether I see stuff that’s sent to me. When I get the post delivered from the postman each morning, I don’t expect him to randomly hide some of it from me just to see what would happen. Interestingly, I have heard more than one story suggesting that this is just what (a few lone and deranged) postal workers sometimes actually do. But although this isn’t unheard of, at least as an urban myth, it’s not what I expect, and were the behaviour to be discovered, I would expect it to be stopped. Ignoring my post is my job, not my postman’s.

But my view of my Facebook Feed is different. My understanding of my Feed is that it is cooked up according to a secret sauce recipe which, although it isn’t exactly to my personal taste, represents Facebook’s best efforts at optimising something of interest to it. And I believe the recipe for this sauce is constantly evolving, although the brand remains the same. So to find there has been a tiny systematic tweak made to it, whereby some information was hidden from some people when it would normally have been displayed, is neither a big shock, nor a bad one.

What surprises me about it is that it seems to have been so generally unsurprising.  I have a few different theories about this:

1.  the Eric Ries theory

The Lean Startup ideas of Eric Ries have become so pervasive and “baked into the DNA” of our culture that everyone who thinks about the matter expects to become part of some massive multivariate test whenever they encounter any application or platform.

2.  the filter bubble theory

Nobody who would potentially have been offended or puzzled by having their NewsFeed tweaked around with actually understood what was happening.

Here are some screenshots I took this morning showing the relative numbers of people who publicly lauded the research summary, versus the full research article.

The score is as follows: over 5,000 social actions performed on the summary, and 165 on the article itself.

3.  the common sense theory

The change made was so non-material in its potential and actual impact that nobody in their right minds could possibly make a big deal of it.

So, no shortage of theories.  But I’ve no idea which one is right.   Do you?    My guess about the recipe is:  20% Theory 1, 60% Theory 2, and 20% Theory 3.

The highly munchable and crunchable soundbite about the study, distributed with the summary, was that the research demonstrates that Facebook is not an echo chamber. This meme has bounced around languidly, albeit dominantly, following the release of the research. I believe that the research, while interesting, does not actually warrant this conclusion directly. But what the research does demonstrate, by its very existence, is that whether or not Facebook is an echo chamber, for sure it sometimes acts like a hall of mirrors.

Source: ItDan - Flickr