When there is no there there: going large with A/B testing and MVT

Gertrude Stein was right.   There is no there there.   There is no Facebook.  There is no Google.  There is no Amazon.  There is no such thing as a Zynga game.   There isn’t even a  Bing.

I’m not talking about what it’s like living off-grid, by choice or necessity.

I’m talking about the fact that when we interact online with any of these major services, we interact in one of the local reality zones of their multiverses.   The dominant large-scale consumer internet apps and platforms do not exist in a single version.  They all deploy multiple variant versions of themselves simultaneously, to different people, and pit them against each other to see which ones work better.   The results of these tests are used to improve the design.  Or so goes the theory.  (It should be borne in mind that such a process was probably responsible for the “design” of the human backbone.)

This test-driven approach to design development and refinement has been promoted and democratised as a “must-have” for all software-based startups.   Eric Ries, of Lean Startup fame, is probably its most famous proponent.  (Andrew Chen is worth checking out, too, for a pragmatic view from the trenches.)

How do the big platform providers do it? Lashings of secret sauce are probably involved.    But there is a lot of published commentary lying around from which the main ingredients of the sauce can be discerned –  even if the exact formulation isn’t printed on the label.   Here are some resources I’ve found useful:

  • Greg Linden, the inventor of Amazon’s first recommendation engine, has a nice collection of fireside tales about his early work at Amazon in his blog, including how he got shopping cart recommendations deployed (spoiler:  by disobeying an order – and testing it in the wild)
  • Josh Wills, ex-Google,  now Director of Data Science at Cloudera,  talks about  Experimenting at Scale at the 2012 Workshop on Algorithms for Modern Massive Data Sets, and provides some analytical and experimental techniques for meeting the challenges involved
  • Ron Kohavi, ex-Amazon, now at Microsoft, has a recent talk and a recent paper about puzzling results from experimentation and how his team resolved them: his 2012 ACM RecSys keynote and his 2012 KDD paper.

There are some commonalities of approach among these large-scale experimenters.   Assignment of people to experiments, and to experimental treatments, is done via a system of independent layers, so that an individual user can be in multiple experimental treatments at once.   Kohavi talks about how this can go wrong, and about ways of designing around it using a modified, localised layer structure.
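
To make the layering idea concrete, here is a minimal sketch (mine, not Google’s or Microsoft’s actual infrastructure) of how a user can be assigned to one treatment per layer by hashing their ID with a per-layer salt, so that assignments in different layers are statistically independent. The layer names, treatments and traffic shares are invented:

```python
import hashlib

N_BUCKETS = 1000

def bucket(user_id: str, layer_name: str) -> int:
    """Hash the user id with a per-layer salt, so buckets in different layers are independent."""
    digest = hashlib.md5(f"{layer_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % N_BUCKETS

def assign(user_id: str, layers: dict) -> dict:
    """Pick one treatment per layer.

    `layers` maps a layer name to a list of (treatment, share) pairs
    whose shares sum to 1.0.
    """
    assignments = {}
    for layer_name, treatments in layers.items():
        b = bucket(user_id, layer_name)
        cumulative = 0.0
        for treatment, share in treatments:
            cumulative += share
            if b < cumulative * N_BUCKETS:
                assignments[layer_name] = treatment
                break
    return assignments

# One user can be in a ranking experiment and a UI experiment at the same time.
layers = {
    "ranking":   [("control", 0.5), ("new_model", 0.5)],
    "ui_colour": [("blue", 0.9), ("orange", 0.1)],
}
print(assign("user-42", layers))
```

Because each layer uses its own salt, knowing a user’s bucket in the ranking layer tells you nothing about their bucket in the UI layer, which is what keeps simultaneous experiments from confounding each other most of the time; Kohavi’s cautionary tales are about the exceptions.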

Another efficiency-boosting practice is the use of Bayesian bandit algorithms to adapt the allocation of traffic to treatments, and the length of the experiment, as results come in.   This practice is most familiar from clinical trials, where adaptive experimentation is used to halt a trial as soon as a robust effect has been found, with the ethically desirable outcome that beneficial treatments are not withheld from those who would benefit, and injurious treatments are stopped as soon as they are identified as such.   It’s so much flavour of the month that there is now a SaaS provider, Conductrics, which lets you use it as a plugin.  They also have a great blog, so check it out if you’re interested in this topic.   Google Analytics Content Experiments also provides support for this, in a more constrained way.
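
For the curious, here is the textbook Thompson-sampling flavour of a Bayesian bandit for a two-variant test. It is a sketch of the general idea, not Conductrics’ or Google’s actual implementation, and the conversion rates are invented:

```python
import random

# Beta(1, 1) priors: one [successes, failures] pair per variant.
arms = {"A": [1, 1], "B": [1, 1]}

def choose_arm() -> str:
    """Sample a plausible conversion rate for each variant and pick the best sample."""
    samples = {name: random.betavariate(a, b) for name, (a, b) in arms.items()}
    return max(samples, key=samples.get)

def record(arm: str, converted: bool) -> None:
    """Update the chosen variant's posterior with the observed outcome."""
    if converted:
        arms[arm][0] += 1
    else:
        arms[arm][1] += 1

# Simulated traffic: variant B truly converts a little better.
true_rates = {"A": 0.04, "B": 0.05}
for _ in range(10000):
    arm = choose_arm()
    record(arm, random.random() < true_rates[arm])

print(arms)  # B should end up with most of the traffic.
```

The point is that traffic drifts towards the better variant while the experiment is still running, which is exactly the adaptive, stop-early behaviour borrowed from clinical trials.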

So there are lots of hints and tips about optimising the mechanics of running a test.   But there is much less discussion of what to test, and how to organise a series of tests.  Which is, for most people, the $64 million question.   This is something I’ve been thinking about, talking about, and advising on.   I’m still working it through, though – and if you are too, and you know of any interesting resources I’ve missed, do share them with us.

Recommendation systems: the elephant in the room

I’ve been catching up with recommendation systems after a 10-year break, and am having trouble getting my head around the conventions of research practice in the area.   There is an elephant in the room, and it’s a big one.  I stubbed my toe pretty quickly on it.

One thing I’m having trouble with is the seemingly baked-into-the-DNA assumption that a recsys is something that predicts performance on a historical test set of data, given initial access to a training set of data.   Typically, the data being analysed is rating data, where people have rated a relatively small set of discrete consumer items, taken from a large and diverse product space.  The Netflix Prize is the most famous example, and you won’t have to go far to find a hackathon (or Kaggle competition)  that is making racks and racks of machines sweat doing something similar.
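
For concreteness, the sandbox exercise usually boils down to something like the sketch below: split historical ratings into train and test, predict the held-out ratings, and report an error metric such as RMSE. The data and the item-mean baseline here are invented placeholders:

```python
import math
import random
from collections import defaultdict

# Toy historical ratings: (user, item, rating).
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4),
           ("u2", "i3", 2), ("u3", "i2", 4), ("u3", "i3", 1)]

random.shuffle(ratings)
split = int(0.8 * len(ratings))
train, test = ratings[:split], ratings[split:]

# Baseline "model": predict an item's mean training rating.
sums, counts = defaultdict(float), defaultdict(int)
for _, item, r in train:
    sums[item] += r
    counts[item] += 1
global_mean = sum(r for _, _, r in train) / len(train)

def predict(item):
    return sums[item] / counts[item] if counts[item] else global_mean

rmse = math.sqrt(sum((predict(i) - r) ** 2 for _, i, r in test) / len(test))
print(f"RMSE on held-out ratings: {rmse:.2f}")
```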

There are powerful pragmatic reasons why playing with historical data in a sandbox is a popular methodology.   First and foremost: not everyone owns a massive live consumer platform they can tweak around with.  But what you do not get, by playing in the sandbox of the past, is the opportunity to create interventions which help to answer the questions that the data gathered so far cannot.  (And if you’re still with me after that sentence – rejoice! It’s over!)

The assumption that recsys practice can be significantly advanced by fitting historical data is, on the surface, highly pragmatic.  But it’s flawed.  At the very least I need some means – ideally conceptual, but a black box would be OK – of bridging the gap between this research practice and the design implications for a live recsys.   And it’s not clear what that bridge looks like without actually asking that question of the world, and attending closely to the answer.

I was recently doing a bit of market basket analysis with the aim of tweaking a set of cross-promotions, and it occurred to me that while what I was doing was probably going to be provably better than doing nothing, there was a whiff of elephant dung in the air.  Why?  I was assuming that people’s “natural” purchase patterns were a sensible first-pass model to follow in creating an effective promotion.    But, even abstracting away from issues of per-item net return, there is a basketload of interesting reasons why this might not be right:

  • there may be transient value in offer variation for its own sake, for basic perceptual reasons
  • I could be cannibalising a purchase which would have been made anyway
  • it may be more effective long term to promote an item which promotes exploration of a different category, which could broaden the customer’s basis for a relationship
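
For concreteness, the purchases-only baseline I’m describing is essentially a co-occurrence lift calculation along these lines (the baskets below are invented):

```python
from itertools import combinations
from collections import Counter

# Invented transactions: each is the set of items in one basket.
baskets = [{"tea", "milk"}, {"tea", "milk", "biscuits"}, {"coffee", "milk"},
           {"tea", "biscuits"}, {"coffee"}, {"tea", "milk"}]

n = len(baskets)
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(pair for basket in baskets
                      for pair in combinations(sorted(basket), 2))

# Lift > 1 means the pair co-occurs more often than independence would predict.
for (a, b), together in pair_counts.items():
    lift = (together / n) / ((item_counts[a] / n) * (item_counts[b] / n))
    print(f"{a} & {b}: lift = {lift:.2f}")
```

A lift comfortably above 1 says two items appear together more often than independence would predict, which is useful, but it answers none of the three caveats above.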

Purchases are not necessarily the best predictor of promotion success, although in some circumstances they may be good enough.  The right way to evaluate this relationship is empirically, and the surest way to evaluate it empirically is to test it explicitly.  Just as the relationship between purchases and promotions is an interesting one, the relationship between ratings-prediction success, recommendation success, and customer value needs a better-understood foundation, if the work is to be of the widest possible interest and application.

A live, test-driven recsys is in fact a good way to test out these relationships, and to identify which proxies make sense when.  Like the model, the live recsys is powered by information about the past.  It also tries to optimise results in the ongoing, unfolding present.   But the real system is different from the isolated model:  it immediately gets the fertile dirt of reality under its fingernails.   When it makes an offer, it gets feedback about that offer’s success.   That is a powerful basis for learning.   Human learning, machine learning, human learning about machine learning, you name it.   The ability to change the offer in response to results, in order to optimise both performance and the optimisation itself, creates a different focus for the system’s objectives and inputs.
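
As a bare-bones illustration of that offer, feedback, update loop (not anyone’s production system, and with invented acceptance rates), an epsilon-greedy recommender that scores items purely by their observed success rate already closes the loop:

```python
import random

items = ["book", "kettle", "subwoofer"]
shown = {item: 0 for item in items}      # how often each item was offered
accepted = {item: 0 for item in items}   # how often the offer succeeded

def recommend(epsilon: float = 0.1) -> str:
    """Mostly exploit the best-looking item, sometimes explore another."""
    if random.random() < epsilon or not any(shown.values()):
        return random.choice(items)
    return max(items, key=lambda i: accepted[i] / shown[i] if shown[i] else 0.0)

def feedback(item: str, took_offer: bool) -> None:
    """Fold the outcome of the offer straight back into the scores."""
    shown[item] += 1
    accepted[item] += took_offer

# Simulated customers with fixed (and invented) acceptance rates.
true_rates = {"book": 0.12, "kettle": 0.05, "subwoofer": 0.02}
for _ in range(5000):
    item = recommend()
    feedback(item, random.random() < true_rates[item])

print({i: round(accepted[i] / shown[i], 3) for i in items if shown[i]})
```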

I’m interested in the different ways in which the learning that results from live design variation can be driven by, and can inform, a systematic framework for discovery.  Why hill climb in the past, when you can set discovery challenges for the future?

Social graph marketing: I like my friends. But am I like them?

[Image: Subaru Six Stars, by istargazer via Flickr]

According to Facebook’s VP of Partnerships and Platform Marketing, Dan Rose, Facebook’s work with Nielsen shows that seeing your friends’ pictures next to a Facebook advertisement leads to “a 60% uptake in brand advertising value”.   I’m not exactly sure what an “uptake in brand advertising value” is, and I wasn’t at DLD11, where he made that remark, but the core phenomenon Rose was talking about isn’t news.   There’s already an aphorism for it dating back, apparently, to the 1500s: birds of a feather flock together.     (See also: opposites attract… ;-) )

The marketing bods’ version of the ‘birds of a feather’ hypothesis runs something like this:

People tend to like, and be friends with, people who are similar to them.   People who are similar to each other are similar in many ways, including having similar tastes.    Therefore, your friends are a good source of information about things you might like, not, as you might think, because of what they know about you, but purely because of what they themselves like.     Similarly, knowing what you like is a good predictor of what your friends like – not because you know them well and are sensitive to their needs, but simply because they are your friends, and therefore likely to be similar to you. 

Whatever forces are at work here, they are by no means all-powerful.  We have, I am sure, all given and received presents which are much better barometers of the giver’s likes than of the recipient’s.    I will spare you the details, but I recently received a Christmas present that drove this point home to me very strongly.

But an effect need not be infallible in order to be invaluable.    Do Facebook friends share attributes and preferences more than you’d expect by chance?  Or more than you’d expect if you knew, say, basic demographic and psychographic information, but didn’t know “friend” status?

To use yet another dodgy hair dye analogy, only Facebook knows for sure.   Facebook, with its knowledge of  its users’ friends, and its knowledge of users’ declared likes, offers a platform which seems tailor-made for exploring the strength, nature, and limits of personal network effects on preferences.   Facebook’s daily operations offer the potential for a large-scale real-time research playground programme of staggering scope and detail.    

Our tendency to be like our friends and our life partners in some ways is a well-documented phenomenon (pop “homophily”, “assortative mixing” or “assortative matching” into a search engine if you’d like a quick dip in the surf).   So is the fact that we tend to meet, interact with, and become friends with people who are physically close to us.  (Newcomb’s study of this phenomenon in the 1960s seems to have largely held up over time.)    However, people who are physically close to us may also have been effectively pre-sorted by the universe to share some of the demographic characteristics important for matching.    So it’s a case of “not only but also”.
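
A toy sketch of the kind of comparison this implies: do friend pairs agree on a preference more often than randomly re-paired people drawn from the same population? The graph and the likes below are entirely made up:

```python
import random

# Invented data: who likes the brand, and who is friends with whom.
likes = {"ann": True, "bob": True, "cat": False, "dan": False,
         "eve": True, "fay": False, "gil": True, "hal": False}
friend_pairs = [("ann", "bob"), ("ann", "eve"), ("cat", "dan"),
                ("eve", "fay"), ("gil", "bob"), ("hal", "dan")]

def agreement(pairs):
    """Fraction of pairs where both people feel the same way about the brand."""
    return sum(likes[a] == likes[b] for a, b in pairs) / len(pairs)

observed = agreement(friend_pairs)

# Baseline: keep the same people, but shuffle who is paired with whom.
people = list(likes)
shuffled = []
for _ in range(1000):
    random.shuffle(people)
    shuffled.append(agreement(list(zip(people[::2], people[1::2]))))
baseline = sum(shuffled) / len(shuffled)

print(f"friends agree {observed:.2f} of the time vs {baseline:.2f} for random pairs")
```

The real version would, of course, also control for demographics and propinquity before crediting friendship itself with any of the difference.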

Of course, we do not befriend everyone we have the opportunity to see and interact with frequently.    We can all think of examples, I’m sure, of people we see and interact with every day, who are not  currently  friends, and are unlikely to ever become friends.   No need to name names.   So propinquity, as proximity is sometimes called, is not the whole story.    And neither, of course,  is similarity.

Sit back and think for a moment.  Are you really like your friends?    And is that why you like them?   The answer, probably, is: yes, partly,  in some ways, and no, not always,  in others.   (Ah, the chill wind of common sense.)     Knowing when  friends are likely to be similar to each other in their tastes – and when they aren’t – could be very useful.    Ditto, some knowledge of how strong this effect is, in comparison to other predictive possibilities, helps us to think wisely about what it’s good for, and what it’s not.     But we don’t really know these things in a systematic way – yet.   There are lots of unexplored possibilities in this type of analysis, as well as a large and interesting set of relevant findings from marketing and sociology.  I hope to investigate these issues further in future posts.  For now, let’s just have a little chew on one example. 

I am a Subaru owner.  I believe that I caught this from my sister, who is a happy owner, having done a gruelling daily commute with hers for the last 10 Montreal winters.   I believe that I also passed the Scooby virus on to a friend, who just bought one partly on the strength of my sister’s happiness, and mine.   Contagiousness is highly visible in Subaru-ownership, because of its rarity.    If I bought a Ford, I wouldn’t necessarily be able to trace it back to any particular influence.    But being a Subaru-owner is a niche pleasure, particularly in the UK.    According to one source, only 0.3% of UK new car registrations in December 2010 were Subarus.    

Nonetheless, Subaru is gaining market share.   How?   According to a motor industry guru quoted in a recent Businessweek article,  “They are basically adding people who are Subaru buyers in their hearts, but don’t know it.”    Interesting… 

Although I am a happy Scooby owner, particularly when it is snowing, as it is at this very minute, I am not currently in the market for another Subaru.    (Just as I am Cohen-ed out at present.)    So there isn’t much point marketing Subarus to me.     

But what about my Facebook friends?   They are probably somewhat similar to me, in some ways, as they are my friends.  But they are definitely not similar to me in the sense that none of them own Subarus.  (The gal who bought a Subaru isn’t on Facebook.)  This is pretty much what you would expect, given the rarity of Subaru ownership and the small number of Facebook friends I have.   Even if being my Facebook friend increased your chances of owning a Subaru tenfold, the size of my friend pool simply isn’t big enough to demonstrate this effect conclusively.
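
To put rough numbers on that, take the 0.3% base rate above and suppose friendship really did mean a tenfold lift, to 3%. For modest friend-pool sizes (the ones below are invented), a count of zero owners is quite possible either way:

```python
p_base, p_friend = 0.003, 0.03   # 0.3% of buyers generally vs a tenfold lift among friends

for n_friends in (50, 150, 500):
    none_base = (1 - p_base) ** n_friends     # chance of zero owners with no lift at all
    none_friend = (1 - p_friend) ** n_friends # chance of zero owners even with the lift
    print(f"{n_friends} friends: P(no owners) = {none_base:.2f} without lift, "
          f"{none_friend:.2f} with a tenfold lift")
```

For a pool of a few dozen friends, zero owners is plausible with or without the lift; it takes hundreds of friends, or Facebook-scale data, before the absence of owners starts to mean much.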

But the interesting question, for Subaru (as well as others), is whether my Facebook friends are more susceptible to Subarus because they are my friends.  That is to say, are they more susceptible than random people, or than people with similar demographic and psychographic profiles?

Could my Facebook friends be, as the industry guru put it, “Subaru buyers in their hearts” who don’t know it yet?

I don’t know for sure, but my gut feel is some of them are.    That’s certainly the Great Hope of friendship marketing.   It’s possible that Facebook is, even now, figuring out the answer.   Whether friend testimonials work because of some underlying similarity between me and my friends, or because of the trust my friends have in my procurement capabilities, is very much an open question.   But an answerable one.

Similarly, Facebook is undoubtedly working hard on the question of what good my openly declared relationship with Leonard Cohen is as a predictor of my many other susceptibilities.     I’m sure that when they figure it out, they’ll tell me.   (Meanwhile, I’m open to suggestions.)