Recommendation systems: the elephant in the room

I’ve been catching up with recommendation systems after a 10-year break, and am having trouble getting my head around the conventions of research practice in the area.  There is an elephant in the room, and it’s a big one.  I stubbed my toe pretty quickly on it.

One thing I’m having trouble with is the seemingly baked-into-the-DNA assumption that a recsys is something that predicts performance on a historical test set of data, given initial access to a training set of data.   Typically, the data being analysed is rating data, where people have rated a relatively small set of discrete consumer items, taken from a large and diverse product space.  The Netflix Prize is the most famous example, and you won’t have to go far to find a hackathon (or Kaggle competition)  that is making racks and racks of machines sweat doing something similar.
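To make that convention concrete, here is a minimal sketch of the sandbox loop: split historical ratings into training and test sets, fit something on the first, and score it on the second. The function name `holdout_rmse`, the `(user, item, rating)` triple format, and the per-item-mean predictor are all assumptions chosen for illustration, not anyone's actual competition entry.

```python
import math
import random


def holdout_rmse(ratings, test_fraction=0.2, seed=0):
    """Score a per-item-mean baseline on a held-out slice of historical ratings.

    `ratings` is a list of (user, item, rating) triples (hypothetical format).
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)

    # Hold out a fraction of the historical data as the "test set".
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]

    # "Training": the global mean, plus a running (sum, count) per item.
    global_mean = sum(r for _, _, r in train) / len(train)
    item_totals = {}
    for _, item, r in train:
        total, count = item_totals.get(item, (0.0, 0))
        item_totals[item] = (total + r, count + 1)

    def predict(item):
        # Fall back to the global mean for items unseen in training.
        if item in item_totals:
            total, count = item_totals[item]
            return total / count
        return global_mean

    # Root mean squared error on the held-out ratings.
    squared_error = sum((predict(item) - r) ** 2 for _, item, r in test)
    return math.sqrt(squared_error / len(test))
```

Everything this loop can tell you is about ratings people already chose to give. Nothing in it ever shows anyone a recommendation.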

There are powerful pragmatic reasons why playing with historical data in a sandbox is a popular methodology.   First and foremost: not everyone owns a massive live consumer platform they can tweak around with.  But what you do not get, by playing in the sandbox of the past, is the opportunity to create interventions which help to answer the questions the data to date hasn’t.  (And if you’re still with me after that sentence – rejoice! It’s over!)

The assumption that recsys practice can be significantly advanced by fitting historical data is, on the surface, highly pragmatic.  But it’s flawed.  At the very least I need some means – ideally conceptual but black box would be ok – by which to bridge the gap between this research practice, and design implications for live recsys.   And it’s not clear what that bridge looks like, without actually asking that question of the world, and attending closely to the answer.

I was recently doing a bit of market basket analysis with the aim of tweaking a set of cross promotions, and it occurred to me that while what I was doing was probably going to be provably better than doing nothing, there was a whiff of elephant dung in the air.  Why?  I was assuming that people’s “natural”  purchase patterns were a sensible first-pass model to follow, to create an effective promotion.    But, even abstracting away from issues of per-item net return, there are a basketload of interesting reasons why this might not be right:

  • there may be transient value in offer variation for its own sake, for basic perceptual reasons
  • I could be cannibalising a purchase which would have been made anyway
  • it may be more effective long term to promote an item which promotes exploration of a different category, which could broaden the customer’s basis for a relationship
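For reference, the "natural purchase pattern" signal in this kind of market basket analysis is often summarised as pairwise lift: how much more often two items are bought together than independence would predict. The sketch below is an assumption about what such an analysis might look like, not the one I actually ran; `pair_lift` and the list-of-baskets input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair seen together in a basket.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the pair co-occurs
    more often than independent purchases would predict.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }
```

A high lift says the pair already sells together. It is silent on every point in the list above: novelty, cannibalisation, and long-term exploration are all invisible to it.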

Purchases are not necessarily the best predictor of promotion success, although in some circumstances they may be good enough.  The right way to evaluate this relationship is empirically.  The surest way to evaluate it empirically is to test it explicitly.  Just as the relationship between purchases and promotions is an interesting one, the relationship between ratings prediction success, recommendation success, and customer value needs a better understood foundation, in order for the work to be of widest possible interest and application.
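As a rough illustration of what "test it explicitly" can mean, the sketch below compares acceptance rates between a control promotion and a variant using a standard two-proportion z-test. The function name and its inputs are assumptions made for the example, and a real experiment would also need to think about sample size, duration, and what counts as success.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in acceptance rate between A and B.

    Returns (z, p_value). Positive z means B converted better than A.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b

    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (every offer accepted, or none): nothing to test.
        return 0.0, 1.0

    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The point is not the statistics, which are textbook, but that the numbers going in come from offers actually made to customers rather than from purchases they would have made regardless.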

A live, test-driven recsys is in fact a good way to test out these relationships, and identify which proxies make sense when.  Like the model, the live recsys is powered by information about the past.  It also tries to optimise results in the ongoing unfolding present.  But the real system is different from the isolated model: it immediately gets the fertile dirt of reality under its fingernails.  When it makes an offer, it gets feedback about that offer’s success.  That is a powerful basis for learning.  Human learning, machine learning, human learning about machine learning, you name it.  The ability to change the offer in response to results, in order to optimise both performance and the optimisation itself, creates a different focus for the system objectives and inputs.
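One simple way to picture that offer-and-feedback loop is an epsilon-greedy policy: mostly show the offer with the best observed acceptance rate, but occasionally try something else to keep learning. This is a sketch under stated assumptions, not a description of any particular production system; the class `EpsilonGreedyOffers`, its method names, and the default `epsilon` are all made up for illustration.

```python
import random


class EpsilonGreedyOffers:
    """Pick offers, record feedback, and adapt future picks to the results."""

    def __init__(self, offers, epsilon=0.1, seed=0):
        self.offers = list(offers)
        self.epsilon = epsilon  # fraction of choices spent exploring
        self.rng = random.Random(seed)
        self.shown = {offer: 0 for offer in self.offers}
        self.accepted = {offer: 0 for offer in self.offers}

    def acceptance_rate(self, offer):
        shown = self.shown[offer]
        return self.accepted[offer] / shown if shown else 0.0

    def choose(self):
        # Show every offer at least once before trusting any estimate.
        for offer in self.offers:
            if self.shown[offer] == 0:
                return offer
        # Explore: occasionally pick at random to keep gathering evidence.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.offers)
        # Exploit: otherwise show the best performer so far.
        return max(self.offers, key=self.acceptance_rate)

    def record(self, offer, was_accepted):
        # Feedback from the live system updates the estimates immediately.
        self.shown[offer] += 1
        self.accepted[offer] += int(bool(was_accepted))
```

In use, the live system would call `choose()` when it needs an offer and `record(offer, was_accepted)` once the customer responds. Unlike the historical sandbox, every call generates new data about a question the system itself chose to ask.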

I’m interested in the different ways in which the learning that results from live design variation can be driven by, and can inform, a systematic framework for discovery.  Why hill climb in the past, when you can set discovery challenges for the future?