Why understand matching estimators?
What piqued my interest in matching estimators was Christophe Safferling from Ubisoft’s talk at a recent Games Industry Analytics Forum event. He talked about using matching estimators as an alternative to A/B testing, and showed how Ubisoft’s insights into the impact of a new, additional payment option were transformed by looking at the problem in the right way. This made me realise I should probably understand what he was saying better than I did, and that is the reason I’m writing this post. Trying to explain something is a great way of trying to understand it.
Christophe motivated the discussion by explaining that although A/B testing is usually the best way to assess the impact of a design change, one of the problems with it is that it isn’t always possible to do it. How a design moves forward, in practice, is often via the addition of a shiny new feature that is presented to the user as an optional extra. Once the addition is implemented, we then need to understand whether or not it is an improvement. To do that, we observe what becomes of the people who engage with it, compared to the people who did not.
The difficulty with this approach lies in what to make of the result. Which is where matching estimators come in. (And jellybeans.)
Why use jellybeans?
Personally, I find jellybeans a useful formalism. But the risk of beans going missing during a calculation can be reduced by using different types of beans.
One way to understand the pitfalls involved in interpretation of outcomes from self-selected groupings is to contrast it with the interpretation of outcomes from A/B testing.
A/B and multivariate testing typically employ random assignment of people to experimental treatments (e.g. varying something about the game design) alongside a control group. You then observe the outcomes in the different groups. When the treatment and control groups show different outcomes, it’s relatively straightforward to figure out what that means.
Seeing differences does not necessarily mean they are due to the different treatments the groups received: the observed differences could instead be due to inherent population variation or measurement error. There are well understood ways of assessing this. You should give your favourite statistician a jellybean and observe the result. This flow is shown in Figure 1.
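To make the randomised flow concrete, here is a minimal sketch of it in Python. All the numbers and names are invented for illustration: hypothetical players with a baseline engagement score, a coin-flip assignment, and a quick difference-in-means comparison with its standard error.

```python
import random
import statistics

random.seed(42)

# Hypothetical players: each has a baseline "engagement" score.
players = [random.gauss(10, 2) for _ in range(1000)]

# Random assignment: a coin flip decides treatment vs control,
# so the two groups differ only by chance (and by the treatment).
treatment, control = [], []
for score in players:
    if random.random() < 0.5:
        treatment.append(score + 0.5)  # assume the feature adds ~0.5
    else:
        control.append(score)

diff = statistics.mean(treatment) - statistics.mean(control)
# Standard error of the difference in means (Welch-style),
# a rough way to ask "could this gap be chance alone?"
se = (statistics.variance(treatment) / len(treatment)
      + statistics.variance(control) / len(control)) ** 0.5
print(f"difference in means: {diff:.2f} (approx z = {diff / se:.1f})")
```

Because assignment was random, the gap in means can be read as the effect of the treatment, up to sampling noise — which is exactly the interpretation that self-selected groups do not license.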
This is classic stuff. It’s what Google, Amazon, Facebook, and the other big platform providers do a lot of – every day.
But it’s not necessarily an appropriate way of analysing outcomes when people have self-selected into different groups.
What happens with self-selecting groups is similar in some ways to what happens in a randomised trial: the groups get different experiences, and the outcome is observed. But the interpretation of the result isn’t as straightforward. The complication is that the groups might be different because of characteristics of people who make one choice rather than the other, rather than because of the consequence of the choice itself.
In the example in Figure 2, we can see that the people who chose to “sparkle” are different from those who didn’t. So it’s not obvious whether the outcome is driven by that attribute, by the effect of the treatment, or by a combination of the two.
The core mechanic of the matching estimator method for evaluating the effect of a self-selected treatment condition is to match each individual in the treatment group with a best-matching individual (or composite individual) in the control group, and then compare these matched pairs using established statistical methods for matched samples. There are a huge number of variations and refinements on this theme but they all share this basic idea of finding an appropriate match for each individual in the treatment group.
This method can reveal differences that are due to the difference in experience, rather than self-selection, because it tries to select, post-hoc, a control group which very closely matches the characteristics of the self-selected group. In the case of the jellybeans, a matching estimator could be constructed by comparing the yellow beans in each group.
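The basic mechanic can be sketched in a few lines. This is a deliberately tiny nearest-neighbour match on a single covariate, with invented data: each treated individual is paired with the control whose pre-treatment spend is closest, and the matched differences are averaged to estimate the effect on the treated.

```python
# One-covariate nearest-neighbour matching sketch; data are invented.
treated = [  # (pre-treatment spend, post-treatment spend)
    (5.0, 9.0), (12.0, 15.0), (8.0, 11.0),
]
controls = [
    (4.5, 6.0), (11.0, 12.5), (8.5, 9.5), (20.0, 22.0),
]

def nearest(covariate, pool):
    """Return the control whose covariate is closest (with replacement)."""
    return min(pool, key=lambda c: abs(c[0] - covariate))

# Match each treated individual, then average the matched differences:
# an estimate of the average treatment effect on the treated (ATT).
diffs = []
for pre, post in treated:
    match = nearest(pre, controls)
    diffs.append(post - match[1])

att = sum(diffs) / len(diffs)
print(f"matched-pairs ATT estimate: {att:.2f}")
```

Note the unmatched high-spending control (20.0) simply never gets used — matching quietly discards controls that don’t resemble anyone in the treatment group.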
The devil in the detail of using matching estimators lies in choosing which of many possible attributes to consider, and which similarity functions to use, when assessing similarity and finding a match. Even a relatively simple entity such as a jellybean has a lot of attributes which could be a potential basis for a match, as you can see in Figure 3. And people are much more complicated than jellybeans.
This is where matching algorithms such as the Mahalanobis matching mentioned by Christophe come into play: Mahalanobis distance is one of the many different similarity metrics and similarity-revealing analytical methods that can be used as a basis for finding the most useful match.
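For intuition, here is a pure-Python sketch of the Mahalanobis distance between two individuals described by two attributes. The idea is to rescale each attribute by the sample covariance, so that a one-unit gap in a high-variance attribute counts for less than the same gap in a low-variance one. The data points are made up.

```python
# Mahalanobis distance for two attributes, using the sample covariance.
data = [
    (1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (4.0, 15.0), (5.0, 14.0),
]

def mean(xs):
    return sum(xs) / len(xs)

def covariance_2x2(points):
    """Sample covariance of a list of (x, y) pairs: (sxx, sxy, syy)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points) - 1
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx, sxy, syy

def mahalanobis(a, b, cov):
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    # Inverse of the 2x2 covariance matrix
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dx, dy = a[0] - b[0], a[1] - b[1]
    d2 = dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy
    return d2 ** 0.5

cov = covariance_2x2(data)
d = mahalanobis((1.0, 10.0), (2.0, 12.0), cov)
print(f"Mahalanobis distance: {d:.3f}")
```

In real use you would not hand-roll this — libraries such as SciPy provide it — but the point is that “closest match” always depends on a choice of metric, and Mahalanobis is a covariance-aware one.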
As to the question of selecting which attributes to match on, there is conflicting guidance about whether the attributes used for matching should be strong covariates of the treatment assignment or of the outcome measure. But there is agreement that the attributes used should not themselves be affected by the treatment, should not be perfectly predictive of the treatment, and should have a similar or at least overlapping distribution in both the treatment and the control group.
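That last requirement — overlapping distributions in both groups — is easy to check. Here is a minimal common-support sketch on one hypothetical attribute (the ages are invented): treated individuals outside the region where the two distributions overlap have no comparable control, and are usually dropped before matching.

```python
# Common-support check on a single attribute; numbers are invented.
treated_age = [22, 25, 31, 28, 35]
control_age = [24, 27, 30, 40, 33]

t_lo, t_hi = min(treated_age), max(treated_age)
c_lo, c_hi = min(control_age), max(control_age)

# The overlap ("common support") is where both groups have observations.
overlap_lo, overlap_hi = max(t_lo, c_lo), min(t_hi, c_hi)
has_overlap = overlap_lo <= overlap_hi
if has_overlap:
    print(f"common support: [{overlap_lo}, {overlap_hi}]")
else:
    print("no common support on this attribute")

# Treated individuals off the common support cannot be matched credibly.
off_support = [a for a in treated_age
               if not overlap_lo <= a <= overlap_hi]
print(f"{len(off_support)} treated individual(s) off support")
```

An attribute that is perfectly predictive of treatment would have no overlap at all, which is one way to see why such attributes are ruled out.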
Horses for courses
Matching estimators are a useful, established technique which is used extensively in programme evaluation and econometrics. There is a big literature around the topic. If you want to know a bit more, I’d recommend starting out, as I did, by looking at the chapter by Professor Petra Todd in Palgrave’s Dictionary of Economics.
Sometimes what is most interesting about the effect of an extra option is the way the option acts to filter people into groups with different characteristics. In this case, abstracting away from differences between groups is precisely not what you want. Matching the investigative technique to what’s interesting is an important form of matching, too. When doing analytical work it is important to stop and smell the flowers down the garden path.
If you want to know even more, here are some online lecture notes I’ve found useful for exploring this topic:
- Guido Imbens, Stanford, talking at UCL
- Michael Roberts, Wharton, from his course on Empirical Methods in Corporate Finance
- Scott Rozelle, Stanford, from his course at the LICOS Centre for Institutions and Economic Performance
- Elizabeth Stuart, Johns Hopkins, lecture for the Society for Prevention Research
- Jeff Wooldridge, Michigan, from the International Institute of Labour course in Microeconomics
If I need to know more, I think Paul Rosenbaum’s book on the Design of Observational Studies will be my next stop.
Two good sources I didn’t see mentioned are “Mostly Harmless Econometrics” and “Propensity Score Matching”. The issue I don’t see covered in many of these is “Once I’ve identified a decent matching algorithm, it will have residuals. How do I account for this error in my subsequent model of the outcome variable?”
I suppose what I would do to account for the residuals is ask you 🙂