I've just watched a thought-provoking presentation by Gary King of Harvard, available here https://youtu.be/rBv39pK1iEs, on why propensity score matching should not be used to adjust for confounding in observational studies. The presentation makes great use of graphs to explain the concepts and the arguments behind some of the issues with propensity score matching.
Gary's starting point is to compare completely randomised experiments, where treatment is assigned entirely randomly, with blocked/stratified randomised experiments. The former ensure that in expectation/on average, all potential confounders, measured and unmeasured, are balanced between treatment groups. The latter go further by making the treatment groups balanced in the sample (i.e. not just in expectation) in respect of the variables used to block/stratify subjects. The statistical benefit of this design is improved precision of treatment effect estimates, because treatment group and the covariates used in the blocking are made completely orthogonal/independent in the sample. Moreover, when such independence is constructed, estimates of treatment effect are largely unaffected by how the analyst chooses to model the effect of the covariates, leading to a desirable lack of 'model dependence'.
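To make the distinction concrete, here is a minimal sketch in Python comparing the two designs on a single binary covariate. The function names, the covariate and the sample sizes are my own illustrative assumptions, not anything taken from the presentation, and complete randomisation is taken here to mean treating a randomly chosen half of the subjects.

```python
import random

def complete_randomisation(n, rng):
    # Treat a randomly chosen half of all subjects, ignoring covariates.
    treated = [False] * n
    for i in rng.sample(range(n), n // 2):
        treated[i] = True
    return treated

def blocked_randomisation(strata, rng):
    # Within each stratum, treat exactly half of the subjects
    # (assumes every stratum contains an even number of subjects).
    treated = [False] * len(strata)
    for level in set(strata):
        members = [i for i, s in enumerate(strata) if s == level]
        for i in rng.sample(members, len(members) // 2):
            treated[i] = True
    return treated

def imbalance(x, treated):
    # Absolute difference in the sample mean of x between the two groups.
    x1 = [v for v, t in zip(x, treated) if t]
    x0 = [v for v, t in zip(x, treated) if not t]
    return abs(sum(x1) / len(x1) - sum(x0) / len(x0))

rng = random.Random(1)
x = [0] * 20 + [1] * 20  # hypothetical binary covariate used for blocking

complete_imbalance = imbalance(x, complete_randomisation(len(x), rng))
blocked_imbalance = imbalance(x, blocked_randomisation(x, rng))
print(complete_imbalance, blocked_imbalance)
```

Under blocking the imbalance in the blocking covariate is zero in every sample by construction, whereas under complete randomisation it is only zero on average across repeated randomisations.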
Next, Gary explains how, as is well known from the original propensity score papers, if you take two individuals with identical values of the propensity score, they will not have identical covariate values, but in expectation the distribution of their covariates will be the same. As such, if you perform propensity score matching, you are attempting to reconstruct the completely randomised experiment, where covariates are balanced on average. In contrast, other matching approaches, e.g. matching based on distance metrics such as the Mahalanobis or Euclidean distance, do better because they attempt to mimic the blocked randomised trial, which as described earlier gives more precise treatment effect estimates. This is because they attempt to find matches which have similar values of all covariates, not just similar propensity scores.
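A toy illustration of this difference, as a sketch only: the propensity score function below is a made-up logistic function of the sum of two covariates (in practice it would have to be estimated), and the units and their values are hypothetical. Euclidean distance is used for simplicity in place of Mahalanobis distance.

```python
import math

def propensity(x):
    # Hypothetical, assumed-known propensity score: logistic in x1 + x2.
    return 1.0 / (1.0 + math.exp(-(x[0] + x[1] - 2.0)))

def euclidean(a, b):
    # Euclidean distance between two covariate vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest(treated_unit, controls, distance):
    # Name of the control minimising the supplied distance to the treated unit.
    return min(controls, key=lambda name: distance(treated_unit, controls[name]))

treated_unit = (1.0, 1.0)
controls = {
    "A": (2.0, 0.0),  # same propensity score as the treated unit, different covariates
    "B": (1.1, 1.0),  # slightly different propensity score, very similar covariates
}

ps_match = nearest(treated_unit, controls,
                   lambda a, b: abs(propensity(a) - propensity(b)))
cov_match = nearest(treated_unit, controls, euclidean)
print(ps_match, cov_match)
```

Here propensity score matching selects control A, whose score is identical but whose covariates are quite different, while matching on covariate distance selects control B.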
A follow-on point then made is that if you do match based on a covariate distance metric, you should scale the covariates before calculating the metric, based on your a priori knowledge about the relative importance of the covariates in their effects on outcome. I guess a drawback of this advice is that the analyst then has the non-trivial and potentially quite subjective task of deciding how to rescale each covariate.
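As a rough sketch of what such rescaling does, the Python snippet below multiplies each covariate difference by a weight before computing a Euclidean distance. The covariates (age and a binary indicator), the units and the weights are all hypothetical choices of mine, which is exactly the subjective step referred to above.

```python
import math

def scaled_distance(a, b, weights):
    # Euclidean distance after multiplying each covariate difference by its weight.
    return math.sqrt(sum((w * (ai - bi)) ** 2 for ai, bi, w in zip(a, b, weights)))

def nearest(treated_unit, controls, weights):
    # Name of the control closest to the treated unit under the given weights.
    return min(controls,
               key=lambda name: scaled_distance(treated_unit, controls[name], weights))

treated_unit = (50.0, 1.0)  # hypothetical (age in years, binary indicator)
controls = {"C1": (52.0, 1.0), "C2": (50.5, 0.0)}

unscaled_match = nearest(treated_unit, controls, (1.0, 1.0))
# Assumed prior belief: the indicator matters far more for outcome than a year of age.
scaled_match = nearest(treated_unit, controls, (0.1, 5.0))
print(unscaled_match, scaled_match)
```

With equal weights the closest control is C2, which differs on the indicator; once the indicator is weighted heavily, the match switches to C1, which agrees on the indicator but is two years older.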
One issue with matching in this way is that as the number of covariates grows, the chance of finding matches who have similar values of all covariates rapidly goes to zero. In a footnote (page 16) of the accompanying paper, the authors acknowledge that the so-called curse of dimensionality affects every matching method, and state that propensity score matching doesn't solve this curse. One thought however is that since propensity score matching doesn't claim to match individuals such that they have identical (or near identical) covariate values, it somewhat sidesteps the problem by attempting to achieve a more limited goal.
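The effect of dimension on covariate matching is easy to see in a small simulation. The sketch below is my own illustration, with assumed sample sizes and independent uniform covariates: it computes the average Euclidean distance from each treated unit to its nearest control as the number of covariates increases.

```python
import math
import random

def mean_nearest_distance(d, n_treated=50, n_control=100, seed=0):
    # Average distance from each treated unit to its closest control,
    # with d independent Uniform(0, 1) covariates (an assumed toy setup).
    rng = random.Random(seed)

    def draw():
        return [rng.random() for _ in range(d)]

    treated = [draw() for _ in range(n_treated)]
    controls = [draw() for _ in range(n_control)]
    nearest = [min(math.dist(t, c) for c in controls) for t in treated]
    return sum(nearest) / len(nearest)

results = {d: mean_nearest_distance(d) for d in (1, 5, 20)}
for d, value in results.items():
    print(d, round(value, 3))
```

Even the best available match moves steadily further away from the treated unit as covariates are added, with the sample size held fixed.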
One of the other key messages is regarding 'the propensity score paradox'. To explain this, imagine that in the dataset treatment is almost independent, in the sample, of two potential confounders. In this case, the propensity score will hardly vary with the value of the two potential confounders. Matching treated subjects to untreated subjects using the propensity score then amounts to essentially randomly picking a control. On average, this randomly picked control will have covariate values further away from the treated subject's values than in the original full sample, or than if you were to match based on a covariate distance metric like Euclidean distance. As such, it is argued that propensity score matching can increase confounder imbalance, thereby leading to estimates of exposure effects with greater bias.
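A small simulation sketch of this argument, under my own simplifying assumptions: two covariates are drawn from the same uniform distribution for treated and control subjects, so the true propensity score is constant and matching on it is represented here by picking a control completely at random (with replacement). This is compared with nearest-neighbour matching on Euclidean distance.

```python
import math
import random

rng = random.Random(42)

def draw_units(n):
    # Two covariates per subject, same distribution in both treatment groups.
    return [(rng.random(), rng.random()) for _ in range(n)]

treated = draw_units(100)
controls = draw_units(100)

def mean_pair_distance(pairs):
    # Average covariate distance between each treated unit and its matched control.
    return sum(math.dist(t, c) for t, c in pairs) / len(pairs)

# Stand-in for propensity score matching when the score carries no information.
random_pairs = [(t, rng.choice(controls)) for t in treated]
# Nearest-neighbour matching on Euclidean covariate distance.
nearest_pairs = [(t, min(controls, key=lambda c: math.dist(t, c))) for t in treated]

random_distance = mean_pair_distance(random_pairs)
nearest_distance = mean_pair_distance(nearest_pairs)
print(round(random_distance, 3), round(nearest_distance, 3))
```

The randomly chosen controls are, on average, much further from their treated subject in covariate space than the Euclidean matches, which is the mechanism behind the claimed increase in imbalance.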
An important final point made is that the results do not necessarily imply problems with alternative approaches which use the propensity score, such as inverse probability weighting or regression adjustment.
If anyone thinks I have misunderstood/misrepresented anything (quite likely), please add a comment to clarify! Watching Gary King's presentation / reading the paper is highly recommended.
A draft of the accompanying paper can be viewed here http://gking.harvard.edu/files/gking/files/psnot.pdf