I've just watched a highly thought-provoking presentation by Gary King of Harvard, available at https://youtu.be/rBv39pK1iEs, on why propensity score matching should not be used to adjust for confounding in observational studies. The presentation makes great use of graphs to explain some of the issues with propensity score matching.

Gary's starting point is to compare completely randomised experiments, where treatment is assigned entirely randomly, with blocked/stratified randomised experiments. The former ensure that in expectation/on average, all potential confounders, measured and unmeasured, are balanced between treatment groups. The latter go further by balancing the treatment groups in the sample itself (i.e. not just in expectation) with respect to the variables used to block/stratify subjects. The statistical benefit of this design is improved precision of treatment effect estimates, because treatment group and the covariates used in the blocking are made completely orthogonal/independent in the sample. Moreover, when such independence is constructed, estimates of treatment effect are largely unaffected by how the analyst chooses to model the effect of the covariates, leading to a desirable lack of 'model dependence'.
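To make the precision argument concrete, here is a minimal simulation sketch. Everything in it is an assumption of mine rather than something from the presentation: a single binary covariate `x` that strongly predicts outcome, a true treatment effect of 1, and a simple difference in means as the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_per_block=50, n_reps=2000, effect=1.0):
    """Spread of difference-in-means estimates under two randomisation designs."""
    complete, blocked = [], []
    # Hypothetical binary covariate: first block has x = 0, second block has x = 1.
    x = np.repeat([0, 1], n_per_block)
    n = x.size
    for _ in range(n_reps):
        noise = rng.normal(size=n)
        # Completely randomised: half the sample is treated, ignoring x.
        t_complete = rng.permutation(np.repeat([0, 1], n // 2))
        # Blocked: within each level of x, exactly half are treated.
        t_blocked = np.concatenate(
            [rng.permutation(np.repeat([0, 1], n_per_block // 2)) for _ in range(2)]
        )
        for t, store in ((t_complete, complete), (t_blocked, blocked)):
            y = effect * t + 3.0 * x + noise
            store.append(y[t == 1].mean() - y[t == 0].mean())
    return np.array(complete), np.array(blocked)

complete, blocked = simulate()
print("mean estimate (complete, blocked):", complete.mean(), blocked.mean())
print("sd of estimate (complete, blocked):", complete.std(), blocked.std())
```

Under these assumptions both designs should be centred on the true effect, but the blocked design should show a visibly smaller standard deviation, because chance imbalance in `x` is removed by construction.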

Next, Gary explains how, as is well known from the original propensity score papers, if you take two individuals with identical values of the propensity score, they will not have identical covariate values, but in expectation the distribution of their covariates will be the same. As such, if you perform propensity score matching, you are attempting to reconstruct the completely randomised experiment, where covariates are balanced on average. In contrast, other matching approaches, e.g. matching based on distance metrics such as the Mahalanobis or Euclidean distance, do better because they attempt to mimic the blocked randomised trial, which as described earlier gives more precise treatment effect estimates. This is because they attempt to find matches which have similar values of all covariates, not just similar propensity scores.
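A tiny numerical illustration of this point, under an assumed logistic propensity model with two covariates and coefficients of 1 (the model and the three units are hypothetical, chosen only to make the arithmetic obvious):

```python
import numpy as np

def propensity(x, beta=np.array([1.0, 1.0])):
    """Hypothetical logistic propensity model: P(T = 1 | x) = expit(x @ beta)."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

a = np.array([2.0, -2.0])   # treated unit
b = np.array([-2.0, 2.0])   # control with the same propensity score as a
c = np.array([2.1, -1.9])   # control with almost the same covariates as a

ps_gap_ab = abs(propensity(a) - propensity(b))
ps_gap_ac = abs(propensity(a) - propensity(c))
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

print("propensity score gaps (a-b, a-c):", ps_gap_ab, ps_gap_ac)
print("Euclidean distances   (a-b, a-c):", dist_ab, dist_ac)
```

Here `b` is a perfect propensity score match for `a` despite sitting far away in covariate space, whereas a covariate distance metric would pick `c`.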

A follow-on point is that if you do match based on a covariate distance metric, you should scale the covariates before calculating the metric, using your a priori knowledge about the relative importance of each covariate's effect on outcome. I guess a drawback of this advice is that the analyst then has the non-trivial and potentially quite subjective task of deciding how to rescale each covariate.
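As a rough sketch of what such rescaling does in practice, the snippet below finds the nearest control under a weighted Euclidean distance. The covariates (age and blood pressure), the units, and the weights are all made up for illustration; the point is only that the chosen match depends on the analyst's scaling.

```python
import numpy as np

def nearest_control(treated, controls, weights):
    """Index of the control closest to `treated` after scaling each covariate."""
    diffs = (controls - treated) * weights
    return int(np.argmin(np.sqrt((diffs ** 2).sum(axis=1))))

# Hypothetical covariates: (age in years, systolic blood pressure in mmHg).
treated = np.array([50.0, 120.0])
controls = np.array([[51.0, 150.0],   # close in age, far in blood pressure
                     [60.0, 121.0]])  # far in age, close in blood pressure

equal = nearest_control(treated, controls, np.array([1.0, 1.0]))
age_heavy = nearest_control(treated, controls, np.array([5.0, 0.1]))

print("match with equal weights:", equal)
print("match when age is weighted heavily:", age_heavy)
```

With equal weights the second control is chosen; once age is judged far more important for outcome, the first control becomes the match.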

One issue with matching in this way is that as the number of covariates grows, the chance of finding matches who have similar values of all covariates rapidly goes to zero. In a footnote (page 16) of their accompanying paper, the authors acknowledge that the so-called curse of dimensionality affects every matching method, and state that propensity score matching doesn't solve this curse. One thought, however, is that since propensity score matching doesn't claim to match individuals such that they have identical (or near identical) covariate values, it somewhat sidesteps the problem by attempting to achieve a more limited goal.
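The effect of dimension on match quality is easy to see in a small simulation. This is only a sketch under assumptions of my own: independent standard normal covariates, 200 treated and 200 control units, and nearest-neighbour matching on Euclidean distance.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_nearest_neighbour_distance(n_treated, n_control, dim):
    """Average distance from each treated unit to its closest control."""
    treated = rng.normal(size=(n_treated, dim))
    control = rng.normal(size=(n_control, dim))
    # Pairwise Euclidean distances, shape (n_treated, n_control).
    d = np.linalg.norm(treated[:, None, :] - control[None, :, :], axis=2)
    return d.min(axis=1).mean()

distances = {dim: mean_nearest_neighbour_distance(200, 200, dim) for dim in (1, 5, 20)}
for dim, dist in distances.items():
    print(f"{dim:2d} covariates: mean distance to best match = {dist:.2f}")
```

With the sample size held fixed, the best available match should get steadily worse as covariates are added.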

One of the other key messages concerns 'the propensity score paradox'. To explain this, imagine that in the dataset treatment is almost independent, in the sample, of two potential confounders. In this case, the propensity score will hardly vary with the value of the two potential confounders. Matching treated subjects to untreated subjects using the propensity score then amounts to essentially randomly picking a control. On average, this randomly picked control will have covariate values further away from the treated subject's values than in the original full sample or if you were to match based on a covariate distance metric like Euclidean distance. As such, it is argued that propensity score matching can increase confounder imbalance, thereby leading to estimates of exposure effects with greater bias.
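The scenario can be mimicked with a short simulation. This is a sketch under my own assumptions, not a reproduction of the paper's analysis: two standard normal covariates, treatment assigned by a fair coin independently of them, a logistic propensity model fitted by Newton's method, and 1:1 nearest-neighbour matching with replacement.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_logistic(X, t, n_iter=25):
    """Fitted probabilities from a logistic regression of t on X (Newton's method)."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        grad = Z.T @ (t - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-(Z @ beta)))

n = 1000
X = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)          # treatment independent of X
ps = fit_logistic(X, t)

treated, control = X[t == 1], X[t == 0]
# Covariate distance between every treated unit and every control.
d = np.linalg.norm(treated[:, None, :] - control[None, :, :], axis=2)
rows = np.arange(len(treated))

idx_ps = np.abs(ps[t == 1][:, None] - ps[t == 0][None, :]).argmin(axis=1)
idx_euc = d.argmin(axis=1)
idx_rand = rng.integers(len(control), size=len(treated))

dist_ps = d[rows, idx_ps].mean()      # matched on estimated propensity score
dist_euc = d[rows, idx_euc].mean()    # matched on Euclidean distance
dist_rand = d[rows, idx_rand].mean()  # control picked at random

print("mean covariate distance within pairs")
print("  propensity score match:", dist_ps)
print("  Euclidean match:       ", dist_euc)
print("  random control:        ", dist_rand)
```

In runs like this, the propensity score matched pairs should sit much further apart in covariate space than the Euclidean matched pairs, which is the mechanism behind the claimed increase in imbalance.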

An important final point made is that the results do not necessarily imply problems with alternative approaches which use the propensity score, such as inverse weighting or regression adjustment.

If anyone thinks I have misunderstood/misrepresented anything (quite likely), please add a comment to clarify! Watching Gary King's presentation / reading the paper is highly recommended.

A draft of the accompanying paper can be viewed here http://gking.harvard.edu/files/gking/files/psnot.pdf

I can follow King's argument about propensity score matching; that is, it mimics a completely randomized design and not a stratified randomized design. The propensity score is the probability of being assigned to treatment, not a natural covariate. I would expect it to be only weakly associated with outcome, because it is treatment, not treatment assignment, that actually impacts outcome. So matching or conditioning on the PS is not like matching or conditioning on covariates.

What really confuses me is their conclusion about bias in the estimated treatment effect. I understand the part about imbalance increasing as random pruning increases. To me, it is like what we have in randomized trials with small sample sizes: we often observe differences in baseline covariates in small studies. However, I would expect that if we do the randomization correctly and can repeat the procedure again and again, the estimated average treatment effect (ATE) should still be consistent for the true treatment effect.

It's worth noting that King's paper examined the sample average treatment effect (SATE), not the ATE. I can understand "intuitively" that imbalance in one sample may cause bias in estimating the SATE. However, it is hard for me to picture what this SATE is, and why we should be interested in this sample quantity rather than the population quantity. Somehow the paper's title sounds as if propensity score matching is just totally wrong even for the ATE. That impression contradicts the conclusions of Rosenbaum's seminal paper.

I would appreciate it if you could share your thoughts on this PSM bias in estimating the SATE (or ATE).

Thanks for the link, I'll check it out. The thesis is eminently sensible. This is what I said in the second edition of Statistical Issues in Drug Development (2007):

"it is my view that the propensity score is both superfluous and misleading. Essentially my position is that there is nothing more and beyond analysis of covariance, which dictates that we must adjust for covariates because they are predictive of outcome not because they are predictive of assignment. Consider, for instance, a clinical trial in asthma with two strata at randomization according to whether patients are currently on steroids or not, and suppose that the outcome variable was FEV1, which one would usually analyse using a linear model. If the allocation ratio, which would most commonly be 1:1, were the same in the two strata then the treatment estimate would be the same whether or not one stratified the analysis. The propensity score would be the same for the two strata and this would suggest that according to that philosophy no adjustment was required. However, if as might plausibly be the case, FEV1 were different between the two strata, the standard error of the overall treatment estimate would be quite different depending on whether or not steroid use was in the model. The philosophy of analysis of covariance, which chooses factors that are predictive of outcome, would have it in the model. The propensity score philosophy, which chooses factors that are predictive of assignment, would not. In my view this is a mistake.

Thus, I do not accept that the propensity score is a useful alternative to analysis of covariance. In my opinion it will produce either similar or inferior inferences."
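The asthma example in the quotation can be checked with a small simulation. The numbers below (stratum sizes of 60 and 40, a treatment effect of 0.2 L, a 0.8 L difference in FEV1 between strata, residual SD of 0.4 L) are invented purely for illustration; only the 1:1 allocation within each stratum comes from the example itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two strata (steroid use: no / yes) with 1:1 allocation inside each stratum,
# so the propensity score is 0.5 in both.
stratum = np.repeat([0, 1], [60, 40])
treat = np.concatenate([np.repeat([0, 1], 30), np.repeat([0, 1], 20)])
# Hypothetical FEV1 in litres: treatment effect 0.2, stratum difference -0.8.
fev1 = 2.5 + 0.2 * treat - 0.8 * stratum + rng.normal(scale=0.4, size=100)

def ols(Z, y):
    """Least-squares coefficients and their model-based standard errors."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    return beta, np.sqrt(np.diag(sigma2 * np.linalg.inv(Z.T @ Z)))

ones = np.ones(100)
b_unadj, se_unadj = ols(np.column_stack([ones, treat]), fev1)
b_adj, se_adj = ols(np.column_stack([ones, treat, stratum]), fev1)

print("treatment estimate, unadjusted vs adjusted:", b_unadj[1], b_adj[1])
print("standard error,     unadjusted vs adjusted:", se_unadj[1], se_adj[1])
```

Because treatment is exactly balanced within strata, the two point estimates coincide, while the standard error should be noticeably smaller once steroid use is in the model.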

In fact, it was hearing Erika Graf many years ago (about 1990) give a penetrating discussion of the propensity score that woke me to the problem. Many years later we published a paper on the subject together with Angelika Caputo: Senn SJ, Graf E, Caputo A. Stratification for the propensity score compared with linear regression techniques to assess the effect of treatment or exposure. Statistics in Medicine 2007;26(30):5529-44.

I also highly recommend looking at Guo and Dawid: http://proceedings.mlr.press/v9/guo10a/guo10a.pdf