When you estimate a proportion and want to calculate a standard error for the estimate, you would normally do so by assuming that the number of ‘successes’ in the sample is a draw from a binomial distribution, which counts the number of successes in a series of independent Bernoulli 0/1 draws, where each draw has probability $p$ of ‘success’. Does this rely on an assumption that the success probability is the same for each of these binary observations? In the third paragraph of this blog post, Frank Harrell (seems to) argue that it does. In this post I’ll delve into this a bit further, using the same numerical example Frank gives.
Suppose we have a random sample of n individuals on whom we observe a binary outcome indicating presence or absence of disease. Suppose that in a sample of n=100, 40 have the disease, and so our estimate of the proportion with disease in the population from which the sample was drawn (which I will denote $p$) is $\hat{p}=40/100=0.4$.
To estimate the variance of this sample proportion, the conventional variance estimator is $\hat{p}(1-\hat{p})/n$, where $\hat{p}$ is the sample proportion. This variance estimator is derived from the fact that if you observe a series of n independent identically distributed Bernoulli draws with success probability $p$, the number of ‘successes’ (here corresponding to individuals with disease) has variance $np(1-p)$, so that the sample proportion has variance $p(1-p)/n$. In our case, the variance estimator takes value $0.4 \times 0.6/100 = 0.0024$.
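For concreteness, here is a minimal R sketch of this marginal calculation using the example numbers (the object names are mine, not from Frank's post):

n <- 100
d <- 40                          # number with disease
phat <- d/n                      # estimated proportion, 0.4
var_marg <- phat*(1-phat)/n      # conventional variance estimator, 0.0024
sqrt(var_marg)                   # corresponding standard error, approx 0.049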
Now suppose that we are given information on the sex of the 100 individuals: 40 were female and 60 were male. Among the 40 females, 10 had the disease, and among the 60 males, 30 had the disease. Given this information, we might think to use the sex information to estimate the overall proportion with the disease, by calculating the sex-specific proportions and weighting them by the sample proportions of females and males:

$0.4 \times \frac{10}{40} + 0.6 \times \frac{30}{60} = 0.4 \times 0.25 + 0.6 \times 0.5 = 0.4$

This is identical to the estimate that ignored sex, and will always be so provided we use the sample proportions of females/males as the weights.
The estimated variance based on this, which accounts for sex, is (according to Frank's post):

$0.4^2 \times \frac{0.25 \times 0.75}{40} + 0.6^2 \times \frac{0.5 \times 0.5}{60} = 0.00075 + 0.0015 = 0.00225$
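A minimal R sketch of this sex-stratified calculation, again using the example numbers (variable names are mine):

pi_f <- 40/100                   # sample proportion female
p_f <- 10/40                     # disease proportion among females, 0.25
p_m <- 30/60                     # disease proportion among males, 0.5
pi_f*p_f + (1-pi_f)*p_m          # weighted point estimate, 0.4, identical to the marginal one
pi_f^2 * p_f*(1-p_f)/40 + (1-pi_f)^2 * p_m*(1-p_m)/60   # stratified variance estimator, 0.00225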
Frank explains that the first variance estimator assumes the disease probability is the same for females and males, ‘i.e., assuming risk for females = risk for males’, whereas
the correct variance (i.e. the 0.00225) recognizes that males and females do not have the same outcome probabilities. So marginal stratified/adjusted estimation corrects the mistake of using the wrong variance formulas when computing crude marginal estimates
and
Any time you see $p(1-p)/n$ for the variance of a proportion, remember that this formula assumes that p applies equally to all n subjects.
Given that there will always be heterogeneity between individuals due to various measured and unmeasured factors, it would be worrying if this were true. I do not think it is though.
A small simulation investigation
To investigate further, let’s first run a simulation in R. We take samples of size n=100 from the population. I will take the population proportion of females to be 40%, as per the sample values in the example above, and I will take the true sex-specific disease proportions to be as in the example above (0.25 for females and 0.5 for males). The following code simulates 1,000,000 such samples, calculates $\hat{p}$ and the two variance estimators described above, and then compares the empirical variance of $\hat{p}$ with the mean of each of the two variance estimators:
set.seed(72424)
n <- 100
nSim <- 1000000
phat <- array(0, dim=nSim)
varest1 <- array(0, dim=nSim)
varest2 <- array(0, dim=nSim)
for (i in 1:nSim) {
  # simulate sex, with each individual female with probability 0.4
  female <- rbinom(n, size=1, prob=0.4)
  # sex-specific disease probabilities: 0.25 for females, 0.5 for males
  probd <- 0.25*female + 0.5*(1-female)
  y <- rbinom(n, size=1, prob=probd)
  phat[i] <- mean(y)
  # marginal variance estimator, ignoring sex
  varest1[i] <- (mean(y)*(1-mean(y)))/n
  # sex-stratified variance estimator
  varest2[i] <- (mean(female)^2)*(mean(y[female==1])*(1-mean(y[female==1])))/sum(female) +
    ((1-mean(female))^2)*(mean(y[female==0])*(1-mean(y[female==0])))/(n-sum(female))
}
mean(phat)
var(phat)
mean(varest1)
mean(varest2)
Empirical variance of $\hat{p}$: 0.002403904
Mean of first, marginal variance estimator: 0.002375905
Mean of stratified variance estimator: 0.002206166
The first, marginal variance estimator has a mean in repeated samples slightly below the empirical variance, while the second is (proportionately) quite a bit smaller than the empirical variance. To explore further, we now re-run the simulation, but modify the way the female variable is generated. Specifically, we now fix there to be exactly 40 females and 60 males in each sample:
set.seed(72424)
n <- 100
nSim <- 1000000
phat <- array(0, dim=nSim)
varest1 <- array(0, dim=nSim)
varest2 <- array(0, dim=nSim)
for (i in 1:nSim) {
  # now exactly 40% female in every sample
  female <- c(rep(1,n*0.4),rep(0,n*0.6))
  probd <- 0.25*female + 0.5*(1-female)
  y <- rbinom(n, size=1, prob=probd)
  phat[i] <- mean(y)
  varest1[i] <- (mean(y)*(1-mean(y)))/n
  varest2[i] <- (mean(female)^2)*(mean(y[female==1])*(1-mean(y[female==1])))/sum(female) +
    ((1-mean(female))^2)*(mean(y[female==0])*(1-mean(y[female==0])))/(n-sum(female))
}
mean(phat)
var(phat)
mean(varest1)
mean(varest2)
The results of this slightly modified simulation are:
Empirical variance of $\hat{p}$: 0.002245051
Mean of first, marginal variance estimator: 0.002377571
Mean of stratified variance estimator: 0.002206344
The main change from the first set of results is the empirical variance of $\hat{p}$, which is now closer to the (mean of the) second, sex-stratified variance estimator. Thus the first variance estimator, which ignores sex, seems better when the sex of the individuals in the sample is random, whereas the sex-stratified variance estimator seems better when the proportion female/male is fixed in repeated samples.
Understanding the results analytically
The simulations above are arguably not that convincing. To understand what’s going on analytically, let $\hat{p}_f$ and $\hat{p}_m$ denote the estimators of the disease proportion in females and males, and let $\hat{\pi}$ denote the sample proportion of females. The estimator for the overall proportion with disease (remembering the marginal and sex-specific estimators were identical) can be written as

$\hat{p} = \hat{\pi}\hat{p}_f + (1-\hat{\pi})\hat{p}_m$
To find the variance of this, we can appeal to the law of total variance, which yields

$Var(\hat{p}) = Var\{E(\hat{p} \mid \hat{\pi})\} + E\{Var(\hat{p} \mid \hat{\pi})\}$
The variance inside the second term here is

$Var(\hat{p} \mid \hat{\pi}) = \hat{\pi}^2 Var(\hat{p}_f \mid \hat{\pi}) + (1-\hat{\pi})^2 Var(\hat{p}_m \mid \hat{\pi})$
Replacing the $Var(\hat{p}_f \mid \hat{\pi})$ and $Var(\hat{p}_m \mid \hat{\pi})$ terms with estimates of these using the standard variance formula for a proportion, we have

$\hat{\pi}^2 \frac{\hat{p}_f(1-\hat{p}_f)}{n\hat{\pi}} + (1-\hat{\pi})^2 \frac{\hat{p}_m(1-\hat{p}_m)}{n(1-\hat{\pi})}$
where we use the fact that the number of females in the sample is $n\hat{\pi}$ and the number of males in the sample is $n(1-\hat{\pi})$. The expression derived in the above line corresponds to the second, sex-stratified variance estimator we used earlier, and the derivations we have performed to arrive at it make clear that it treats the proportion female/male as fixed in repeated sampling. When this is true, it is a reasonable variance estimator to use. But if in repeated samples the sex ratio is not fixed, but is random, consistent with simple random sampling of individuals from the population, then we must also account for the first term in the total variance, $Var\{E(\hat{p} \mid \hat{\pi})\}$. Conditional on $\hat{\pi}$, the expectation of $\hat{p}$ is

$E(\hat{p} \mid \hat{\pi}) = \hat{\pi} p_f + (1-\hat{\pi}) p_m$

where $p_f$ and $p_m$ denote the true/population disease proportions in females and males. The first term in the law of total variance expansion is then

$Var\{E(\hat{p} \mid \hat{\pi})\} = (p_f - p_m)^2 Var(\hat{\pi}) = (p_f - p_m)^2 \frac{\pi(1-\pi)}{n}$

where $\pi$ denotes the population proportion of females.
We could estimate this quantity by replacing unknowns by sample estimates, and combining with our estimate of the second component of the law of total variance expansion, we have

$(\hat{p}_f - \hat{p}_m)^2 \frac{\hat{\pi}(1-\hat{\pi})}{n} + \hat{\pi}^2 \frac{\hat{p}_f(1-\hat{p}_f)}{n\hat{\pi}} + (1-\hat{\pi})^2 \frac{\hat{p}_m(1-\hat{p}_m)}{n(1-\hat{\pi})}$
Some algebra shows that the preceding quantity is in fact identical to $\hat{p}(1-\hat{p})/n$, the variance estimator which ignores sex completely.
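Rather than reproduce the algebra here, a quick numerical sanity check in R (not a proof) confirms that the two expressions agree for the example values used throughout this post:

# numerical check of the identity using the example values
n <- 100
pihat <- 0.4                     # sample proportion female
pf <- 0.25                       # disease proportion among females
pm <- 0.5                        # disease proportion among males
phat <- pihat*pf + (1-pihat)*pm
# second component: the sex-stratified variance estimator
strat <- pihat^2 * pf*(1-pf)/(n*pihat) + (1-pihat)^2 * pm*(1-pm)/(n*(1-pihat))
# estimate of the first component of the law of total variance
between <- (pf-pm)^2 * pihat*(1-pihat)/n
strat + between                  # 0.0024
phat*(1-phat)/n                  # 0.0024, identical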
In conclusion, the appropriate variance/SE formula depends on the sampling process used. The one that ignores sex is appropriate if simple random sampling was used, whereas the sex-stratified one is appropriate if the sampling was performed such that the sex ratio in the sample was fixed at a particular value. The choice follows from the sampling scheme used to collect the data, and not from whether an individual’s risk of getting/having disease is the same as or different from another’s.
If you spot a mistake in the above, or disagree, please leave a comment.
An issue is whether the probability distribution describing the mixture of sample sex ratios that can occur can be used for inference once the actual sex ratio has been observed. If the population sex ratio is known, some might argue it cannot be.
This is analogous to the randomisation case where ANCOVA uses the observed covariate distribution rather than ignoring it because of randomisation.
Within the sampling community this has been a lively debate since at least the 1970s with Richard Royall a prominent critic of approaches that ignore the covariate (sex) in your example.
So you have raised some interesting issues, which I shall think about some more. At the moment my gut feeling is that you are right when nothing else is known, because a probability of a probability is a probability, but I am not sure the argument survives once other things are observed.
Thought provoking as usual, thanks Jonathan! There’s an interesting, perhaps obvious, parallel here to stratified randomisation and covariate adjustment.
– If we sample fixed proportions (parallel to stratified randomisation), as in your second simulation study, we fix the variance of the point estimator, so that the stratified variance estimator is appropriate (it doesn’t matter that your point estimator ignored the stratifying covariate). The marginal variance estimator is conservative in this case. This is similar to what we see with stratified randomisation.
– If instead we draw a random sample without fixing the proportions (parallel to simple randomisation), as in your first simulation study, we could ignore the stratifying variable and the expectation of the unstratified variance estimator would be equal to the variance of our point estimator, as you show.
– I thought your point about point estimation accounting for the stratifying variable was where the parallel breaks down. In the context of covariate adjustment, we can use covariate adjustment after simple randomisation to reduce the variance of the point estimator. In the case you discuss above, the point estimator in simulation study 1 is mean(y), but using (mean(female)*mean(y[female==1]))+((1-mean(female))*mean(y[female==0])) is identical.
– Perhaps this is a better parallel. In your first simulation study above, if we knew the population mean(female)=0.4, and chose not to fix the proportions sampled, we could similarly reduce the variance of the point estimator by using that known true proportion rather than the observed one. This does indeed reduce the variance of the point estimator (see below, phat3). It’s now much closer to the stratified one but (I think) slightly higher (I didn’t really think about varest3 but, below, var(phat3)≈mean(varest3)).
set.seed(72424)
n <- 100
nSim <- 100000
phat1 <- array(0, dim=nSim)
phat2 <- array(0, dim=nSim)
phat3 <- array(0, dim=nSim)
varest1 <- array(0, dim=nSim)
varest2 <- array(0, dim=nSim)
varest3 <- array(0, dim=nSim)
for (i in 1:nSim) {
  female <- rbinom(n, size=1, prob=0.4)
  probd <- 0.25*female + 0.5*(1-female)
  y <- rbinom(n, size=1, prob=probd)
  phat1[i] <- mean(y)
  phat2[i] <- (mean(female)*mean(y[female==1]))+((1-mean(female))*mean(y[female==0])) # observed proportions
  phat3[i] <- (.4*mean(y[female==1]))+(.6*mean(y[female==0])) # true proportions
  varest1[i] <- (mean(y)*(1-mean(y)))/n
  varest2[i] <- (mean(female)^2)*(mean(y[female==1])*(1-mean(y[female==1])))/sum(female) +
    ((1-mean(female))^2)*(mean(y[female==0])*(1-mean(y[female==0])))/(n-sum(female))
  varest3[i] <- (.4^2)*(mean(y[female==1])*(1-mean(y[female==1])))/sum(female) +
    (.6^2)*(mean(y[female==0])*(1-mean(y[female==0])))/(n-sum(female)) # true proportions
}
var(phat1)
var(phat2)
var(phat3)
mean(varest1)
mean(varest2)
mean(varest3)