### Effective Sample Size

#### Posted by Tom Leinster

On a scale of 0 to 10, how much does the average citizen of the Republic of Elbonia trust the president?

You’re conducting a survey to find out, and you’ve calculated that in order to get the precision you want, you’re going to need a sample of 100 statistically independent individuals. Now you have to decide how to do this.

You could stand in the central square of the capital city and survey the next 100 people who walk by. But these opinions won’t be independent: probably politics in the capital isn’t representative of politics in Elbonia as a whole.

So you consider travelling to 100 different locations in the country and asking one Elbonian at each. But apart from anything else, this is far too expensive for you to do.

Maybe a compromise would be OK. You could go to 10 locations and ask… 20 people at each? 30? How many would you need in order to match the precision of 100 independent individuals — to have an “effective sample size” of 100?

The answer turns out to be closely connected to a quantity I’ve written about many times before: magnitude. Let me explain…

The general situation is that we have a large population of individuals (in
this case, Elbonians), and with each there is associated a real number
(in this case, their level of trust in the president). So we have a probability
distribution, and we’re interested in discovering some statistic $\theta$
(in this case, the mean, but it might instead be the median
or the variance or the 90th percentile). We do this by taking some sample
of $n$ individuals, and then doing *something* with the sampled data to
produce an estimate of $\theta$.

The “something” we do with the sampled data is called an **estimator**.
So, an estimator is a real-valued function on the set of possible sample
data. For instance, if you’re trying to estimate the mean of the
population, and we denote the sample data by $Y_1, \ldots, Y_n$, then the
obvious estimator for the population mean would be just the sample mean,

$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n.$

But it’s important to realize that the best estimator for a given statistic
of the population (such as the mean) needn’t be that same statistic applied
to the sample. For example, suppose we wish to know the mean mass of
men from Mali. Unfortunately, we’ve only weighed three men from Mali, and
two of them are brothers. You *could* use

$\frac{1}{3} Y_1 + \frac{1}{3} Y_2 + \frac{1}{3} Y_3$

as your estimator, but since body mass is somewhat genetic, that would give undue importance to one particular family. At the opposite extreme, you could use

$\frac{1}{2} Y_1 + \frac{1}{4} Y_2 + \frac{1}{4} Y_3$

(where $Y_1$ is the mass of the non-brother). But that would be going too
far, as it gives the non-brother as much importance as the two brothers put
together. Probably the best answer is somewhere in between. Exactly
*where* in between depends on the correlation between masses of brothers,
which is a quantity we might reasonably estimate from data gathered elsewhere
in the world.

(There’s a deliberate echo here of something I wrote previously: in what proportions should we sow poppies, Polish wheat and Persian wheat in order to maximize biological diversity? The similarity is no coincidence.)

There are several qualities we might seek in an estimator. I’ll focus on two.

*High precision.* The **precision** of an estimator is the reciprocal of its variance. To make sense of this, you have to realize that estimators are random variables too! An estimator with high precision, or low variance, is not much changed by the effects of randomness. It will give more or less the same answer if you run it multiple times.

For instance, suppose we’ve decided to do the Elbonian survey by asking 30 people in each of the 5 biggest cities and 20 people from each of 3 chosen villages, then taking some specific weighted mean of the resulting data. If that’s a high-precision estimator, it will give more or less the same final answer no matter which specific Elbonians happen to have been stopped by the pollsters.

*Unbiased.* An estimator of some statistic is **unbiased** if its expected value is equal to that statistic for the population.

For example, suppose we’re trying to estimate the variance of some distribution. If our sample consists of a measly two individuals, then the variance of the sample is likely to be much less than the variance of the population. After all, with only two individuals observed, we’ve barely begun to glimpse the full variation of the population as a whole. It can actually be shown that with a sample size of two, the expected value of the sample variance is half the population variance. So the sample variance is a biased estimator of the population variance, but twice the sample variance is an unbiased estimator.
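The factor of one half can be checked by brute force on a toy population. The sketch below is only an illustration: the fair six-sided die is a made-up population, not part of the survey story, and "sample variance" is taken to mean the variance with divisor $n$. It enumerates every equally likely sample of size two, drawn with replacement, and averages the sample variance exactly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, chosen only for illustration: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]

def mean(xs):
    return sum(Fraction(x) for x in xs) / len(xs)

def variance(xs):
    """Variance with divisor len(xs), i.e. the variance of the list itself."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

population_variance = variance(values)

# All 36 equally likely ordered samples of size two (with replacement).
samples = list(product(values, repeat=2))
expected_sample_variance = sum(variance(s) for s in samples) / len(samples)

print(population_variance)       # 35/12
print(expected_sample_variance)  # 35/24
```

Exact fractions are used so that the comparison doesn't depend on floating-point rounding.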

(Being unbiased is perhaps a less crucial property of an estimator than it might at first appear. Suppose the boss of a chain of pizza takeaways wants to know the average size of pizzas ordered. “Size” could be measured by diameter — what you order by — or area — what you eat. But since the relationship between diameter and area is quadratic rather than linear, an unbiased estimator of one will be a biased estimator of the other.)

No matter what statistic you’re trying to estimate, you can talk about the “effective sample size” of an estimator. But for simplicity, I’ll only talk about estimating the mean.

Here’s a loose definition:

The **effective sample size** of an estimator of the population mean is the number $n_{eff}$ with the property that our estimator has the same precision (or variance) as the estimator got by sampling $n_{eff}$ independent individuals.

Let’s unpack that.

Suppose we choose $n$ individuals at random from the population (with replacement, if you care). So we have independent, identically distributed random variables $Y_1, \ldots, Y_n$. As above, we take the sample mean

$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n$

as our estimator of the population mean. Since variance is additive for
*independent* random variables, the variance of this estimator is

$n \cdot Var\Bigl( \frac{1}{n} Y_1 \Bigr) = n \cdot \frac{1}{n^2} Var(Y_1) = \frac{\sigma^2}{n}$

where $\sigma^2$ is the population variance. The precision of the estimator is, therefore, $n/\sigma^2$. That makes sense: as your sample size $n$ increases, the precision of your estimate increases too.

Now, suppose we have some other estimator $\hat{\mu}$ of the population mean. It’s a random variable, so it has a variance $Var(\hat{\mu})$. The effective sample size of the estimator $\hat{\mu}$ is the number $n_{eff}$ satisfying

$\sigma^2/n_{eff} = Var(\hat{\mu}).$

This doesn’t entirely make sense, as the unique number $n_{eff}$ satisfying
this equation needn’t be an integer, so we can’t sensibly talk about a
sample of size $n_{eff}$. Nevertheless, we can absolutely rigorously
define the **effective sample size** of our estimator $\hat{\mu}$ as

$n_{eff} = \sigma^2/Var(\hat{\mu}).$

And that’s the definition. Differently put,

$\text{effective sample size} = \text{precision } \times \text{population variance}.$

*Trivial examples.* If $\hat{\mu}$ is the mean value of $n$ uncorrelated individuals, then the effective sample size is $n$. If $\hat{\mu}$ is the mean value of $n$ extremely highly correlated individuals, then the variance of the estimator is little less than the variance of a single individual, so the effective sample size is little more than $1$.
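To put numbers on these two extremes, here is a small sketch. It assumes something the text above doesn't: that every pair of distinct individuals has the same correlation `rho`. The function name `n_eff_of_sample_mean` is just a label for this illustration. It computes $\sigma^2/Var(\hat{\mu})$ for the plain sample mean by summing all the covariances.

```python
from fractions import Fraction

def n_eff_of_sample_mean(n, rho, sigma2=Fraction(1)):
    """Effective sample size sigma^2 / Var(mean) for n equicorrelated values.

    Assumption for illustration: every pair of distinct individuals has
    correlation rho, and every individual has variance sigma2.
    """
    def cov(i, j):
        return sigma2 if i == j else rho * sigma2

    # Var(Y_1 + ... + Y_n) is the sum of Cov(Y_i, Y_j) over all pairs (i, j).
    var_sum = sum(cov(i, j) for i in range(n) for j in range(n))
    var_mean = var_sum / n**2
    return sigma2 / var_mean

print(n_eff_of_sample_mean(100, Fraction(0)))        # 100
print(n_eff_of_sample_mean(100, Fraction(99, 100)))  # 10000/9901, about 1.01
```

So a hundred individuals with pairwise correlation $0.99$ are worth barely more than one independent individual.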

Now, suppose our pollsters have come back from their trips to various parts of Elbonia. Together, they’ve asked $n$ individuals how much they trust the president. We want to take that data and use it to estimate the population mean — that is, the mean level of trust in the president across Elbonia — in as precise a way as possible.

We’re going to restrict ourselves to unbiased estimators, so that the
expected value of the estimator is the population mean. We’re also going
to consider only **linear estimators**: those of the form

$a_1 Y_1 + \cdots + a_n Y_n$

where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$ Elbonians surveyed.

Question:

What choice of unbiased linear estimator maximizes the effective sample size?

To answer this, we need to recall some basic statistical notions…

### Correlation and covariance

Variance is a quadratic form, and covariance is the corresponding bilinear
form. That is, take two random variables $X$ and $Y$, with respective
means $\mu_X$ and $\mu_Y$. Then their **covariance** is

$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y)).$

This is bilinear in $X$ and $Y$, and $Cov(X, X) = Var(X)$.

$Cov(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$, the
product of the standard deviations. It’s natural to normalize, dividing
through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$.
This gives the **correlation coefficient**

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X\sigma_Y} \in [-1, 1].$

Alternatively, we can first scale $X$ and $Y$ to have variance $1$, then take the covariance, and this also gives the correlation:

$\rho_{X, Y} = Cov(X/\sigma_X, Y/\sigma_Y).$

Now suppose we have $n$ random variables, $Y_1, \ldots, Y_n$. The
**correlation matrix** $R$ is the $n \times n$ matrix whose $(i, j)$-entry
is $\rho_{Y_i, Y_j}$. Correlation matrices have some easily-proved properties:

- The entries are all in $[-1, 1]$.
- The diagonal entries are all $1$.
- The matrix is symmetric.
- The matrix is positive semidefinite. That’s because the corresponding quadratic form is $(a_1, \ldots, a_n) \mapsto Var(\sum a_i Y_i/\sigma_i)$, and variances are nonnegative.

And actually, it’s not so hard to prove that *any* matrix with these
properties is the correlation matrix of some sequence of random variables.

In what follows, for simplicity, I’ll quietly assume that the correlation
matrices we encounter are *strictly* positive definite. This only amounts to
assuming that no linear combination of the $Y_i$s has variance zero —
in other words, that there are no *exact* linear relationships between the
random variables involved.

### Back to the main question

Here’s where we got to. We surveyed $n$ individuals from our population,
giving $n$ identically distributed *but not necessarily independent* random
variables $Y_1, \ldots, Y_n$. Some of them will be correlated because of
geographical clustering.

We’re trying to use this data to estimate the population mean in as precise a way as possible. Specifically, we’re looking for numbers $a_1, \ldots, a_n$ such that the linear estimator $\sum a_i Y_i$ is unbiased and has the maximum possible effective sample size.

The effective sample size was defined as $n_{eff} = \sigma^2/Var(\sum a_i Y_i)$, where $\sigma^2$ is the variance of the distribution we’re drawing from. Now we need to work out the variance in the denominator.

Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$. I said a moment ago that $(a_1, \ldots, a_n) \mapsto Var (\sum a_i Y_i)$ is the quadratic form corresponding to the bilinear form represented by the covariance matrix. Since each $Y_i$ has variance $\sigma^2$, the covariance matrix is just $\sigma^2$ times the correlation matrix $R$. Hence

$Var(a_1 Y_1 + \cdots + a_n Y_n) = \sigma^2 \cdot a^\ast R a$

where $\ast$ denotes a transpose and $a = (a_1, \ldots, a_n)$.

So, the effective sample size of our estimator is

$1/a^\ast R a.$

We also wanted our estimator to be unbiased. Its expected value is

$E(a_1 Y_1 + \cdots + a_n Y_n) = (a_1 + \cdots + a_n) \mu$

where $\mu$ is the population mean. So, we need $\sum a_i = 1$.
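As a sanity check on the formula $1/a^\ast R a$, here is a sketch that returns to the three men from Mali. The brother-to-brother correlation of $1/2$ is invented purely for illustration, and `effective_sample_size` is a hypothetical helper name. The code compares the two extreme weightings discussed earlier with one candidate in between.

```python
from fractions import Fraction

def effective_sample_size(a, R):
    """Return 1 / (a^T R a) for weights a that sum to 1 (unbiasedness)."""
    if sum(a) != 1:
        raise ValueError("weights must sum to 1 for the estimator to be unbiased")
    n = len(a)
    quad = sum(a[i] * R[i][j] * a[j] for i in range(n) for j in range(n))
    return 1 / quad

# Hypothetical numbers: Y_1 is the non-brother; Y_2 and Y_3 are brothers whose
# masses are assumed to have correlation 1/2.
r = Fraction(1, 2)
R = [[1, 0, 0],
     [0, 1, r],
     [0, r, 1]]

equal = [Fraction(1, 3)] * 3
by_family = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
in_between = [Fraction(3, 7), Fraction(2, 7), Fraction(2, 7)]

for a in (equal, by_family, in_between):
    print(a, effective_sample_size(a, R))
```

With this made-up correlation, the equal weights give $9/4$, the family-by-family weights give $16/7$, and the in-between weights give $7/3$, which beats both extremes.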

Putting this together, the maximum possible effective sample size among all unbiased linear estimators is

$\sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$

Which $a \in \mathbb{R}^n$ achieves this maximum, and what *is* the maximum
possible effective sample size? That’s easy, and in fact it’s something
that’s appeared many times at this blog before…

### The magnitude of a matrix

The **magnitude** $|R|$ of an invertible $n \times n$ matrix $R$ is the sum of
all $n^2$ entries of $R^{-1}$. To calculate it, you don’t need to go as
far as inverting $R$. It’s much easier to find the unique column vector
$w$ satisfying

$R w = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$

(the **weighting** of $R$), then calculate $\sum_i w_i$. This sum is the
magnitude of $R$, since $w_i$ is the $i$th row-sum of $R^{-1}$.
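Here is a minimal sketch of that recipe in plain Python, using exact fractions. The function names `weighting` and `magnitude` are my own labels, and the solver is ordinary Gauss-Jordan elimination rather than anything tailored to magnitude. It assumes that $R$ is invertible.

```python
from fractions import Fraction

def weighting(R):
    """Solve R w = (1, ..., 1) by Gauss-Jordan elimination over exact fractions."""
    n = len(R)
    # Augmented matrix [R | 1].
    M = [[Fraction(x) for x in row] + [Fraction(1)] for row in R]
    for col in range(n):
        # Pick a row with a nonzero entry in this column; one exists if R is invertible.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot entry becomes 1.
        M[col] = [x / M[col][col] for x in M[col]]
        # Clear this column in every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def magnitude(R):
    """Sum of the entries of the weighting, which equals the sum of the entries of R^{-1}."""
    return sum(weighting(R))

half = Fraction(1, 2)
print(weighting([[1, half], [half, 1]]))  # both entries are 2/3
print(magnitude([[1, half], [half, 1]]))  # 4/3
```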

Most of what I’ve written about magnitude has been in the situation where we start with a finite metric space $X = \{x_1, \ldots, x_n\}$, and we use the matrix $Z$ with entries $Z_{i j} = exp(-d(x_i, x_j))$. This turns out to give interesting information about $X$. In the metric situation, the entries of the matrix $Z$ are between $0$ and $1$. Often $Z$ is positive definite (e.g. when $X \subset \mathbb{R}^n$), as correlation matrices are.

When $R$ is positive definite, there’s a third way to describe the magnitude:

$|R| = \sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$

The supremum is attained just when $a = w/|R|$, and the proof is a simple application of the Cauchy–Schwarz inequality.
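In case it’s helpful, here is one way the Cauchy–Schwarz argument can be spelled out. Since $R$ is positive definite, $\langle x, y \rangle = x^\ast R y$ is an inner product on $\mathbb{R}^n$. Let $w$ be the weighting, so that $R w$ is the column vector of $1$s. Then for any $a$ with $\sum a_i = 1$,

$\langle a, w \rangle = a^\ast R w = \sum a_i = 1, \qquad \langle w, w \rangle = w^\ast R w = \sum w_i = |R|.$

Note that $|R| \gt 0$, since $w \neq 0$. The Cauchy–Schwarz inequality now gives

$1 = \langle a, w \rangle^2 \leq \langle a, a \rangle \langle w, w \rangle = (a^\ast R a) \cdot |R|,$

that is, $1/a^\ast R a \leq |R|$. Equality holds exactly when $a$ is a scalar multiple of $w$, and the constraint $\sum a_i = 1$ then forces $a = w/|R|$.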

But that supremum is exactly the expression we had for maximum effective sample size! So:

The maximum possible value of $n_{eff}$ is $|R|$.

Or more wordily:

The maximum effective sample size of an unbiased linear estimator of the mean is the magnitude of the sample correlation matrix.

Or snappily but approximately:

Effective sample size $=$ magnitude of correlation matrix.

Moreover, we know how to attain that maximum. It’s attained if and only if our estimator is

$\frac{1}{|R|} (w_1 Y_1 + \cdots + w_n Y_n)$

where $w = (w_1, \ldots, w_n)$ is the weighting of the correlation matrix.
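This lets us take a stab at the question from the start of the post, under a toy model that is my own assumption rather than anything established above: $k$ locations with $m$ people surveyed at each, correlation $\rho$ between any two people at the same location, and no correlation between locations. For such a matrix, the weighting has every entry equal to $1/(1 + (m - 1)\rho)$ (the code checks this on a small case), so the magnitude is $k m/(1 + (m - 1)\rho)$. The function names are hypothetical.

```python
from fractions import Fraction

def cluster_correlation(k, m, rho):
    """Correlation matrix for k clusters of m people: rho within a cluster, 0 between."""
    n = k * m
    return [[Fraction(1) if i == j
             else (rho if i // m == j // m else Fraction(0))
             for j in range(n)] for i in range(n)]

def cluster_magnitude(k, m, rho):
    """Magnitude of the cluster correlation matrix, via its constant weighting."""
    w_entry = 1 / (1 + (m - 1) * rho)
    return k * m * w_entry

# Check on a small case that the constant vector really solves R w = (1, ..., 1).
k, m, rho = 2, 3, Fraction(1, 2)
R = cluster_correlation(k, m, rho)
w = [1 / (1 + (m - 1) * rho)] * (k * m)
assert all(sum(r_ij * w_j for r_ij, w_j in zip(row, w)) == 1 for row in R)

# Ten locations, with an assumed within-location correlation of 1/20.
for people in (20, 30, 1000):
    print(people, float(cluster_magnitude(10, people, Fraction(1, 20))))
```

In this toy model, the maximum effective sample size can never exceed $k/\rho$, however many people are asked at each location. With ten locations and a within-location correlation of $1/20$, asking $20$ people at each gives $4000/39 \approx 102.6$. If the correlation were $1/10$ or more, no number of interviews at ten locations would reach $100$.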

I’m not too sure where this “result” — observation, really — comes from. I learned it from the statistician Paul Blackwell at Sheffield, who, like me, had been reading this paper:

Andrew Solow and Stephen Polasky, Measuring biological diversity. *Environmental and Ecological Statistics* 1 (1994), 95–103.

In turn, Solow and Polasky refer to this:

Morris Eaton, A group action on covariances with applications to the comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong (eds.), *Stochastic inequalities: Papers from the AMS-IMS-SIAM Joint Summer Research Conference held in Seattle, Washington, July 1991*, Institute of Mathematical Statistics Lecture Notes-Monograph Series, Volume 22, 1992.

But the result is so simple that I’d imagine it’s much older. I’ve been wondering whether it’s essentially the Gauss-Markov theorem; I thought it was, then I thought it wasn’t. Does anyone know?

### The surprising behaviour of effective sample size

You might expect the effective size of a sample of $n$ individuals to be at most $n$. It’s not.

You might expect the effective sample size to go down as the correlations within the sample go up. It doesn’t.

This behaviour appears in even the simplest nontrivial example:

*Example.* Suppose our sample consists of just two individuals. Call the sampled values $Y_1$ and $Y_2$, and write the correlation matrix as

$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$

Then the maximum-precision unbiased linear estimator is $\frac{1}{2}(Y_1 + Y_2)$, and its effective sample size is

$|R| = \frac{2}{1 + \rho}.$

As the correlation $\rho$ between the two variables increases from $0$ to $1$, the effective sample size decreases from $2$ to $1$, as you’d expect.

But when $\rho \lt 0$, the effective sample size is *greater* than 2. In fact, as $\rho \to -1$, the effective sample size tends to $\infty$. That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing $Y_1 = \mu + \varepsilon_1$ and $Y_2 = \mu + \varepsilon_2$, we have $\varepsilon_1 \approx -\varepsilon_2$, and so $\frac{1}{2}(Y_1 + Y_2)$ is a very good estimator of $\mu$. In the extreme, when $\rho = -1$, it’s an *exact* estimator of $\mu$: it’s infinitely precise.
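For anyone who wants to see where $2/(1 + \rho)$ comes from, the weighting $w = (w_1, w_2)$ of this $2 \times 2$ matrix is defined by the equations

$w_1 + \rho w_2 = 1, \qquad \rho w_1 + w_2 = 1.$

For $-1 \lt \rho \lt 1$, the solution is $w_1 = w_2 = 1/(1 + \rho)$, so

$|R| = w_1 + w_2 = \frac{2}{1 + \rho}, \qquad \frac{w}{|R|} = \Bigl( \frac{1}{2}, \frac{1}{2} \Bigr).$

As a check, the variance of the estimator is $Var\bigl(\frac{1}{2}(Y_1 + Y_2)\bigr) = \frac{\sigma^2}{4}(1 + 2\rho + 1) = \sigma^2 \cdot \frac{1 + \rho}{2}$, and dividing $\sigma^2$ by this gives $2/(1 + \rho)$ again.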

The fact that the effective sample size can be greater than the actual sample size seems to be very well known. For instance, there’s a whole page about it in the documentation for Q, which is apparently “analysis software for market research”.

What’s interesting is that this doesn’t only occur when some of the variables are negatively correlated. It can also happen when all the correlations are nonnegative, as in the following example from the paper by Eaton cited above.

*Example.* Consider the correlation matrix

$R = \begin{pmatrix} 1 &0 &\rho \\ 0 &1 &\rho \\ \rho &\rho &1 \end{pmatrix}$

where $0 \leq \rho \lt \sqrt{2}/2 = 0.707\ldots$. This is positive definite, so it’s the correlation matrix of some random variables $Y_1, Y_2, Y_3$.

A routine computation shows that

$|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$

As we’ve shown, this is the greatest possible effective sample size you can achieve by taking an unbiased linear combination of $Y_1$, $Y_2$ and $Y_3$.

When $\rho = 0$, it’s $3$, as you’d expect: the variables are uncorrelated. As $\rho$ increases, $|R|$ decreases, again as you’d expect: more correlation between the variables leads to a smaller effective sample size. This behaviour continues until $\rho = 1/2$, where $|R| = 2$.

But then something strange happens. As $\rho$ increases from $1/2$ to $\sqrt{2}/2$, the effective sample size increases from $2$ to $\infty$.

*Increasing the correlation increases the effective sample size.*

For instance, when $\rho = 0.7$, we have $|R| = 10$: the maximum-precision estimator is as precise as if we’d chosen $10$ independent individuals! For that value of $\rho$, the maximum-precision estimator turns out to be

$\frac{3}{2} Y_1 + \frac{3}{2} Y_2 - 2 Y_3.$

Go figure!
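The "routine computation" is easy to check by machine. In the sketch below, the closed-form weighting $w = \bigl(1 - \rho, \, 1 - \rho, \, 1 - 2\rho\bigr)/(1 - 2\rho^2)$ is something I worked out by solving $R w = (1, 1, 1)$ by hand, so the code verifies it rather than taking it on trust. The function name `eaton_example` is just a label.

```python
from fractions import Fraction

def eaton_example(rho):
    """Magnitude and optimal weights for R = [[1,0,rho],[0,1,rho],[rho,rho,1]].

    Pass rho as a Fraction with 0 <= rho < sqrt(2)/2 so the arithmetic is exact.
    """
    R = [[1, 0, rho], [0, 1, rho], [rho, rho, 1]]
    d = 1 - 2 * rho**2  # determinant of R
    w = [(1 - rho) / d, (1 - rho) / d, (1 - 2 * rho) / d]
    # Confirm that w solves R w = (1, 1, 1), i.e. that it is the weighting.
    assert all(sum(r * x for r, x in zip(row, w)) == 1 for row in R)
    magnitude = sum(w)
    return magnitude, [x / magnitude for x in w]

for rho in (Fraction(0), Fraction(1, 2), Fraction(7, 10)):
    magnitude, weights = eaton_example(rho)
    print(rho, magnitude, weights)
```

At $\rho = 1/2$ the optimal estimator gives $Y_3$ weight zero, and beyond that point its weight is negative, which is where the effective sample size starts to climb.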

This is very like the fact that a metric space with $n$ points can have magnitude (“effective number of points”) greater than $n$, even if the associated matrix $Z$ is positive definite.

These examples may seem counterintuitive, but Eaton cautions us to beware of our feeble intuitions:

These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.

Anyone with any statistical knowledge who’s still reading will easily have picked up on the fact that I’m a total amateur. If that’s you, I’d love to hear your comments!

## Re: Effective Sample Size

I am also an amateur at statistics. However, on the question of how n positively correlated samples can have an effective sample size greater than n, I wonder how you can know what the true correlation matrix of your samples is. Presumably that knowledge is what somehow gets you the extra power of your experiment.