Effective Sample Size

December 18, 2014

Posted by Tom Leinster

On a scale of 0 to 10, how much does the average citizen of the Republic of Elbonia trust the president?

You’re conducting a survey to find out, and you’ve calculated that in order to get the precision you want, you’re going to need a sample of 100 statistically independent individuals. Now you have to decide how to do this.

You could stand in the central square of the capital city and survey the next 100 people who walk by. But these opinions won’t be independent: probably politics in the capital isn’t representative of politics in Elbonia as a whole.

So you consider travelling to 100 different locations in the country and asking one Elbonian at each. But apart from anything else, this is far too expensive for you to do.

Maybe a compromise would be OK. You could go to 10 locations and ask… 20 people at each? 30? How many would you need in order to match the precision of 100 independent individuals — to have an “effective sample size” of 100?

The answer turns out to be closely connected to a quantity I’ve written about many times before: magnitude. Let me explain…

The general situation is that we have a large population of individuals (in this case, Elbonians), and with each there is associated a real number (in this case, their level of trust in the president). So we have a probability distribution, and we’re interested in discovering some statistic $\theta$ (in this case, the mean, but it might instead be the median or the variance or the 90th percentile). We do this by taking some sample of $n$ individuals, and then doing something with the sampled data to produce an estimate of $\theta$ .

The “something” we do with the sampled data is called an estimator. So, an estimator is a real-valued function on the set of possible sample data. For instance, if you’re trying to estimate the mean of the population, and we denote the sample data by $Y_1, \ldots, Y_n$ , then the obvious estimator for the population mean would be just the sample mean,

$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n.$

But it’s important to realize that the best estimator for a given statistic of the population (such as the mean) needn’t be that same statistic applied to the sample. For example, suppose we wish to know the mean mass of men from Mali. Unfortunately, we’ve only weighed three men from Mali, and two of them are brothers. You could use

$\frac{1}{3} Y_1 + \frac{1}{3} Y_2 + \frac{1}{3} Y_3$

as your estimator, but since body mass is somewhat genetic, that would give undue importance to one particular family. At the opposite extreme, you could use

$\frac{1}{2} Y_1 + \frac{1}{4} Y_2 + \frac{1}{4} Y_3$

(where $Y_1$ is the mass of the non-brother). But that would be going too far, as it gives the non-brother as much importance as the two brothers put together. Probably the best answer is somewhere in between. Exactly where in between depends on the correlation between masses of brothers, which is a quantity we might reasonably estimate from data gathered elsewhere in the world.

(There’s a deliberate echo here of something I wrote previously: in what proportions should we sow poppies, Polish wheat and Persian wheat in order to maximize biological diversity? The similarity is no coincidence.)

There are several qualities we might seek in an estimator. I’ll focus on two.

High precision The precision of an estimator is the reciprocal of its variance. To make sense of this, you have to realize that estimators are random variables too! An estimator with high precision, or low variance, is not much changed by the effects of randomness. It will give more or less the same answer if you run it multiple times.

For instance, suppose we’ve decided to do the Elbonian survey by asking 30 people in each of the 5 biggest cities and 20 people from each of 3 chosen villages, then taking some specific weighted mean of the resulting data. If that’s a high-precision estimator, it will give more or less the same final answer no matter which specific Elbonians happen to have been stopped by the pollsters.

Unbiased An estimator of some statistic is unbiased if its expected value is equal to that statistic for the population.

For example, suppose we’re trying to estimate the variance of some distribution. If our sample consists of a measly two individuals, then the variance of the sample is likely to be much less than the variance of the population. After all, with only two individuals observed, we’ve barely begun to glimpse the full variation of the population as a whole. It can actually be shown that with a sample size of two, the expected value of the sample variance is half the population variance. So the sample variance is a biased estimator of the population variance, but twice the sample variance is an unbiased estimator.
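The factor of one half can be checked exactly on a small example. The sketch below assumes a hypothetical population (a fair six-sided die, which is not in the post) and enumerates every equally likely sample of size two drawn with replacement; `variance` here divides by the sample size, matching the "variance of the sample" above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: a fair six-sided die (any finite distribution would do).
values = [Fraction(v) for v in range(1, 7)]

def variance(xs):
    """Variance with divisor len(xs), i.e. the plain variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

population_variance = variance(values)

# All 36 equally likely ordered samples of size two, drawn with replacement.
samples = list(product(values, repeat=2))
expected_sample_variance = sum(variance(s) for s in samples) / len(samples)

print(population_variance)       # 35/12
print(expected_sample_variance)  # 35/24
```

Exact fractions are used rather than simulation so that the halving shows up as an identity, not an approximation.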

(Being unbiased is perhaps a less crucial property of an estimator than it might at first appear. Suppose the boss of a chain of pizza takeaways wants to know the average size of pizzas ordered. “Size” could be measured by diameter — what you order by — or area — what you eat. But since the relationship between diameter and area is quadratic rather than linear, an unbiased estimator of one will be a biased estimator of the other.)

No matter what statistic you’re trying to estimate, you can talk about the “effective sample size” of an estimator. But for simplicity, I’ll only talk about estimating the mean.

Here’s a loose definition:

The effective sample size of an estimator of the population mean is the number $n_{eff}$ with the property that our estimator has the same precision (or variance) as the estimator got by sampling $n_{eff}$ independent individuals.

Let’s unpack that.

Suppose we choose $n$ individuals at random from the population (with replacement, if you care). So we have independent, identically distributed random variables $Y_1, \ldots, Y_n$ . As above, we take the sample mean

$\frac{1}{n} Y_1 + \cdots + \frac{1}{n} Y_n$

as our estimator of the population mean. Since variance is additive for independent random variables, the variance of this estimator is

$n \cdot Var\Bigl( \frac{1}{n} Y_1 \Bigr) = n \cdot \frac{1}{n^2} Var(Y_1) = \frac{\sigma^2}{n}$

where $\sigma^2$ is the population variance. The precision of the estimator is, therefore, $n/\sigma^2$ . That makes sense: as your sample size $n$ increases, the precision of your estimate increases too.

Now, suppose we have some other estimator $\hat{\mu}$ of the population mean. It’s a random variable, so it has a variance $Var(\hat{\mu})$ . The effective sample size of the estimator $\hat{\mu}$ is the number $n_{eff}$ satisfying

$\sigma^2/n_{eff} = Var(\hat{\mu}).$

This doesn’t entirely make sense, as the unique number $n_{eff}$ satisfying this equation needn’t be an integer, so we can’t sensibly talk about a sample of size $n_{eff}$ . Nevertheless, we can absolutely rigorously define the effective sample size of our estimator $\hat{\mu}$ as

$n_{eff} = \sigma^2/Var(\hat{\mu}).$

And that’s the definition. Differently put,

$\text{effective sample size} = \text{precision } \times \text{population variance}.$

Trivial examples If $\hat{\mu}$ is the mean value of $n$ uncorrelated individuals, then the effective sample size is $n$ . If $\hat{\mu}$ is the mean value of $n$ extremely highly correlated individuals, then the variance of the estimator is little less than the variance of a single individual, so the effective sample size is little more than $1$ .
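To make the second case quantitative, here is a worked version under an extra assumption that is not in the text above: every individual has variance $\sigma^2$ and every pair of distinct individuals has the same correlation $\rho \geq 0$, so that each pair has covariance $\rho\sigma^2$. Then the sample mean has variance

$Var\Bigl( \frac{1}{n} \sum_i Y_i \Bigr) = \frac{1}{n^2} \bigl( n \sigma^2 + n(n - 1) \rho \sigma^2 \bigr) = \frac{\sigma^2 (1 + (n - 1)\rho)}{n},$

so its effective sample size is

$n_{eff} = \frac{n}{1 + (n - 1)\rho}.$

At $\rho = 0$ this is $n$, and as $\rho \to 1$ it tends to $1$, in line with the two trivial examples.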

Now, suppose our pollsters have come back from their trips to various parts of Elbonia. Together, they’ve asked $n$ individuals how much they trust the president. We want to take that data and use it to estimate the population mean — that is, the mean level of trust in the president across Elbonia — in as precise a way as possible.

We’re going to restrict ourselves to unbiased estimators, so that the expected value of the estimator is the population mean. We’re also going to consider only linear estimators: those of the form

$a_1 Y_1 + \cdots + a_n Y_n$

where $Y_1, \ldots, Y_n$ are the trust levels expressed by the $n$ Elbonians surveyed.

Question:

What choice of unbiased linear estimator maximizes the effective sample size?

To answer this, we need to recall some basic statistical notions…

Correlation and covariance

Variance is a quadratic form, and covariance is the corresponding bilinear form. That is, take two random variables $X$ and $Y$ , with respective means $\mu_X$ and $\mu_Y$ . Then their covariance is

$Cov(X, Y) = E((X - \mu_X)(Y - \mu_Y)).$

This is bilinear in $X$ and $Y$ , and $Cov(X, X) = Var(X)$ .

$Cov(X, Y)$ is bounded above and below by $\pm \sigma_X \sigma_Y$ , the product of the standard deviations. It’s natural to normalize, dividing through by $\sigma_X \sigma_Y$ to obtain a number between $-1$ and $1$ . This gives the correlation coefficient

$\rho_{X, Y} = \frac{Cov(X, Y)}{\sigma_X\sigma_Y} \in [-1, 1].$

Alternatively, we can first scale $X$ and $Y$ to have variance $1$ , then take the covariance, and this also gives the correlation:

$\rho_{X, Y} = Cov(X/\sigma_X, Y/\sigma_Y).$

Now suppose we have $n$ random variables, $Y_1, \ldots, Y_n$ . The correlation matrix $R$ is the $n \times n$ matrix whose $(i, j)$ -entry is $\rho_{Y_i, Y_j}$ . Correlation matrices have some easily-proved properties:

The entries are all in $[-1, 1]$ .
The diagonal entries are all $1$ .
The matrix is symmetric.
The matrix is positive semidefinite. That’s because the corresponding quadratic form is $(a_1, \ldots, a_n) \mapsto Var(\sum a_i Y_i/\sigma_i)$ , and variances are nonnegative.

And actually, it’s not so hard to prove that any matrix with these properties is the correlation matrix of some sequence of random variables.

In what follows, for simplicity, I’ll quietly assume that the correlation matrices we encounter are strictly positive definite. This only amounts to assuming that no linear combination of the $Y_i$ s has variance zero — in other words, that there are no exact linear relationships between the random variables involved.
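As a small illustration of the properties listed above, the sketch below rescales a covariance matrix into a correlation matrix and checks them numerically. The covariance matrix and the helper name `correlation_from_covariance` are hypothetical, chosen only for the example.

```python
import numpy as np

def correlation_from_covariance(cov):
    """Rescale a covariance matrix so that every variable has variance 1."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Hypothetical covariance matrix, chosen only for illustration.
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 9.0, 3.0],
                [0.0, 3.0, 16.0]])
R = correlation_from_covariance(cov)

print(np.allclose(np.diag(R), 1.0))       # diagonal entries are 1
print(np.allclose(R, R.T))                # symmetric
print(np.all(np.abs(R) <= 1.0 + 1e-12))   # entries lie in [-1, 1]
print(np.linalg.eigvalsh(R).min() > 0)    # strictly positive definite in this case
```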

Back to the main question

Here’s where we got to. We surveyed $n$ individuals from our population, giving $n$ identically distributed but not necessarily independent random variables $Y_1, \ldots, Y_n$ . Some of them will be correlated because of geographical clustering.

We’re trying to use this data to estimate the population mean in as precise a way as possible. Specifically, we’re looking for numbers $a_1, \ldots, a_n$ such that the linear estimator $\sum a_i Y_i$ is unbiased and has the maximum possible effective sample size.

The effective sample size was defined as $n_{eff} = \sigma^2/Var(\sum a_i Y_i)$ , where $\sigma^2$ is the variance of the distribution we’re drawing from. Now we need to work out the variance in the denominator.

Let $R$ denote the correlation matrix of $Y_1, \ldots, Y_n$ . I said a moment ago that $(a_1, \ldots, a_n) \mapsto Var (\sum a_i Y_i)$ is the quadratic form corresponding to the bilinear form represented by the covariance matrix. Since each $Y_i$ has variance $\sigma^2$ , the covariance matrix is just $\sigma^2$ times the correlation matrix $R$ . Hence

$Var(a_1 Y_1 + \cdots + a_n Y_n) = \sigma^2 \cdot a^\ast R a$

where $\ast$ denotes a transpose and $a = (a_1, \ldots, a_n)$ .

So, the effective sample size of our estimator is

$1/a^\ast R a.$

We also wanted our estimator to be unbiased. Its expected value is

$E(a_1 Y_1 + \cdots + a_n Y_n) = (a_1 + \cdots + a_n) \mu$

where $\mu$ is the population mean. So, we need $\sum a_i = 1$ .

Putting this together, the maximum possible effective sample size among all unbiased linear estimators is

$\sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$
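Before looking for the maximum, it may help to evaluate the quantity inside the supremum for a couple of specific weight vectors. This is a minimal sketch: the function name `effective_sample_size` and the value of `rho` are my own choices for illustration.

```python
import numpy as np

def effective_sample_size(a, R):
    """n_eff = 1 / (a^T R a) for an unbiased linear estimator sum(a_i Y_i)."""
    a = np.asarray(a, dtype=float)
    if not np.isclose(a.sum(), 1.0):
        raise ValueError("weights must sum to 1 for the estimator to be unbiased")
    return 1.0 / (a @ R @ a)

rho = 0.5  # hypothetical correlation between two surveyed individuals
R = np.array([[1.0, rho],
              [rho, 1.0]])

print(effective_sample_size([1.0, 0.0], R))  # use only the first individual
print(effective_sample_size([0.5, 0.5], R))  # plain sample mean
```

Ignoring the second individual gives an effective sample size of $1$, while the plain mean does better, which is the kind of comparison the supremum formalizes.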

Which $a \in \mathbb{R}^n$ achieves this maximum, and what is the maximum possible effective sample size? That’s easy, and in fact it’s something that’s appeared many times at this blog before…

The magnitude of a matrix

The magnitude $|R|$ of an invertible $n \times n$ matrix $R$ is the sum of all $n^2$ entries of $R^{-1}$ . To calculate it, you don’t need to go as far as inverting $R$ . It’s much easier to find the unique column vector $w$ satisfying

$R w = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$

(the weighting of $R$ ), then calculate $\sum_i w_i$ . This sum is the magnitude of $R$ , since $w_i$ is the $i$ th row-sum of $R^{-1}$ .
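Here is a short numerical sketch of that recipe, using NumPy's linear solver. The function names and the example matrix are hypothetical; the only thing taken from the text is the definition itself.

```python
import numpy as np

def weighting(R):
    """Solve R w = (1, ..., 1)^T for the weighting w."""
    R = np.asarray(R, dtype=float)
    return np.linalg.solve(R, np.ones(R.shape[0]))

def magnitude(R):
    """Sum of the weighting, equal to the sum of all entries of R^{-1}."""
    return weighting(R).sum()

# Hypothetical invertible matrix, chosen only for illustration.
R = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])

print(magnitude(R))
print(np.linalg.inv(R).sum())  # same value, via the explicit inverse
```

Solving one linear system is generally cheaper and numerically better behaved than forming the full inverse, which is why the weighting route is preferred.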

Most of what I’ve written about magnitude has been in the situation where we start with a finite metric space $X = \{x_1, \ldots, x_n\}$ , and we use the matrix $Z$ with entries $Z_{i j} = exp(-d(x_i, x_j))$ . This turns out to give interesting information about $X$ . In the metric situation, the entries of the matrix $Z$ are between $0$ and $1$ . Often $Z$ is positive definite (e.g. when $X \subset \mathbb{R}^n$ ), as correlation matrices are.

When $R$ is positive definite, there’s a third way to describe the magnitude:

$|R| = \sup \Bigl\{ \frac{1}{a^\ast R a} \, : \, a \in \mathbb{R}^n, \, \sum a_i = 1 \Bigr\}.$

The supremum is attained just when $a = w/|R|$ , and the proof is a simple application of the Cauchy–Schwarz inequality.
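In case it’s useful, here is one way the Cauchy–Schwarz argument can go (a sketch of my own filling-in, with $\mathbf{1}$ denoting the column vector of $1$s). Since $R$ is positive definite, $\langle x, y \rangle = x^\ast R y$ is an inner product on $\mathbb{R}^n$. For any $a$ with $\sum a_i = 1$,

$1 = a^\ast \mathbf{1} = a^\ast R w = \langle a, w \rangle \leq \sqrt{a^\ast R a} \, \sqrt{w^\ast R w},$

and

$w^\ast R w = w^\ast \mathbf{1} = \sum_i w_i = |R|.$

Squaring gives $1/(a^\ast R a) \leq |R|$ . Equality holds exactly when $a$ is a scalar multiple of $w$ , and the constraint $\sum a_i = 1$ then forces $a = w/|R|$ .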

But that supremum is exactly the expression we had for maximum effective sample size! So:

The maximum possible value of $n_{eff}$ is $|R|$ .

Or more wordily:

The maximum effective sample size of an unbiased linear estimator of the mean is the magnitude of the sample correlation matrix.

Or less wordily but approximately:

Effective sample size $=$ magnitude of correlation matrix.

Moreover, we know how to attain that maximum. It’s attained if and only if our estimator is

$\frac{1}{|R|} (w_1 Y_1 + \cdots + w_n Y_n)$

where $w = (w_1, \ldots, w_n)$ is the weighting of the correlation matrix.
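To see this in action, here is a sketch that revisits the three men from Mali. It assumes, beyond what the post states, that the non-brother is uncorrelated with both brothers and that the brothers have correlation `r`; the function names are hypothetical.

```python
import numpy as np

def best_linear_unbiased_weights(R):
    """Return (a, n_eff) with a = w / |R| and n_eff = |R|, where R w = (1, ..., 1)^T."""
    R = np.asarray(R, dtype=float)
    w = np.linalg.solve(R, np.ones(R.shape[0]))
    magnitude = w.sum()
    return w / magnitude, magnitude

def mali_correlation(r):
    """Y_1 is the non-brother; Y_2 and Y_3 are brothers with correlation r."""
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, r],
                     [0.0, r, 1.0]])

for r in (0.0, 0.5, 0.99):
    a, n_eff = best_linear_unbiased_weights(mali_correlation(r))
    print(r, a, n_eff)
```

Under these assumptions the weights work out to $\bigl(\frac{1+r}{3+r}, \frac{1}{3+r}, \frac{1}{3+r}\bigr)$ , which moves from $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ at $r = 0$ towards $(\frac{1}{2}, \frac{1}{4}, \frac{1}{4})$ as $r \to 1$ : the two extremes discussed near the start of the post.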

I’m not too sure where this “result” — observation, really — comes from. I learned it from the statistician Paul Blackwell at Sheffield, who, like me, had been reading this paper:

Andrew Solow and Stephen Polasky, Measuring biological diversity. Environmental and Ecological Statistics 1 (1994), 95–103.

In turn, Solow and Polasky refer to this:

Morris Eaton, A group action on covariances with applications to the comparison of linear normal experiments. In: Moshe Shaked and Y.L. Tong (eds.), Stochastic inequalities: Papers from the AMS-IMS-SIAM Joint Summer Research Conference held in Seattle, Washington, July 1991, Institute of Mathematical Statistics Lecture Notes — Monograph Series, Volume 22, 1992.

But the result is so simple that I’d imagine it’s much older. I’ve been wondering whether it’s essentially the Gauss-Markov theorem; I thought it was, then I thought it wasn’t. Does anyone know?
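Returning to the opening question about Elbonia, here is a rough sketch of how the magnitude could be used to compare survey designs. Everything specific in it is an assumption made for illustration: individuals at the same location share a common correlation `rho`, individuals at different locations are uncorrelated, and the value of `rho` is invented.

```python
import numpy as np

def clustered_correlation(locations, per_location, rho):
    """Block-diagonal correlation matrix: rho within a location, 0 across locations."""
    block = np.full((per_location, per_location), rho)
    np.fill_diagonal(block, 1.0)
    return np.kron(np.eye(locations), block)

def magnitude(R):
    """Sum of the weighting w, where R w = (1, ..., 1)^T."""
    return np.linalg.solve(R, np.ones(R.shape[0])).sum()

rho = 0.1  # hypothetical within-location correlation, purely illustrative
for m in (20, 30):
    R = clustered_correlation(10, m, rho)
    print(m, magnitude(R))
```

Under these assumptions the magnitude has the closed form $10m/(1 + (m - 1)\rho)$ for $m$ people at each of $10$ locations, which approaches $10/\rho$ as $m$ grows. So with $\rho = 0.1$ , no number of people per location would quite reach an effective sample size of $100$ ; how realistic that is depends entirely on the assumed correlation structure.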

The surprising behaviour of effective sample size

You might expect the effective size of a sample of $n$ individuals to be at most $n$ . It’s not.

You might expect the effective sample size to go down as the correlations within the sample go up. It doesn’t.

This behaviour appears in even the simplest nontrivial example:

Example Suppose our sample consists of just two individuals. Call the sampled values $Y_1$ and $Y_2$ , and write the correlation matrix as $R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$ Then the maximum-precision unbiased linear estimator is $\frac{1}{2}(Y_1 + Y_2)$ , and its effective sample size is $|R| = \frac{2}{1 + \rho}.$ As the correlation $\rho$ between the two variables increases from $0$ to $1$ , the effective sample size decreases from $2$ to $1$ , as you’d expect.

But when $\rho \lt 0$ , the effective sample size is greater than 2. In fact, as $\rho \to -1$ , the effective sample size tends to $\infty$ . That’s intuitively plausible. For if $\rho$ is close to $-1$ then, writing $Y_1 = \mu + \varepsilon_1$ and $Y_2 = \mu + \varepsilon_2$ , we have $\varepsilon_1 \approx -\varepsilon_2$ , and so $\frac{1}{2}(Y_1 + Y_2)$ is a very good estimator of $\mu$ . In the extreme, when $\rho = -1$ , it’s an exact estimator of $\mu$ — it’s infinitely precise.

The fact that the effective sample size can be greater than the actual sample size seems to be very well known. For instance, there’s a whole page about it in the documentation for Q, which is apparently “analysis software for market research”.

What’s interesting is that this doesn’t only occur when some of the variables are negatively correlated. It can also happen when all the correlations are nonnegative, as in the following example from the paper by Eaton cited above.

Example Consider the correlation matrix $R = \begin{pmatrix} 1 &0 &\rho \\ 0 &1 &\rho \\ \rho &\rho &1 \end{pmatrix}$ where $0 \leq \rho \lt \sqrt{2}/2 = 0.707\ldots$ . This is positive definite, so it’s the correlation matrix of some random variables $Y_1, Y_2, Y_3$ .

A routine computation shows that $|R| = \frac{3 - 4\rho}{1 - 2\rho^2}.$ As we’ve shown, this is the greatest possible effective sample size you can achieve by taking an unbiased linear combination of $Y_1$ , $Y_2$ and $Y_3$ .

When $\rho = 0$ , it’s $3$ , as you’d expect: the variables are uncorrelated. As $\rho$ increases, $|R|$ decreases, again as you’d expect: more correlation between the variables leads to a smaller effective sample size. This behaviour continues until $\rho = 1/2$ , where $|R| = 2$ .

But then something strange happens. As $\rho$ increases from $1/2$ to $\sqrt{2}/2$ , the effective sample size increases from $2$ to $\infty$ . Increasing the correlation increases the effective sample size. For instance, when $\rho = 0.7$ , we have $|R| = 10$ : the maximum-precision estimator is as precise as if we’d chosen $10$ independent individuals! For that value of $\rho$ , the maximum-precision estimator turns out to be $\frac{3}{2} Y_1 + \frac{3}{2} Y_2 - 2 Y_3.$ Go figure!
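The numbers in this example are easy to check numerically. The sketch below solves for the weighting of Eaton's matrix at a few values of $\rho$ ; the function name `eaton_matrix` is just a label I've chosen.

```python
import numpy as np

def eaton_matrix(rho):
    """The 3x3 correlation matrix from the example above."""
    return np.array([[1.0, 0.0, rho],
                     [0.0, 1.0, rho],
                     [rho, rho, 1.0]])

for rho in (0.0, 0.5, 0.7):
    R = eaton_matrix(rho)
    w = np.linalg.solve(R, np.ones(3))   # weighting of R
    print(rho, w.sum(), w / w.sum())     # magnitude and optimal coefficients
```

The negative coefficient on $Y_3$ at $\rho = 0.7$ is what lets the estimator cancel the part of the noise that $Y_3$ shares with $Y_1$ and $Y_2$ .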

This is very like the fact that a metric space with $n$ points can have magnitude (“effective number of points”) greater than $n$ , even if the associated matrix $Z$ is positive definite.

These examples may seem counterintuitive, but Eaton cautions us to beware of our feeble intuitions:

These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.

Anyone with any statistical knowledge who’s still reading will easily have picked up on the fact that I’m a total amateur. If that’s you, I’d love to hear your comments!

Posted at December 18, 2014 10:25 PM UTC

26 Comments & 1 Trackback

Re: Effective Sample Size

I am also an amateur at statistics. However, on the question of how n positively correlated samples can have an effective sample size greater than n, I wonder how you can know what the true correlation matrix of your samples is. Presumably that knowledge is what somehow gets you the extra power of your experiment.

Posted by: Jonathan Kirby on December 19, 2014 9:35 AM

Re: Effective Sample Size

That’s a question I’ve wondered about myself.

I suppose one can never know the correlation, but one can take a good guess at it. Perhaps there’s a survey of trust in the Elbonian president taken annually, and although that trust level swings around wildly from year to year, the correlations within and between different towns remain about the same. In that case, it would be reasonable to assume that they’ll be about the same this year.

Or perhaps we know nothing about the mass of men in Mali, but we do know how well-correlated the masses of brothers tend to be in other countries, and we therefore feel it’s safe to assume that the correlation is similar there.

But I’d be happy if someone more knowledgeable gave their point of view.

Posted by: Tom Leinster on December 19, 2014 10:29 AM

Re: Effective Sample Size

Probably part of the story is that having a high $n_{\mathrm{eff}}$ doesn’t really guarantee that your sample is “statistically powerful”.

For one thing, notice in the two examples that Tom gave that as the magnitude tends to $\infty$ , the covariance matrix tends toward a singular matrix, for which no weighting exists. When no weighting exists, it seems that you can’t actually construct an unbiased estimator.

What if the magnitude is very large, but not infinite? In the 2-element sample, the weighting for covariance of $\rho$ is $[\frac 1 {1+\rho}, \frac 1 {1+\rho}]^T$ . So if you think that $\rho$ is close to $-1$ , but it could be off by $\epsilon$ , then all you know about the correct weighting is that it’s of the form $[\alpha, \alpha]^T$ for some $\alpha$ with $\frac 1 \epsilon\leq \alpha \leq \infty$ . So actually choosing a correct estimator is infeasible.

I’m trying to wrap my head around this intuitively – if a two-element sample is identically distributed and perfectly anticorrelated, then their sum always gives the mean exactly, right? So why doesn’t $[1,1]^T$ come out as the optimal estimator?

Anyway, I’m guessing the connection between infinite $n_\mathrm{eff}$ and a singular covariance matrix is a general phenomenon. Having a very high $n_{\mathrm{eff}}$ probably goes hand in hand with having a nearly-singular covariance matrix and having a weighting which is very sensitive to perturbations in the matrix.

Posted by: Tim Campion on December 19, 2014 6:50 PM

Re: Effective Sample Size

Like the first example, you can think of the second example as you having the ability to cancel out noise. We can produce the second covariance matrix with

$Y_1 = \mu + \epsilon_1 \quad Y_2 = \mu + \epsilon_2 \quad Y_3 = \mu + \rho\epsilon_1 + \rho\epsilon_2 + \left(\sqrt{1 - 2\rho^2}\right)\epsilon_3$

where the $\epsilon_i$ s are independent with variance 1, mean 0. When the coefficient of $\epsilon_3$ is zero you can get a linear combination with just $\mu$ , and the covariance matrix is a sum of two rank 1 matrices.

Posted by: ap on December 20, 2014 3:20 AM

Re: Effective Sample Size

Thanks.

Let’s see if I understand.

When $\rho = \sqrt{2}/2$ (or, as you say, when $\sqrt{1 - 2\rho^2} = 0$ ), we get

$Y_3 = \mu + \rho\epsilon_1 + \rho\epsilon_2 = (1 - 2\rho)\mu + \rho Y_1 + \rho Y_2$

and so

$\mu = \frac{\rho Y_1 + \rho Y_2 - Y_3}{2\rho - 1} = \frac{Y_1 + Y_2 - \sqrt{2} Y_3}{2 - \sqrt{2}}.$

So if we know $\rho$ and $Y_1$ , $Y_2$ and $Y_3$ then we know $\mu$ .

What is the significance of this:

the covariance matrix is a sum of two rank 1 matrices

Posted by: Tom Leinster on December 20, 2014 9:52 PM

Re: Effective Sample Size

In case anyone read ap’s comment and is wondering why the matrix

$R = \begin{pmatrix} 1&0&\rho\\ 0&1&\rho\\ \rho&\rho&1 \end{pmatrix}$

is the correlation matrix of

$Y_1 = \mu + \epsilon_1, \qquad Y_2 = \mu + \epsilon_2, \qquad Y_3 = \mu + \rho\epsilon_1 + \rho\epsilon_2 + \Bigl(\sqrt{1 - 2\rho^2} \Bigr) \epsilon_3$

where $\mu$ is any constant and the $\epsilon_i$ are independent with mean $0$ and variance $1$ , here’s the story.

I said in my post that any real positive semidefinite $n \times n$ matrix $R$ with $1$ s down the diagonal is the correlation matrix of some $n$ -tuple of random variables. The proof I know uses the fact that $R$ has a real symmetric square root $S$ . In fact, all that really matters is that there’s some real matrix $S$ satisfying $S S^t = R$ .

Now take independent random variables $\epsilon_1, \ldots, \epsilon_n$ , each with variance $1$ . Put “ $Y = S\epsilon$ ”, that is, define random variables $Y_1, \ldots, Y_n$ by

$Y_i = \sum_j S_{i j} \epsilon_j.$

Then it’s easy to show that $Y_1, \ldots, Y_n$ have correlation matrix $R$ .

(You can, if you want, add a constant $\mu_i$ to each $Y_i$ ; that doesn’t change their correlation matrix.)

Implicitly, ap used the matrix

$S = \begin{pmatrix} 1 &0 &0 \\ 0 &1 &0 \\ \rho &\rho &\sqrt{1 - 2\rho^2} \end{pmatrix}$

in defining $Y_1$ , $Y_2$ and $Y_3$ . A quick calculation shows that $S S^t$ is indeed the matrix $R$ defined at the start of this comment and in the last example of my post.
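For anyone who would rather check this numerically, here is a sketch. It verifies $S S^t = R$ for one value of $\rho$ and then simulates $Y = \mu + S\epsilon$ . The choices of $\rho = 0.5$ , $\mu = 5$ , normally distributed $\epsilon_i$ and the random seed are all assumptions made for the illustration; only mean $0$ and variance $1$ are actually required of the $\epsilon_i$ .

```python
import numpy as np

rho = 0.5  # any value with 0 <= rho < sqrt(2)/2 should work
R = np.array([[1.0, 0.0, rho],
              [0.0, 1.0, rho],
              [rho, rho, 1.0]])
S = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [rho, rho, np.sqrt(1 - 2 * rho**2)]])

print(np.allclose(S @ S.T, R))  # S S^t reproduces R

# Simulate Y = mu + S eps with independent standard normal eps (an assumption).
rng = np.random.default_rng(0)
eps = rng.standard_normal((3, 200_000))
Y = 5.0 + S @ eps               # each row is one of Y_1, Y_2, Y_3
print(np.corrcoef(Y).round(2))  # empirical correlation matrix, close to R
```

Since $S$ is lower triangular with positive diagonal, it coincides with the Cholesky factor of $R$ , so `np.linalg.cholesky(R)` would produce the same matrix.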

Posted by: Tom Leinster on December 20, 2014 10:08 PM

Re: Effective Sample Size

Hi Tim. In the 2-element example, the weighting is the transpose of $(\frac{1}{1 + \rho}, \frac{1}{1 + \rho})$ , so yes, that varies with $\rho$ . But the best estimator is

$\frac{1}{|R|}(w_1 Y_1 + w_2 Y_2),$

which (by calculation or simply by symmetry) is always $\frac{1}{2}(Y_1 + Y_2)$ , regardless of $\rho$ .

(When you said “if a two-element sample is identically distributed and perfectly anticorrelated, then their sum always gives the mean exactly”, you were out by a factor of 2.)

Knowing that two variables are strongly anticorrelated tells you a great deal, it seems. And surely related to that is that it’s rather hard to think of situations where you would know that variables were strongly anticorrelated.

Posted by: Tom Leinster on December 20, 2014 9:42 PM

Re: Effective Sample Size

Ah, I see that I made the very silly mistake of missing a factor of $\frac 1 {|R|}$ . Thanks for setting me straight.

One thing to notice is that if we drop the assumption that the variables are identically distributed, the power of anticorrelation goes away, intuitively. How much of this whole story survives if we do drop this assumption?

Posted by: Tim Campion on December 22, 2014 6:44 PM

Re: Effective Sample Size

Is it obvious that the actual value of the (anti-)correlation changes its effectiveness as the distributions become different? You can show that the correlation with a signal is maximised/minimised by an equal/negated version of the original signal (respectively). As such, as the distributions become more different, the range of attainable correlation values is reduced. So the different distributions reduce the knowledge “through” how the correlation value behaves; do the distributions of the random variables have any effect other than this?

Posted by: davetweed on December 22, 2014 11:35 PM

Re: Effective Sample Size

Very nice!

If the entries in the correlation matrix are all nonnegative and we take their negative logarithms, do we get a metric space?

Posted by: Mike Shulman on December 20, 2014 6:35 PM

Re: Effective Sample Size

Thanks!

The answer to your question is no. Take the matrix mentioned at the end of the post,

$R = \begin{pmatrix} 1 &0 &\rho \\ 0 &1 &\rho \\ \rho &\rho &1 \end{pmatrix}$

where $0 \lt \rho \lt \sqrt{2}/2$ . This is positive definite and has $1$ s down the diagonal, so it’s a correlation matrix. (Indeed, ap’s comment gives an explicit construction of some random variables that it’s the correlation matrix of.) But if it came from a metric space in the way you describe, it would satisfy a version of the triangle inequality:

$R_{1 2} \geq R_{1 3} R_{3 2},$

which is false. (More intuitively, the “ $0$ ” in the $(1, 2)$ position says that the 1st and 2nd points are infinitely far apart, whereas the “ $\rho$ ”s at $(1, 3)$ and $(2, 3)$ say that both the 1st and 2nd points are at finite distance from the 3rd point.)

Posted by: Tom Leinster on December 20, 2014 9:34 PM

Re: Effective Sample Size

Note that there’s another problem to surmount: the correlation between the random variable $X$ and $X+c$ is 1, so any transformation that maps that to 0 will violate the “distance zero means equal” condition (unless you possibly redefine what equal means).

Posted by: dave tweed on December 20, 2014 9:53 PM

Re: Effective Sample Size

Yes, good point: having correlation $1$ doesn’t mean being identical.

Somewhat relatedly, having correlation $0$ is a much weaker condition than being independent.

The Wikipedia page on uncorrelated random variables has a nice example (which I guess is standard). Let $X$ be distributed uniformly on $[-1, 1]$ and $Y = X^2$ . Then $X$ and $Y$ are not independent, to say the least! But their correlation coefficient is zero.

Roughly, the reason they’re uncorrelated is that an increase in $Y$ is equally likely to have been produced by an increase or a decrease in $X$ . E.g. if we know that $Y$ has changed from $0.3$ to $0.31$ , then that means that $X$ has either changed from ${}^+\sqrt{0.3}$ to ${}^+\sqrt{0.31}$ or changed from ${}^-\sqrt{0.3}$ to ${}^-\sqrt{0.31}$ , and the two possibilities are equally probable.
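The same conclusion drops out of a one-line calculation. Since $X$ is symmetric about $0$ , its odd moments vanish, so

$Cov(X, Y) = E(X \cdot X^2) - E(X) E(X^2) = E(X^3) - 0 \cdot \tfrac{1}{3} = 0.$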

Posted by: Tom Leinster on December 20, 2014 10:17 PM

Re: Effective Sample Size

Well, but after Lawvere we all know there’s no reason to demand metric spaces to be skeletal, right? (-:

If in some case we do get a (not necessarily skeletal) metric space, does that say anything interesting about the random variables we started with?

Posted by: Mike Shulman on December 21, 2014 8:25 PM

Re: Effective Sample Size

These examples show that our rather vague intuitive feeling that “positive correlation tends to decrease information content in an experiment” is very far from the truth, even for rather simple normal experiments with three observations.

The way I justified this observation to myself back in the nineties was that when the variables are correlated to an unknown degree, there is actual information hidden in the difference between a sampled value of a variable and the expected value based on the assumed correlation and the sampled values of other variables. In the limit when the correlation is 1.0, any deviation at all would produce a numerically infinite information value given that the sampled value is supposedly impossible.

In such cases I found it made intuitive sense to treat such extra information as pertaining to the correlation itself and tweak that to minimize the effect.

Posted by: Jouni Kosonen on December 20, 2014 11:30 PM

Re: Effective Sample Size

I’ve been turning this over in my mind in the last 24 hours or so, and I think I kind of get what you mean, but it’s fuzzy.

One point is that we don’t see this effect with two positively correlated variables. There, the effective sample size is $2/(1 + \rho)$ , where $\rho$ is the correlation coefficient. This decreases as $\rho \to 1$ .

Any explanation needs to account for why the effect isn’t seen until $n = 3$ . Do you have an intuition as to why that is?

Posted by: Tom Leinster on December 21, 2014 8:13 PM

Re: Effective Sample Size

Sorry for the long delay, I forgot I actually posted that.

Any explanation needs to account for why the effect isn’t seen until $n = 3$ . Do you have an intuition as to why that is?

An intuition, nothing more. For two points in a metric manifold, a single number is sufficient to represent the distance between two points. For three points, the sum of pairwise distances (the perimeter of the triangle) can be used in the same way but this ends up ignoring the described area that carries information about the separation of the points as well. For four or more points the informational value of the single scalar drops more as the dimensionality of the ignored information rises.

I posit that the underlying assumption that a real-valued correlation factor is a good choice for three or more variables is false and loses information about the nature of the correlation itself.

Posted by: Jouni Kosonen on January 24, 2015 12:04 PM

Re: Effective Sample Size

I was thinking about an effective-sample-size-like notion the other day.

Shine a laser at a rough surface and you see a speckle pattern like this.

The intensity at each point can be modelled as the sum of many Gaussian variables. But if you look at the intensity at point A close to point B they are correlated. The distance from A to B has to be the size of a couple of speckle “lumps” before the correlation is small. So if you’re looking at some area with a speckle pattern on it, it makes intuitive sense to talk of an effective number of independent variables per unit area underlying that pattern. I’m not sure if this can be carried through rigorously but it seems related to what you’re talking about.

One reason I mention this is that you can think of speckle as emerging from a Feynman path integral. The speckle pattern arises from the statistics of summing over many paths from light source to surface to eye, each with a different phase. So this may connect back to notions of size mentioned way back on the n-category cafe.

Posted by: Dan Piponi on December 22, 2014 10:39 PM

Making the story fit the math

In your analysis you require the variables $Y_1,\ldots,Y_n$ to be identically distributed to the distribution of interest. To make the Elbonia surveying story fit this assumption, you’d have to send each of your surveyors to a randomly chosen region of the country, but in such a manner that the probability of a region getting a surveyor is proportional to the region’s population. (Otherwise people from regions of low population density would exert an undue influence on the results.) Then each surveyor would be instructed to measure a number of people in their assigned region (presumably with known correlation coefficients among those measurements).
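The population-proportional assignment described above can be sketched in a few lines of Python. This is only an illustration: the region names and population figures are hypothetical, and `assign_surveyors` is a made-up helper, not part of the post.

```python
import random
from collections import Counter

# Hypothetical regions and populations, invented purely for illustration.
populations = {"North": 500_000, "Coast": 1_500_000, "Capital": 3_000_000}


def assign_surveyors(populations, n_surveyors, rng):
    """Pick a region for each surveyor with probability proportional to population."""
    regions = list(populations)
    weights = [populations[r] for r in regions]
    # random.Random.choices samples with replacement according to the weights.
    return rng.choices(regions, weights=weights, k=n_surveyors)


rng = random.Random(0)
assignment = assign_surveyors(populations, 10, rng)
print(Counter(assignment))
```

Sampling with replacement means a populous region can receive several surveyors, which is what keeps sparsely populated regions from exerting an undue influence on the results.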

Posted by: Axel Boldt on December 24, 2014 7:58 PM

Re: Making the story fit the math

Actually, I asked a bit more than I needed. It would have been enough to ask that $Y_1, \ldots, Y_n$ have the same mean and variance. (The latter condition goes by the superb name of homoscedasticity, I recently learned.) But I’m not sure that makes a substantial difference.

Posted by: Tom Leinster on December 29, 2014 12:23 AM

Re: Effective Sample Size

I just got around to reading this post. I hope to find the time to give it more thought sometime soon, but in the meantime I have a comment on one small part:

“You might expect the effective size of a sample of $n$ individuals to be at most $n$. It’s not.”

Personally, I wouldn’t expect this. Here’s why: Saying that the effective sample size is $k$ means that, in some sense, it gives you the same amount of information about the underlying distribution as a sample of $k$ independent individuals. The thing is, independent samples are by no means the best possible for learning a distribution. It’s better if each individual strikes some balance between being typical and being as different as possible from the previously sampled individuals. (The precise meanings of “better”, “some balance”, “typical”, and “as different as possible” all depend on each other, of course.)

For example, say $Y$ is uniformly distributed in $\{1, \ldots, N\}$. A best possible sample would be if $(Y_1, \ldots, Y_N)$ is a uniformly chosen permutation of $\{1, \ldots, N\}$. These are very much not independent. Coming at this from the opposite direction, if $Y_1, \ldots, Y_n$ are independent and uniformly chosen from $\{1, \ldots, N\}$, then you need $n \approx N \log N$ even to expect to see all the $N$ possible values of this distribution. (This is a classic problem in probability called the coupon collector’s problem.)
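To make the $N \log N$ estimate concrete, here is a small Python sketch. It uses the standard fact that the expected number of independent uniform draws needed to see all $N$ values is $N H_N$, where $H_N$ is the $N$th harmonic number, and compares that with $N \log N$ and with a simulation. The function names and the choices $N = 50$ and 2000 trials are arbitrary illustrations.

```python
import math
import random


def expected_draws(N):
    """Exact expected number of draws to see all N values: N * H_N."""
    return N * sum(1.0 / k for k in range(1, N + 1))


def simulate_draws(N, rng):
    """Draw independently and uniformly from N values until all have appeared."""
    seen = set()
    draws = 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        draws += 1
    return draws


rng = random.Random(0)
N = 50
trials = 2000
average = sum(simulate_draws(N, rng) for _ in range(trials)) / trials

# Exact expectation, the leading-order approximation, and the simulated average.
print(expected_draws(N), N * math.log(N), average)
```

By contrast, a uniformly chosen permutation sees all $N$ values in exactly $N$ draws.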

Posted by: Mark Meckes on December 29, 2014 7:25 PM

Re: Effective Sample Size

Isn’t it great how trainable intuition is? Isn’t it great talking to other people whose intuition is trained in directions that your own isn’t?

Your mathematical point reminds me of the following story. In the early days of the iPod, Apple were inundated with complaints that the shuffle function wasn’t truly random. Everyone kept telling them how songs by the same artiste would clump together: one Madonna song would usually be followed by another, and so on.

They had their technical people check the algorithm, and it turned out that nothing was wrong with it. All that was wrong was people’s perception of randomness. So they changed the algorithm to forbid clumping — making it less random in order to persuade humans that it was more random.

Posted by: Tom Leinster on December 30, 2014 4:39 PM

Re: Effective Sample Size

Among other things, there’s a terminological problem highlighted by the anecdote about iPods, and my sympathies lie more with the users.

Strictly speaking, any way of picking something is random — even a constant is a random variable, albeit a boring one. (As usual, xkcd has a great comment on this issue.) The trouble is that many people, including many professional probabilists who secretly know better, use “random” to mean something much stronger. Typically, a probabilist will say that $X$ is “random” in a set $\Omega$ if $X$ is uniformly distributed in $\Omega$ (assuming we’re in a context in which that even means anything), and that a sequence $X_1, X_2, \ldots$ is “random” if it is a sequence of independent (and uniform, if applicable) random variables.

Now independent sequences of random variables are a reasonable model of many real-world phenomena, and it’s true that people have very poor intuition about how such sequences behave. In particular, people underestimate how common clumping is. Among other things, this contributes to people’s tendency to ascribe winning streaks in sports or gambling to something other than a perfectly ordinary side effect of randomness. (I understand that careful studies by statisticians of sports statistics have found that “hot streaks”, about which many professional athletes have cherished superstitions, happen about as often and last about as long as independent-random-variable models would predict.)

On the other hand, this by no means means that a “random” selection of songs ought to be chosen with independent picks. It’s perfectly reasonable that a shuffle function ought to behave in a way that matches users’ intuition about randomness better than independent random variables. To make a semi-concrete proposal, if $X_1, X_2, \ldots$ are the song choices, a good shuffle algorithm ought to result in the empirical measures $\frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ being good approximations of the uniform measure for large $n$. In fact the classical Glivenko–Cantelli theorem says that this will be the case for independent picks, but the approximation will not be the best possible.
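A rough Python sketch of this comparison follows. It measures how far the empirical measure of the first $n$ picks is from uniform, using the largest deviation of any song's frequency from $1/N$. The "reshuffle the whole library" scheme is just one simple non-independent alternative chosen for illustration; it is an assumption of this sketch, not a description of any actual shuffle implementation.

```python
import random


def max_deviation(picks, N):
    """Largest gap between a song's empirical frequency and the uniform value 1/N."""
    n = len(picks)
    counts = [0] * N
    for x in picks:
        counts[x] += 1
    return max(abs(c / n - 1 / N) for c in counts)


def independent_picks(N, n, rng):
    """Each pick is independent and uniform over the N songs."""
    return [rng.randrange(N) for _ in range(n)]


def reshuffled_picks(N, n, rng):
    """Play the whole library in a fresh random order, repeating until n picks are made."""
    picks = []
    while len(picks) < n:
        order = list(range(N))
        rng.shuffle(order)
        picks.extend(order)
    return picks[:n]


rng = random.Random(1)
N, n = 20, 200
print(max_deviation(independent_picks(N, n, rng), N))
print(max_deviation(reshuffled_picks(N, n, rng), N))
```

Whenever $n$ is a multiple of $N$, the reshuffled scheme has played every song equally often, so its empirical measure is exactly uniform, while independent picks only approach uniformity gradually.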

So from my point of view, the initial choice of an algorithm that chose successive tracks independently was a design flaw, albeit one that would probably be made by any other company.

Posted by: Mark Meckes on January 2, 2015 1:20 AM

Re: Effective Sample Size

Related to this point, I’d be very interested to find some perspective that makes sense of the possibility that a metric space has magnitude greater than its cardinality. Thinking about that might help clarify what the magnitude of a metric space means.

Posted by: Mark Meckes on January 3, 2015 5:01 PM

Re: Effective Sample Size

OK, so what we could do is:

take a positive definite metric space $X$ with magnitude larger than its cardinality (such as $0.35 K_{3, 2}$, where $K_{3, 2}$ is a complete bipartite graph, as in Example 2.4.11 of ‘The magnitude of metric spaces’);
work out some string of $n$ random variables whose correlation matrix is the similarity matrix of $X$ (which we know is possible);
understand why the effective sample size represented by those $n$ random variables is greater than $n$;
use that understanding to improve our understanding of why metric magnitude can be greater than cardinality.

In the example I cited, the phenomenon of magnitude greater than cardinality only shows up at a very narrow range of scales. Specifically, it’s only for scale factors inside the range 0.345 to 0.355. So understanding why it happens at all may be difficult.

Nevertheless, it might be possible. As you know, what’s going on here is that the magnitude function $t \mapsto |t K_{3, 2}|$ has a singularity at $t = \log(2)/2 \approx 0.347$. Just to the left of that singularity, the magnitude tends to $-\infty$, and just to the right, it tends to $+\infty$.
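The behaviour near the singularity can be explored numerically with a short Python sketch. It assumes the usual conventions: $K_{3, 2}$ carries the shortest-path metric (distance 1 across the two sides, 2 within a side), the similarity matrix of $t K_{3, 2}$ has entries $e^{-t d(x, y)}$, and the magnitude is the sum of the entries of its inverse. The helper names and the sampled values of $t$ are illustrative choices only.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def k32_distances():
    """Shortest-path metric on K_{3,2}: 1 across the two sides, 2 within a side."""
    side = [0, 0, 0, 1, 1]
    return [[0 if i == j else (2 if side[i] == side[j] else 1)
             for j in range(5)] for i in range(5)]


def magnitude(t):
    """Magnitude of t * K_{3,2}: sum of the entries of the inverse similarity matrix."""
    Z = [[math.exp(-t * d) for d in row] for row in k32_distances()]
    return sum(solve(Z, [1.0] * 5))


print("singularity near t =", math.log(2) / 2)
for t in (0.30, 0.34, 0.346, 0.347, 0.35, 0.36, 0.40):
    print(f"t = {t:.3f}  magnitude = {magnitude(t):.3f}")
```

Comparing the printed values with the cardinality 5 shows how narrow the window of scale factors with magnitude greater than cardinality is.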

Posted by: Tom Leinster on January 6, 2015 8:52 PM

Re: Effective Sample Size

Four years late but I just noticed no comments mention antithetic sampling. Well, now one does.
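For readers who haven't met it, here is a minimal Python sketch of antithetic sampling, under assumptions chosen only for illustration: the target is the mean of $e^U$ for $U$ uniform on $[0, 1]$, and each estimate averages a pair of evaluations. The antithetic pair uses $U$ and $1 - U$, which are negatively correlated, so by the $2/(1 + \rho)$ formula mentioned earlier in the thread the pair should behave like more than two independent samples.

```python
import math
import random
import statistics


def independent_pair_mean(f, rng):
    """Average f over two independent uniform draws."""
    return (f(rng.random()) + f(rng.random())) / 2


def antithetic_pair_mean(f, rng):
    """Average f over one uniform draw u and its antithetic partner 1 - u."""
    u = rng.random()
    return (f(u) + f(1 - u)) / 2


rng = random.Random(0)
trials = 20000
indep = [independent_pair_mean(math.exp, rng) for _ in range(trials)]
anti = [antithetic_pair_mean(math.exp, rng) for _ in range(trials)]

# Both estimate the same mean (e - 1); compare the spread of the two estimators.
print(statistics.mean(indep), statistics.variance(indep))
print(statistics.mean(anti), statistics.variance(anti))
```

The ratio of the two printed variances gives an empirical sense of how many independent samples one antithetic pair is worth in this particular example.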

Posted by: Dan Piponi on November 28, 2018 12:09 AM

Read the post 100 Papers on Magnitude
Weblog: The n-Category Café
Excerpt: To celebrate the 100th paper on magnitude, a quick rundown of what's happening in the world of magnitude and which areas are undeservedly underexplored
Tracked: June 14, 2024 11:24 PM
