Abstract

This case study illustrates the so-called “curse of dimensionality,” starting from scratch and using simple examples based on simulation. For example, by generating points uniformly at random in a unit hypercube, it is easy to see how the average distance between random points increases with dimensionality. Similarly, such points increasingly tend to fall outside the hypersphere inscribed within the unit hypercube, showing that in higher dimensions, almost all of the volume is in the corners of the hypercube.

Similarly, generating standard normal variates (zero mean, unit standard deviation) leads to draws that concentrate in a thin shell of increasing distance from the mean as dimensionality increases. The squared distance of a standard normal draw from the mode follows a chi-square distribution with degrees of freedom equal to the dimensionality. This allows the precise bounds of the thin shell to be calculated using tail statistics, showing just how unlikely it is to get near the mode with a random draw from a standard multivariate normal.

These properties of the standard normal distribution provide a segue into discussing the important information-theoretic concept of the typical set for a probability distribution, which is roughly defined as the central log density band into which almost all random draws from that distribution will fall. The rather surprising result is that for a normal distribution in high dimensions, the typical set does not include the volume around the mode. The density is highest around the mode, but the volume there is low, and the probability mass is determined as the product of the density and the volume (or more precisely, the integral of the density over the volume).

The standard normal log density is just negative squared Euclidean distance, which provides a straightforward demonstration of why the maximum likelihood estimate for a normal regression is identical to the least squares solution.


1. Euclidean Length and Distance

Euclidean Length

Consider a vector \(y = (y_1, \ldots, y_N)\) with \(N\) elements (we call such vectors \(N\)-vectors). The Euclidean length of a vector \(y\) is written as \(|\!| y |\!|\), and defined by generalizing the Pythagorean theorem,

\[ |\!| y |\!| \ = \ \sqrt{y_1^2 + y_2^2 + \cdots + y_N^2}. \]

In words, the length of a vector is the square root of the sum of the squares of its elements.

If we take \(y = (y_1, \ldots, y_N)\) to be a row vector, then we see that the dot product of a vector with itself is its squared length, so that

\[ |\!| y |\!| = \sqrt{y \, y^{\top}}. \]

Calculating Vector Length in R

The following function computes vector length in R.

euclidean_length <- function(u) sqrt(sum(u * u));

In R, the operator * works elementwise rather than as vector or matrix multiplication. To test the function on a simple case, we can verify the first example of the Pythagorean theorem everyone learns in school, namely \(|\!| (3, 4) |\!| = 5\).

euclidean_length(c(3, 4));
[1] 5

In R, the function length() returns the number of elements in a vector rather than the vector’s Euclidean length.

length(c(3, 4));
[1] 2
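
Because the dot product of a vector with itself is the sum of its squared elements, we can also sanity-check the dot-product formula for length given above against euclidean_length(). This is just a quick check, not part of the original development; drop() converts the resulting one-by-one matrix to a scalar.

y <- c(3, 4);
sqrt(drop(y %*% y));   # dot-product form of the length; should be 5
euclidean_length(y);   # sum-of-squares form; also 5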

Euclidean Distance

The Euclidean distance between two \(N\)-vectors, \(x = (x_1, \ldots, x_N)\) and \(y = (y_1, \ldots, y_N)\), written \(\mathrm{d}(x,y)\), is the Euclidean length of the vector \(x - y\) connecting them,

\[ \mathrm{d}(x, y) \ = \ |\!| x - y |\!| \ = \ \sqrt{(x_1 - y_1)^2 + \cdots + (x_N - y_N)^2}. \]
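
With euclidean_length() in hand, distance is a one-liner: the length of the difference vector. The helper name euclidean_distance() is our own here and is not used later in the case study.

euclidean_distance <- function(x, y) euclidean_length(x - y);
euclidean_distance(c(0, 0), c(3, 4));   # should be 5, matching the earlier example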

2. All of the Volume is in the Corners

Suppose we have a square and inscribe a circle in it, or that we have a cube and a sphere inscribed in it. When we extend this construction to higher dimensions, we get hyperspheres inscribed in hypercubes. This section illustrates the curious fact that as the dimensionality grows, most of the points in the hypercube lie outside the inscribed hypersphere.

Suppose we have an \(N\)-dimensional hypercube, with unit-length sides centered around the origin \(\mathbf{0} = (0, \ldots, 0)\). The hypercube will have \(2^N\) corners at the points \(\left( \pm \frac{1}{2}, \ldots, \pm \frac{1}{2} \right)\). Because its sides are length 1, it will also have unit volume, because \(1^N = 1\).

If \(N=1\), the hypercube is a line from \(-\frac{1}{2}\) to \(\frac{1}{2}\) of unit length (i.e., length 1). If \(N=2\), the hypercube is a square of unit area with opposite corners at \(\left( -\frac{1}{2}, -\frac{1}{2} \right)\) and \(\left( \frac{1}{2}, \frac{1}{2} \right)\). With \(N=3\), we have a cube of unit volume, with opposite corners at \(\left( -\frac{1}{2}, -\frac{1}{2}, -\frac{1}{2} \right)\) and \(\left( \frac{1}{2}, \frac{1}{2}, \frac{1}{2} \right)\). And so on up the dimensions.

Now consider the biggest hypersphere you can inscribe in the hypercube. It will be centered at the origin and have a radius of \(\frac{1}{2}\) so that it extends to the sides of the hypercube. A point \(y\) is within this hypersphere if its distance to the origin is less than the radius, or in symbols, if \(|\!|y|\!| < \frac{1}{2}\). Topologically speaking, we have defined what is known as an open ball, i.e., the set of points within a hypersphere excluding the limit points at distance exactly \(\frac{1}{2}\). We could just as well have worked with closed balls, which include the limit points making up the surface of the ball, because that surface (a hypersphere) has zero volume.

In one dimension, the hypersphere is just the line from \(-\frac{1}{2}\) to \(\frac{1}{2}\) and contains the entire hypercube. In two dimensions, the hypersphere is a circle of radius \(\frac{1}{2}\) centered at the origin and extending to the center of all four sides. In three dimensions, it’s a sphere that just fits in the cube, extending to the middle of all six sides.

Figure: The Monte Carlo Casino. From Wikipedia, licensed CC BY 2.5.

The Monte Carlo Method for Integration

We know the volume of the unit hypercube is one, but what is the volume inside the inscribed hypersphere? You may have learned how to define an integral to calculate the answer for a ball of radius \(r\) in two dimensions as \(\pi r^2\), and may even recall that in three dimensions it’s \(\frac{4}{3}\pi r^3\). But what about higher dimensions?

We could evaluate harder and harder multiple integrals, but we’ll instead use the need to solve these integrals as an opportunity to introduce the fundamental machinery of using sampling to calculate integrals. Such methods are called “Monte Carlo methods” because they involve random quantities and there is a famous casino in Monte Carlo. They are at the very heart of modern statistical computing (Bayesian and otherwise).

Monte Carlo integration allows us to calculate the value of a definite integral by averaging draws from a random number generator. (Technically, the random number generators we have on computers, like those used in R, are pseudorandom number generators in the sense that they are deterministic under the hood; for the sake of argument, we assume they are random enough for our purposes in much the way we assume the functions we’re dealing with are smooth enough.)

To use Monte Carlo methods to calculate the volume within a hypersphere inscribed in a hypercube, we need only generate draws at random in the hypercube and count the number of draws that fall in the hypersphere; our answer is the proportion of draws that fall in the hypersphere. That’s because we deliberately constructed a hypercube of unit volume; in general, we need to multiply by the volume of the set over which we are generating uniform random draws.

As the number of draws increases, the estimated volume converges to the true volume. Because the draws are i.i.d., it follows from the central limit theorem that the error goes down at a rate of \(\mathcal{O}\left( 1 / \sqrt{n} \right)\). That means each additional decimal place of accuracy requires multiplying the sample size by one hundred. We can get rough answers with Monte Carlo methods, but many decimal places of accuracy require a prohibitive number of simulation draws.

We can draw the points at random from the hypercube by drawing each component independently according to

\[ y_n \sim \mathsf{Uniform}\left(-\frac{1}{2}, \frac{1}{2}\right). \] Then we count the proportion of draws that lie within the hypersphere. Recall that a point \(y\) lies in the hypersphere if \(|\!| y |\!| < \frac{1}{2}\).

Using Monte Carlo methods to compute π

We’ll first look at the case where \(N = 2\), just to make sure we get the right answer. We know the area inside the inscribed circle is \(\pi r^2\), so with \(r = \frac{1}{2}\), that’s \(\frac{\pi}{4}\). Let’s see if we get the right result.

N <- 2;
M <- 1e6;
y <- matrix(runif(M * N, -0.5, 0.5), M, N);         # M uniform draws in the unit square
p <- sum(sqrt(y[ , 1]^2 + y[ , 2]^2) < 0.5) / M;    # proportion falling inside the inscribed circle
print(4 * p, digits=3);
[1] 3.14
print(pi, digits=3);
[1] 3.14
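
To see the \(\mathcal{O}\left( 1 / \sqrt{n} \right)\) behavior discussed above in action, we can rerun the same estimator with increasing numbers of draws; this is just an illustrative sketch, and the exact errors will vary from run to run because the draws are random.

for (M in c(1e2, 1e4, 1e6)) {
  y <- matrix(runif(M * 2, -0.5, 0.5), M, 2);
  pi_hat <- 4 * sum(sqrt(y[ , 1]^2 + y[ , 2]^2) < 0.5) / M;
  # the error should shrink by roughly a factor of ten per hundredfold increase in M
  cat("M =", M, " estimate =", pi_hat, " error =", abs(pi_hat - pi), "\n");
}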

Now, let’s generalize and calculate the volume of the hypersphere inscribed in the unit hypercube (which has unit volume by construction) for increasing dimensions.

M <- 1e5;
N_MAX <- 10;
Pr_inside <- rep(NA, N_MAX);
for (N in 1:N_MAX) {
  # M uniform draws from the N-dimensional unit hypercube
  y <- matrix(runif(M * N, -0.5, 0.5), M, N);
  # count how many draws fall inside the inscribed hyperball of radius 1/2
  inside <- 0;
  for (m in 1:M) {
    if (euclidean_length(y[m,]) < 0.5) {
      inside <- inside + 1;
    }
  }
  Pr_inside[N] <- inside / M;
}
df <- data.frame(dims = 1:N_MAX, volume = Pr_inside);
print(df, digits=1);
   dims volume
1     1  1.000
2     2  0.784
3     3  0.524
4     4  0.305
5     5  0.165
6     6  0.080
7     7  0.036
8     8  0.016
9     9  0.007
10   10  0.002

Although we actually calculate the probability that a point drawn at random is inside the hypersphere inscribed in the unit hypercube, this quantity gives the volume inside the inscribed hypersphere.

Here’s the result as a plot.

library(ggplot2);
plot_corners <-
  ggplot(df, aes(x = dims, y = volume)) +
  scale_x_continuous(breaks=c(1, 3, 5, 7, 9)) +
  geom_line(colour="gray") +
  geom_point() +
  ylab("volume of inscribed hyperball") + 
  xlab("dimensions") +
  ggtitle("Volume of Hyperball Inscribed in Unit Hypercube")
plot_corners;

All points are far away from each other

Another way of looking at this is that in higher dimensions, the points on average become further and further away from the center of the hypercube. They also become further and further away from each other (see the exercises). The volume of the hypersphere is the proportion of points in the hypercube that are within distance \(\frac{1}{2}\) from the center of the hypercube; with increasing dimension, vanishingly few points are within distance \(\frac{1}{2}\) of the center of the hypercube.

As Bernhard Schölkopf said on Twitter, “a high-dimensional space is a lonely place.” What he meant by this is that not only are the points increasingly far from the mean on average, they are also increasingly far from each other. As Michael Betancourt noted in a comment on the pull request for this case study, “Take two points that are translated from each other by 1 unit in each direction — the distance between them grows as \(\sqrt{N}\) so as the dimension grows the points appear further from each other, with more and more volume filling up the space between them.”
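
The claim that points become increasingly far from each other is easy to check by simulation; the following is just one possible sketch (it is essentially the exercise mentioned above). We draw pairs of independent uniform points in the unit hypercube and average the distance between them; the average grows with dimension, roughly in proportion to \(\sqrt{N}\).

M <- 1e4;
avg_distance <- rep(NA, 10);
for (N in 1:10) {
  x <- matrix(runif(M * N, -0.5, 0.5), M, N);
  y <- matrix(runif(M * N, -0.5, 0.5), M, N);
  # mean Euclidean distance between M random pairs of points in N dimensions
  avg_distance[N] <- mean(sqrt(rowSums((x - y)^2)));
}
print(avg_distance, digits = 2);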

3. Typical Sets

Roughly speaking, the typical set is where draws from a given distribution tend to lie. Not only does this volume cover most of the mass of the distribution, its members also reflect the properties of random draws from that distribution. What we will see in this section is that the typical set is usually nowhere near the mode of the distribution, even for a multivariate normal (where the mode is the same as the mean).

A Discrete Example of Typicality

Typicality is easiest to illustrate with a binomial example. Consider a binary trial with an eighty percent chance of success (i.e., draws from a \(\mathsf{Bernoulli}(0.80)\) distribution). Now consider repeating such a trial one hundred times, drawing \(y_1, \ldots, y_{100}\) independently according to

\[ y_n \sim \mathsf{Bernoulli}(0.8). \]

Two questions frame the discussion:

  1. What is the most likely value for \(y_1, \ldots, y_{100}\)?

  2. How many \(y_n\) do you expect to be 1?

Let’s answer the second question first. If you have an 80% chance of success and make 100 attempts, the expected number of successes is just 80 (in general, it’s the probability of success times the number of attempts).

Then why is the most likely outcome 100 straight successes? The probability of 100 straight successes is \(0.8^{100}\) or about \(2 \times 10^{-10}\).

In contrast, the probability of any given sequence with 80 successes and 20 failures is only \(0.8^{80} \, 0.2^{20}\), or about \(2 \times 10^{-22}\). That makes 100 straight successes a whopping \(10^{12}\) times more probable than any specific sequence with 80 successes and 20 failures!

On the other hand, there are a lot of sequences with 80 successes and 20 failures, a total of \(\binom{100}{80}\) of them to be exact (around \(10^{20}\)). So even though any given sequence of 80 successes and 20 failures is relatively improbable, there are so many such sequences that their overall mass is higher than that of the unique sequence with 100 successes, because \(10^{20} \times 2 \times 10^{-22}\) is a lot bigger than \(1 \times 2 \times 10^{-10}\). The same goes for sequences with 81 successes; each such sequence is more probable than any sequence with 80 successes, but \(\binom{100}{81}\) is smaller than \(\binom{100}{80}\).
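
These back-of-the-envelope numbers are easy to verify in R (a quick check, not part of the original exposition).

0.8^100;                            # prob. of 100 straight successes, about 2e-10
0.8^80 * 0.2^20;                    # prob. of one specific 80-success sequence, about 2e-22
choose(100, 80);                    # number of such sequences, around 5e20
choose(100, 80) * 0.8^80 * 0.2^20;  # total mass of all 80-success sequences, about 0.1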

The binomial distribution is the distribution of counts, so that if \(y_1, \ldots, y_N\) is such that each \(y_n \sim \mathsf{Bernoulli}(\theta)\), then

\[ (y_1 + \cdots + y_N) \sim \mathsf{Binomial}(N, \theta). \]

The binomial aggregates the multiple trials by multiplying through by the possible ways in which \(y\) successes can arise in \(N\) trials, namely

\[ \mathsf{Binomial}(y \mid N, \theta) = \binom{N}{y} \, \theta^y \, (1 - \theta)^{(N - y)}, \]

where the binomial coefficient that normalizes the distribution is defined as the number of binary sequences of length \(N\) that contain exactly \(y\) elements equal to 1,

\[ \binom{N}{y} = \frac{N!}{y! \, (N - y)!}. \]

In words, \(\mathsf{Binomial}(y \mid N, \theta)\) is the probability of \(y\) successes in \(N\) independent Bernoulli trials if each trial has a chance of success \(\theta\).
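
R’s built-in dbinom() implements exactly this mass function, so we can check the formula directly; this is a small aside, not part of the original development.

dbinom(80, 100, 0.8);                # binomial probability of 80 successes in 100 trials
choose(100, 80) * 0.8^80 * 0.2^20;   # the same value from the explicit formula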

To make sure we’re right in expecting around 80 successes, we can simulate a million binomial outcomes from 100 trials with an 80% chance of success as

z <- rbinom(1e6, 100, 0.8);

with a summary

summary(z);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      59      77      80      80      83      96

and 99% interval

quantile(z, probs=c(0.005, 0.995));
##  0.5% 99.5% 
##    69    90

The most probable outcome as a sequence of Bernoulli trials, namely \((1, 1, \ldots, 1)\), has a very improbable number of successes, namely \(N\). A much more typical number of successes is 80. In fact, 100 isn’t even in the 99% interval of 100 trials with an 80% success rate. We can see that analytically using the quantile function (the inverse cumulative distribution function). Suppose we have a random variable \(Y\) with mass or density function \(p_Y(y)\). Then the cumulative distribution function (CDF) is defined by

\[ F_Y(u) \ = \ \mbox{Pr}[Y \leq u]. \]

For discrete probability (mass) functions, this works out to

\[ F_Y(u) \ = \ \sum_{y \, = \, -\infty}^u p_Y(y), \]

and for continuous probability (density) functions,

\[ F_Y(u) = \int_{-\infty}^u p_Y(y) \, \mbox{d}y. \]

What we are going to need is the inverse CDF, \(F_Y^{-1}\), or quantile function, which maps a quantile \(\alpha \in (0, 1)\) to the value \(y\) such that \(\mbox{Pr}[Y \leq y] = \alpha\) (this needs to be rounded in the discrete case to deal with the fact that the inverse CDF is only technically defined for a countable number of quantiles).

Luckily, this is all built into R, and we can calculate the quantiles that bound the central 99.9999% interval,

qbinom(c(0.0000005, 0.9999995), 100, 0.8)
## [1] 59 97

This tells us that 99.9999% of the draws (i.e., 999,999 out of 1,000,000 draws) from \(\mathsf{Binomial}(100, 0.8)\) lie in the range \((59, 97)\). This demonstrates just how atypical the most probable sequences are.
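
As a footnote to the rounding caveat above, pbinom() is the binomial CDF and qbinom() its (rounded) inverse, so a round trip through them recovers the original count; this quick check is not part of the original text.

pbinom(80, 100, 0.8);                     # CDF evaluated at 80 successes
qbinom(pbinom(80, 100, 0.8), 100, 0.8);   # inverse CDF recovers 80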

We next reproduce a graphical illustration from David J. C. MacKay’s Information Theory, Inference, and Learning Algorithms (Cambridge, 2003, section 4.4) of how the typical set of the binomial arises as a product of

  1. the Bernoulli trial probability of a sequence with a given number of successes, and

  2. the number of sequences with that many successes.

Suppose we have a binary sequence of \(N\) elements, \(z = z_1, \ldots, z_N\), with \(z_n \in \{ 0, 1 \}\) and let

\[ y = \sum_{n=1}^N z_n \]

be the total number of successes. The repeated Bernoulli trial probability of \(z\) with a chance of success \(\theta \in [0, 1]\) is given by the probability mass function

\[ p(z \mid \theta) \ = \ \prod_{n=1}^N \mathsf{Bernoulli}(z_n \mid \theta) \ = \ \theta^y \, (1 - \theta)^{(N - y)}. \]

The number of ways in which a binary sequence with a total of \(y\) successes out of \(N\) trials can arise is the total number of ways a subset of \(y\) elements may be chosen out of a set of \(N\) elements, which is given by the binomial coefficient,

\[ \binom{N}{y} = \frac{N!}{y! \, (N - y)!}. \]

The following plots show how the binomial probability mass function arises as a product of the binomial coefficient and the probability of a single sequence with a given number of successes. The left column contains plots on the linear scale and the right column plots on the log scale. The top plots are of the binomial coefficient, \(\binom{N}{y}\). Below that is the plot of the probability of a single sequence with \(y\) successes, namely \(\theta^y \, (1 - \theta)^{N-y}\). Below both of these plots is their product, the binomial probability mass function \(\mathsf{Binomial}(y \mid N, \theta)\). There are three sequences of plots, for \(N = 25\), \(N = 100\), and \(N = 400\).

Just to get a sense of scale, there are \(2^{25}\) (order \(10^7\)) binary sequences of length 25, \(2^{100}\) (order \(10^{30}\)) binary sequences of length 100, and \(2^{400}\) (order \(10^{120}\)) binary sequences of length 400.
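
As a quick check on those orders of magnitude (computed on the log10 scale to keep the numbers readable):

log10(2^c(25, 100, 400));   # roughly 7.5, 30, and 120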

choose_df <- function(Ys, theta=0.2) {
  N <- max(Ys);
  Ns <- rep(N, length(Ys));
  Cs <- choose(N, Ys);                    # number of sequences with y successes
  Ls <- theta^Ys * (1 - theta)^(N - Ys);  # probability of any one such sequence
  Ps <- Cs * Ls;                          # binomial probability of y successes
  data.frame(y = Ys, N = Ns, combos = Cs, L = Ls, P = Ps);
}

choose_plot <- function(df, logy = FALSE) {
  p <-
    ggplot(df, aes(x = y, y = combos)) +
    geom_line(size=0.2, color="darkgray") +
    geom_point(size=0.4) +
    scale_x_continuous() +
    xlab("y");
  if (logy) {
    p <- p + scale_y_log10() +
      ylab("log (N choose y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("sequence log permutations ...");
  } else {
    p <- p + scale_y_continuous() +
      ylab("(N choose y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("sequence permutations ...");
  }
  return(p);
}
seq_plot <- function(df, logy = FALSE) {
  p <-
    ggplot(df, aes(x = y, y = L)) +
    geom_line(size=0.2, color="darkgray") +
    geom_point(size=0.4) +
    scale_x_continuous() +
    xlab("y");
  if (logy) {
    p <- p + scale_y_log10() +
      ylab("log theta^y * (1 - theta)^(N - y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("plus sequence log probability ...");
  } else {
    p <- p + scale_y_continuous() +
      ylab("theta^y * (1 - theta)^(N - y)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("times sequence probability ...");
  }
  return(p);
}
joint_plot <- function(df, logy = FALSE) {
  p <- ggplot(df, aes(x = y, y = P)) +
    geom_line(size = 0.25, color = "darkgray") +
    geom_point(size = 0.25) +
    scale_x_continuous() +
    xlab("y");
  if (logy) {
    p <- p + scale_y_log10() +
      ylab("log binom(y | N, theta)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("equals count log probability");
  } else {
    p <- p + scale_y_continuous() +
      ylab("binom(y | N, theta)") +
      theme(axis.title.y = element_text(size=8),
            axis.text.y=element_blank(),
            axis.ticks.y=element_blank()) +
      ggtitle("equals count probability");
  }
  return(p);
}
library("grid")
library("gridExtra");
plot_all <- function(df) {
  cp <- choose_plot(df, logy = FALSE);
  pp <- seq_plot(df, logy = FALSE);
  jp <- joint_plot(df, logy = FALSE);
  lcp <- choose_plot(df, logy = TRUE);
  lpp <- seq_plot(df, logy = TRUE);
  ljp <- joint_plot(df, logy = TRUE);
  grid.newpage();
  grid.arrange(ggplotGrob(cp), ggplotGrob(lcp),
               ggplotGrob(pp), ggplotGrob(lpp),
               ggplotGrob(jp), ggplotGrob(ljp),
               ncol = 2);
}

df25 <- choose_df(0:25);
plot_all(df25);