Adjusted Sampling Under the Curse of Dimensionality

Overview¶

Various optimization methods rely on iteratively sampling and evaluating potential solutions, with the goal of converging on a globally or locally optimal solution. As a nasty side effect of random sampling in high-dimensional feature space, sampled points tend to bunch up in a narrow region rather than spreading out evenly across the explorable volume. Richard Bellman (of the famous Bellman equation in dynamic programming) called this phenomenon and other related symptoms “The Curse of Dimensionality.”

We will try to visualize this effect and gain an intuition for how it can impact analyses. Then, we will explore a potential solution for handling the issue by rebalancing the sampled space to be more uniform.

Imports and Definitions¶

In [1]:

from matplotlib.gridspec import GridSpec
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm


def get_distances(x: np.ndarray, show_progress: bool = True) -> np.ndarray:
    """Computes the m * ( - 1) / 2 unique pairwise Euclidian distances
    between all m vectors of length n in x, given that x is an (m, n) 
    matrix.
    """
    disable = not show_progress
    ncols = 80
    m = x.shape[0]
    n = x.shape[1]
    size = m * (m - 1) // 2
    r = np.full(shape=(size,), fill_value=np.nan)
    j = 0
    for i in tqdm(range(m-1), ncols=ncols, disable=disable):
        distances = np.sqrt(
            ((x[i, :] - x[i+1:, :])**2).sum(axis=1)
        )
        k = distances.shape[0]
        r[j:j+k] = distances
        j += k
    return r

def compute_cdf(
    X: np.ndarray | float | int, 
    x: np.ndarray  | float | int,
) -> np.ndarray:
    """Computes F(x) = P(X <= x),the probability of X being less
    than or equal to x. If X is (m,) and x is (n,), returns
    an array of length n.
    """
    return (np.array(X).reshape(-1, 1) <= np.array(x).reshape(1, -1)).mean(axis=0)

def compute_inverse_cdf(
    X: np.ndarray | float | int, 
    p: np.ndarray | float | int,
) -> np.ndarray:
    """Computes the quantile F_INV(p) = x, in other words, the value
    of x that satisfies F(x) = P(X <= x) = p.
    """
    return np.quantile(np.array(X), np.array(p))


MAX_GIGABYTES = 2.0
BYTES_PER_ELEMENT = 8
MAX_BYTES = MAX_GIGABYTES * 1e9

Observing the Problem¶

In [2]:

seed = 42  # Random number generator seed
dimension_vec = [2**exp for exp in [1, 3, 5, 7]]  # Dims to explore
n_plot_pts = int(1e3)  # Number of points on scatter plot
ax_width = 2  # Number of grid rows occupied by all but the bottom plots
q = np.arange(0.1, 1.01, 0.01)  # Quantiles to plot

# Parameters for Normal distributions
mu = 0.0
std = 1.0

# Parameters for Uniform distributions
uniform_low = -1
uniform_high = 1

# Parameters defining boundaries of plots
xlim0 = (0, 15)
xlim1 = (0, 20)
xlim2 = (-2, 2)
ylim2 = (-2, 2)

Sampling Features from Standard Normal Distributions¶

In [3]:

dist_params = {'loc': mu, 'scale': std}

fig = plt.figure(figsize=(12, 9), constrained_layout=True)
gs = GridSpec(9, len(dimension_vec), figure=fig)

for idx, dimensions in enumerate(dimension_vec):
    m_dim = int(MAX_BYTES / (BYTES_PER_ELEMENT * dimensions))
    m_dist = int(np.sqrt(2 * MAX_BYTES / BYTES_PER_ELEMENT)) # Approximate boundary
    m_sims = min(m_dim, m_dist)
    
    rng = np.random.default_rng(seed=seed)
    x = rng.normal(**dist_params, size=(m_sims, dimensions))  # Sample from standard normal
    r = np.sqrt((x**2).sum(axis=1))
    dists = get_distances(x)
    
    rho = rng.normal(loc=r.mean(), scale=r.std(), size=(n_plot_pts,)) / r.mean()
    theta = rng.uniform(low=0., high=2 * np.pi, size=(n_plot_pts,))
    x_ = rho * np.cos(theta)
    y_ = rho * np.sin(theta)

    quantiles = q * r.max()
    cdf = (r.reshape(-1, 1) <= quantiles.reshape(1, -1)).mean(axis=0)

    # Plot visualizations
    ax00 = fig.add_subplot(gs[:ax_width, idx])
    ax00.hist(r, bins=100, density=True)
    ax00.set_title(f"Dim = {dimensions}\n\nDistance from Origin\nN = {m_sims:,}")
    ax00.set_ylabel("PDF")
    ax00.set_xlabel("Distance")
    ax00.set_xlim(xlim0)
    
    ax01 = fig.add_subplot(gs[ax_width:2 * ax_width, idx])
    ax01.hist(dists, bins=100, density=True)
    ax01.set_title(f"Distance Between Points\nN = {dists.shape[0]:,}")
    ax01.set_ylabel("PDF")
    ax01.set_xlabel("Distance")
    ax01.set_xlim(xlim1)

    ax02 = fig.add_subplot(gs[2 * ax_width:3 * ax_width, idx])
    ax02.plot(q, cdf)
    ax02.set_title("Radial Distribution\nRelative to " + r"$R_{max}$")
    ax02.set_ylabel("CDF")
    ax02.set_xlabel(r"R / $R_{max}$")
    
    ax03 = fig.add_subplot(gs[3 * ax_width:, idx])
    ax03.scatter(x_, y_, s=5)
    ax03.set_aspect('equal')
    ax03.set_title(f"Projected Sphere\nNormalized by " + r"$\overline{R}$")
    ax03.set_xlim(xlim2)
    ax03.set_ylim(ylim2)

plt.show()

100%|███████████████████████████████████| 22359/22359 [00:02<00:00, 8160.27it/s]
100%|███████████████████████████████████| 22359/22359 [00:03<00:00, 6154.73it/s]
100%|███████████████████████████████████| 22359/22359 [00:09<00:00, 2413.95it/s]
100%|████████████████████████████████████| 22359/22359 [00:26<00:00, 857.55it/s]

No description has been provided for this image

Sampling Features from Uniform Distributions: U(-1, 1)¶

In [4]:

dist_params = {'low': uniform_low, 'high': uniform_high}

fig = plt.figure(figsize=(12, 9), constrained_layout=True)
gs = GridSpec(9, len(dimension_vec), figure=fig)

for idx, dimensions in enumerate(dimension_vec):
    m_dim = int(MAX_BYTES / (BYTES_PER_ELEMENT * dimensions))
    m_dist = int(np.sqrt(2 * MAX_BYTES / BYTES_PER_ELEMENT)) # Approximate boundary
    m_sims = min(m_dim, m_dist)
    
    rng = np.random.default_rng(seed=seed)
    x = rng.uniform(**dist_params, size=(m_sims, dimensions))  # Sample from uniform
    r = np.sqrt((x**2).sum(axis=1))
    dists = get_distances(x)
    
    rho = rng.normal(loc=r.mean(), scale=r.std(), size=(n_plot_pts,)) / r.mean()
    theta = rng.uniform(low=0., high=2 * np.pi, size=(n_plot_pts,))
    x_ = rho * np.cos(theta)
    y_ = rho * np.sin(theta)

    quantiles = q * r.max()
    cdf = (r.reshape(-1, 1) <= quantiles.reshape(1, -1)).mean(axis=0)

    # Plot visualizations
    ax00 = fig.add_subplot(gs[:ax_width, idx])
    ax00.hist(r, bins=100, density=True)
    ax00.set_title(f"Dim = {dimensions}\n\nDistance from Origin\nN = {m_sims:,}")
    ax00.set_ylabel("PDF")
    ax00.set_xlabel("Distance")
    ax00.set_xlim(xlim0)
    
    ax01 = fig.add_subplot(gs[ax_width:2 * ax_width, idx])
    ax01.hist(dists, bins=100, density=True)
    ax01.set_title(f"Distance Between Points\nN = {dists.shape[0]:,}")
    ax01.set_ylabel("PDF")
    ax01.set_xlabel("Distance")
    ax01.set_xlim(xlim1)

    ax02 = fig.add_subplot(gs[2 * ax_width:3 * ax_width, idx])
    ax02.plot(q, cdf)
    ax02.set_title("Radial Distribution\nRelative to " + r"$R_{max}$")
    ax02.set_ylabel("CDF")
    ax02.set_xlabel(r"R / $R_{max}$")
    
    ax03 = fig.add_subplot(gs[3 * ax_width:, idx])
    ax03.scatter(x_, y_, s=5)
    ax03.set_aspect('equal')
    ax03.set_title(f"Projected Sphere\nNormalized by " + r"$\overline{R}$")
    ax03.set_xlim(xlim2)
    ax03.set_ylim(ylim2)

plt.show()

100%|███████████████████████████████████| 22359/22359 [00:02<00:00, 7956.73it/s]
100%|███████████████████████████████████| 22359/22359 [00:03<00:00, 6977.18it/s]
100%|███████████████████████████████████| 22359/22359 [00:09<00:00, 2397.43it/s]
100%|████████████████████████████████████| 22359/22359 [00:25<00:00, 884.96it/s]

In [5]:

print(f"Min Distance from Origin: {r.min():.2f}")
print(f"Max Distance from Origin: {r.max():.2f}")
print(f"Min Distance Between Points: {dists.min():.2f}")
print(f"Max Distance Between Points: {dists.max():.2f}")

Min Distance from Origin: 5.42
Max Distance from Origin: 7.53
Min Distance Between Points: 6.28
Max Distance Between Points: 11.98

In cases where features are sampled from standard normal or uniform distributions, we can see that increasing the number of features (the number of dimensions) results in samples that are farther apart from the origin and each other. However, no matter the number of dimensions, the variance of distances remains roughly constant. This results in samples that are distributed densely near the surface of a hypersphere and sparsely within. A two-dimensional scatter plot of the high-dimensional samples, normalized by the average Euclidean norm, tends to look more like a donut shape as the number of dimensions increases. This llustrates the fact that naive sampling methods will neglet a large portion of the possible solutions.

Even without normalizing samples by the Euclidean norm, we can observe that out of all 249,973,620 unique pairs of points with 128 dimensions, the pairwise distances between points ranged from 6.28 to 11.98. Put differently, in a 128-dimensional world full of 22,360 random points, there were zero points closer than 6.28 to any other point! Likewise, there were zero points farther than 11.98 from any other point. Try to visualize in your mind the conditions required to make all 22,360 points fall roughly within 6 to 12 units from every other point randomly and you’ll start to see how bizarre our situation is… and how detrimental it is to high-dimensional heuristic optimizers where the best answer might reside near the origin.

Thinking in terms of the space explored by our sampling method, we can compute the volume of an $n$-dimensional sphere as: $$V_n(\rho, n)=\frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}\rho^n$$

Considering that all 22,360 points fell between 5.42 and 7.53 units from the origin, we can compute the relative volume explored: $$ \begin{align} \frac{V_{explored}}{V_{available}} &= \frac{V(\rho_{max}, n) – V(\rho_{min}, n)}{V(\rho_{max}, n)} \\ &=\frac{V(7.53, 128) – V(5.42, 128)}{V(7.53, 128)} \\ &=\frac{9.0E54 – 4.1E36}{9.0E54} \\ &\approx1.0 \end{align} $$

How strange — even though our samples exist only in the outermost 28% of radial distances from the origin, their breadth spreads out over 100% of the search space!

It turns out that, in higher dimensions, all of the volume in space itself is concentrated in an outer shell-like region, far away from the origin. This is because volume is proportional to $\rho^n$, meaning it grows polynomially with the distance from the origin and exponentially with the number of dimensions, so there is vastly more space available near the outer boundaries of the search space. In many cases, the near-origin sparsity induced by The Curse of Dimensionality is not an issue, since barely any of the search space exists near the origin anyway. On the other hand, if we’re solving a problem heuristically and its solution exists comparatively close to the origin (for example, because we were too lenient in specifying the upper bounds of our feature space), than our current method of sampling may never find that solution.

Attempting a Solution¶

The following code represents my homemade approach to handling the near-origin sparsity of high-dimensional samples. Given the ubiquity of this phenomenon in data science and machine learning, I wouldn’t be surprised if somebody else figured this out a long time ago — if so, I would love to know the name of the method! That said, these are the steps I would like to try:

For each $n$-dimensional vector $\mathbf{\vec{X_j}}=\langle{x_{j0}, x_{j1}, …, x_{jn}}\rangle$ in $m$ samples, compute the Euclidean norm $\rho_j = |\mathbf{X_j}|$, which represents distance between the vector’s head (the sampled data point) and the origin.
Compute the cumulative probability function (CDF) of the $m$ Euclidean norms $\mathbf{\vec{P}}=\langle{\rho_{0}, \rho_{1}, …, \rho_{m}}\rangle$.
Use the resulting probabilities to compute the inverse CDF, but for a uniform distribution. This should transform the $m$ original Euclidean norms from their imbalanced distribution $\rho \sim A$ to a well-balanced uniform distribution $\rho^\prime \sim U(0, \rho_{max})$, where $\rho_{max}$ is the largest Euclidean norm in the original sample $\mathbf{\vec{P}}$. Alternatively, $\rho_{max}$ can be set to some desired upper bound.
Multiply each coordinate $x_{ji}$ by the ratio of norms $(\rho_j^\prime / \rho_j)$ to produce the rebalanced coordinate $x_{ji}^\prime$, where $i=0,1,…,n$ and $j=0,1,…,m$.

That last step works because of the following logic: $$\rho_j = \sqrt{\sum_{i=0}^{N}{x_{ji}^2}}$$ $$\rho_j^\prime = \sqrt{\sum_{i=0}^{N}{x_{ji}^{\prime 2}}}$$ $$\frac{\rho_j^\prime}{\rho_j}=\frac{\sqrt{\sum_{i=0}^{N}{x_{ji}^{\prime 2}}}}{\sqrt{\sum_{i=0}^{N}{x_{ji}^2}}}$$ $$\sqrt{\sum_{i=0}^{N}{x_{ji}^{\prime 2}}}=\frac{\rho_j^\prime}{\rho_j}\sqrt{\sum_{i=0}^{N}{x_{ji}^2}}$$ $$\sum_{i=0}^{N}{x_{ji}^{\prime 2}}=\left(\frac{\rho_j^\prime}{\rho_j}\right)^2\sum_{i=0}^{N}{x_{ji}^2}$$ Assuming we scale each component by the same proportion, then: $$x_{ji}^{\prime 2}=\left(\frac{\rho_j^\prime}{\rho_j}\right)^2 x_{ji}^2$$ $$x_{ji}^{\prime}=\left(\frac{\rho_j^\prime}{\rho_j}\right) x_{ji}$$

In the second-to-last line, we assumed that all $x_{ji}^2$ terms are scaled by the same proportion, which conveniently is $\left(\frac{\rho_j^\prime}{\rho_j}\right)^2$. There are infinitely many linear combinations of $x_{ji}^{\prime 2}$ for which this assumption is not true, but since our goal is to rebalance the samples across our feature space and not to resample them entirely, scaling all the terms by the same ratio allows us to preserve the relative magnitudes and proportions of components in the original vector.

The new vector should be uniformly distributed in the high-dimensional feature space, as far as I can tell. Let’s find out!

In [6]:

def redistribute_high_dimensional_samples_uniformly(x: np.ndarray) -> np.ndarray:
    """Accepts a matrix x of shape (m, n) -- that is, m vectors of dimension n
    -- and adjusts each vector to overcome near-origin sparsity caused by the 
    Curse of Dimensionality. Returns an (m, n) matrix containing the redistributed 
    samples, which should now fill hyperspace more uniformly.
    """
    r = np.sqrt((x**2).sum(axis=1))
    v = np.linspace(0., 1., r.shape[0]) * r.max()
    u = compute_cdf(X=r, x=r)
    r_new = compute_inverse_cdf(X=v, p=u)
    x_new = x * (r_new / r).reshape(-1, 1)
    return x_new

Test the Solution on Features Sampled from Standard Normal Distributions¶

In [7]:

dist_params = {'loc': mu, 'scale': std}

fig = plt.figure(figsize=(12, 9), constrained_layout=True)
gs = GridSpec(9, len(dimension_vec), figure=fig)

for idx, dimensions in enumerate(dimension_vec):
    m_dim = int(MAX_BYTES / (BYTES_PER_ELEMENT * dimensions))
    m_dist = int(np.sqrt(2 * MAX_BYTES / BYTES_PER_ELEMENT)) # Approximate boundary
    m_sims = min(m_dim, m_dist)
    
    rng = np.random.default_rng(seed=seed)
    x_ = rng.normal(**dist_params, size=(m_sims, dimensions))  # Sample from standard normal

    # APPLY SAMPLING CORRECTION
    x = redistribute_high_dimensional_samples_uniformly(x_)
    r = np.sqrt((x**2).sum(axis=1))
    
    dists = get_distances(x)
    
    rho = rng.normal(loc=r.mean(), scale=r.std(), size=(n_plot_pts,)) / r.mean()
    theta = rng.uniform(low=0., high=2 * np.pi, size=(n_plot_pts,))
    x_ = rho * np.cos(theta)
    y_ = rho * np.sin(theta)

    quantiles = q * r.max()
    cdf = (r.reshape(-1, 1) <= quantiles.reshape(1, -1)).mean(axis=0)

    
    ax00 = fig.add_subplot(gs[:ax_width, idx])
    ax00.hist(r, bins=100, density=True)
    ax00.set_title(f"Dim = {dimensions}\n\nDistance from Origin\nN = {m_sims:,}")
    ax00.set_ylabel("PDF")
    ax00.set_xlabel("Distance")
    ax00.set_xlim(xlim0)
    
    ax01 = fig.add_subplot(gs[ax_width:2 * ax_width, idx])
    ax01.hist(dists, bins=100, density=True)
    ax01.set_title(f"Distance Between Points\nN = {dists.shape[0]:,}")
    ax01.set_ylabel("PDF")
    ax01.set_xlabel("Distance")
    ax01.set_xlim(xlim1)

    ax02 = fig.add_subplot(gs[2 * ax_width:3 * ax_width, idx])
    ax02.plot(q, cdf)
    ax02.set_title("Radial Distribution\nRelative to " + r"$R_{max}$")
    ax02.set_ylabel("CDF")
    ax02.set_xlabel(r"R / $R_{max}$")
    
    ax03 = fig.add_subplot(gs[3 * ax_width:, idx])
    ax03.scatter(x_, y_, s=5)
    ax03.set_aspect('equal')
    ax03.set_title(f"Projected Sphere\nNormalized by " + r"$\overline{R}$")
    ax03.set_xlim(xlim2)
    ax03.set_ylim(ylim2)

plt.show()

100%|███████████████████████████████████| 22359/22359 [00:02<00:00, 8163.73it/s]
100%|███████████████████████████████████| 22359/22359 [00:03<00:00, 6830.02it/s]
100%|███████████████████████████████████| 22359/22359 [00:09<00:00, 2332.63it/s]
100%|████████████████████████████████████| 22359/22359 [00:26<00:00, 859.31it/s]

Test the Solution on Features Sampled from Uniform Distributions: U(-1, 1)¶

In [8]:

dist_params = {'low': uniform_low, 'high': uniform_high}

fig = plt.figure(figsize=(12, 9), constrained_layout=True)
gs = GridSpec(9, len(dimension_vec), figure=fig)

for idx, dimensions in enumerate(dimension_vec):
    m_dim = int(MAX_BYTES / (BYTES_PER_ELEMENT * dimensions))
    m_dist = int(np.sqrt(2 * MAX_BYTES / BYTES_PER_ELEMENT))  # Approximate boundary
    m_sims = min(m_dim, m_dist)
    
    rng = np.random.default_rng(seed=seed)
    x_ = rng.uniform(**dist_params, size=(m_sims, dimensions))  # Sample from uniform

    # APPLY SAMPLING CORRECTION
    x = redistribute_high_dimensional_samples_uniformly(x_)
    r = np.sqrt((x**2).sum(axis=1))
    
    dists = get_distances(x)
    
    rho = rng.normal(loc=r.mean(), scale=r.std(), size=(n_plot_pts,)) / r.mean()
    theta = rng.uniform(low=0., high=2 * np.pi, size=(n_plot_pts,))
    x_ = rho * np.cos(theta)
    y_ = rho * np.sin(theta)

    quantiles = q * r.max()
    cdf = (r.reshape(-1, 1) <= quantiles.reshape(1, -1)).mean(axis=0)

    
    ax00 = fig.add_subplot(gs[:ax_width, idx])
    ax00.hist(r, bins=100, density=True)
    ax00.set_title(f"Dim = {dimensions}\n\nDistance from Origin\nN = {m_sims:,}")
    ax00.set_ylabel("PDF")
    ax00.set_xlabel("Distance")
    ax00.set_xlim(xlim0)
    
    ax01 = fig.add_subplot(gs[ax_width:2 * ax_width, idx])
    ax01.hist(dists, bins=100, density=True)
    ax01.set_title(f"Distance Between Points\nN = {dists.shape[0]:,}")
    ax01.set_ylabel("PDF")
    ax01.set_xlabel("Distance")
    ax01.set_xlim(xlim1)

    ax02 = fig.add_subplot(gs[2 * ax_width:3 * ax_width, idx])
    ax02.plot(q, cdf)
    ax02.set_title("Radial Distribution\nRelative to " + r"$R_{max}$")
    ax02.set_ylabel("CDF")
    ax02.set_xlabel(r"R / $R_{max}$")
    
    ax03 = fig.add_subplot(gs[3 * ax_width:, idx])
    ax03.scatter(x_, y_, s=5)
    ax03.set_aspect('equal')
    ax03.set_title(f"Projected Sphere\nNormalized by " + r"$\overline{R}$")
    ax03.set_xlim(xlim2)
    ax03.set_ylim(ylim2)

plt.show()

100%|███████████████████████████████████| 22359/22359 [00:03<00:00, 7385.71it/s]
100%|███████████████████████████████████| 22359/22359 [00:03<00:00, 6691.88it/s]
100%|███████████████████████████████████| 22359/22359 [00:09<00:00, 2257.38it/s]
100%|████████████████████████████████████| 22359/22359 [00:27<00:00, 810.61it/s]

Redistributing the samples via CDF transformations worked smashingly. The resulting samples are uniformly distributed in their distance from the origin, the CDFs of their radial distributions confirms this, adjacent points can now be positioned very close to eachother with a wider range of possible distances, and the 2-dimensional hypersphere projections show points that fill the empty space evenly.

Summmary¶

We saw how increasing the number of dimensions in our search space resulted in narrowly distributed samples close to the surface of a hypershpere (appearing as a donut shape in two dimensions). In an optimization problem where each N-dimensional point represents a potential solution, this means that randomly sampling from normal and uniform distributions will fail to explore a large volume of potential solutions inside the hypersphere (the “donut hole”).

By scaling samples with a constant derived from the ratio of observed and desired radial densities, we succesfully redistributed our samples evenly throughout N-dimensional feature space. We found the scaling factor by computing the CDF of the Euclidean (L2) norms from the original sample, which yielded an array of probabilities in the original distribution. We applied the inverse CDF for a uniform distribution to these probabilities, resulting in corresponding Euclidean norms drawn from a uniform distribution. For each sample, we then computed the scaling factor as the ratio of the uniformly distributed Euclidean norm divided by the original Euclidean norm. The scaling factor allowed us to map each coordinate of the original sample to its respective coordinate in the rebalanced sample.

In [ ]: