---
title: "Gaussian Process"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Gaussian Process}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

In this vignette, we implement a Gaussian Process (GP) regression model from scratch.

A Gaussian Process is a collection of random variables, any finite subset of which follows a
joint Gaussian distribution.
A GP is fully specified by a mean function $m(\mathbf{x})$ and a covariance (kernel) function
$k(\mathbf{x}, \mathbf{x}')$:

$$f(\mathbf{x}) \sim \mathcal{GP}\bigl(m(\mathbf{x}),\; k(\mathbf{x}, \mathbf{x}')\bigr)$$

For any finite set of inputs $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$, the corresponding
function values $\mathbf{f} = [f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)})]^\top$ are jointly
Gaussian:

$$\mathbf{f} \sim \mathcal{N}\bigl(\mathbf{m},\; \mathbf{K}\bigr)$$

where $\mathbf{m}_i = m(\mathbf{x}^{(i)})$ and
$\mathbf{K}_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$.
We assume a zero mean function $m(\mathbf{x}) = 0$ throughout this vignette (which is standard
practice, since the kernel already provides enough flexibility).

## Kernel

We use the squared exponential (also called RBF) kernel:

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \ell^2}\right)$$

There are also many other kernels that can be used, such as the Matérn kernel, the periodic kernel, and the exponential cosine kernel.
Below, we implement a function that computes the kernel matrix for two sets of points:

```{r}
library(anvl)

# X1: (n, d) array
# X2: (m, d) array
# lengthscale, signal_var: () array
rbf_kernel_matrix <- function(X1, X2, lengthscale, signal_var) {
  n <- shape(X1)[1L]
  m <- shape(X2)[1L]
  d <- shape(X1)[2L]
  diff <- nv_broadcast_arrays(
    nv_reshape(X1, c(n, 1L, d)),
    nv_reshape(X2, c(1L, m, d))
  )
  diff <- diff[[1L]] - diff[[2L]]
  sq_dist <- nv_reduce_sum(diff * diff, dims = 3L)
  signal_var * exp(-sq_dist / (2 * lengthscale^2))
}
```

Because `anvl` jit-compiles the code, there is no performance penalty for custom kernels, other than if we would use a C++ library that has a number of hard-coded kernels available.

## Joint Distribution

Suppose we have $n$ training inputs collected in a design matrix $\mathbf{X}$ and the
corresponding function values
$\mathbf{f} = [f(\mathbf{x}^{(1)}), \dots, f(\mathbf{x}^{(n)})]^\top$.
We observe noisy targets $\mathbf{y} = \mathbf{f} + \pmb{\epsilon}$, where
$\pmb{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.
Given $n_*$ test inputs $\mathbf{X}_*$, we want to predict the latent function values
$\mathbf{f}_* = [f(\mathbf{x}_*^{(1)}), \dots, f(\mathbf{x}_*^{(n_*)})]^\top$.
Assuming a zero-mean GP prior $\mathcal{GP}(\mathbf{0}, k(\mathbf{x}, \mathbf{x}'))$,
the joint distribution of the observed values and the latent test values is:

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\mathbf{0},\;
\begin{bmatrix}
\mathbf{K} + \sigma^2 \mathbf{I} & \mathbf{K}_* \\
\mathbf{K}_*^\top & \mathbf{K}_{**}
\end{bmatrix}
\right)$$

where $\mathbf{K} = \bigl(k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})\bigr)_{i,j}$
is the $n \times n$ training kernel matrix,
$\mathbf{K}_* = \bigl(k(\mathbf{x}^{(i)}, \mathbf{x}_*^{(j)})\bigr)_{i,j}$
is the $n \times n_*$ cross-covariance between training and test points,
and $\mathbf{K}_{**}$ is the $n_* \times n_*$ kernel matrix among test points.

## Posterior Predictive Distribution

We want to infer the distribution of $\mathbf{f}_*$ given the observed data $\mathbf{y}$.
By applying the general rule for conditioning of Gaussian random variables to the joint
distribution above, we obtain the posterior predictive distribution:

$$\mathbf{f}_* \mid \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\pmb{\mu}_*, \pmb{\Sigma}_*)$$

with

$$\pmb{\mu}_* = \mathbf{K}_*^\top \mathbf{K}_y^{-1} \mathbf{y}$$
$$\pmb{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^\top \mathbf{K}_y^{-1} \mathbf{K}_*$$

where $\mathbf{K}_y := \mathbf{K} + \sigma^2 \mathbf{I}$.

The mean $\pmb{\mu}_*$ is a linear combination of the training observations, weighted by
the kernel similarity between the test and training points. The covariance $\pmb{\Sigma}_*$
is the prior covariance minus a term that shrinks the uncertainty wherever training data is
available.

Note that prediction is purely a matter of matrix computation --- given the kernel
hyperparameters, all quantities follow directly from the formulas above.

We now apply this to a synthetic example. We generate training data from a function that
combines a sinusoid with a linear trend and a localized bump:

$$f(x) = \sin(x) + 0.3\,x + \exp\!\left(-2\,(x - 2)^2\right)$$

```{r}
set.seed(42)

true_fn <- function(x) sin(x) + 0.3 * x + exp(-2 * (x - 2)^2)

n_train <- 20
noise_sd <- 0.2

X_train <- sort(runif(n_train, -5, 5))
y_train <- true_fn(X_train) + rnorm(n_train, sd = noise_sd)

n_test <- 100
X_test <- seq(-6, 6, length.out = n_test)
```

```{r, echo=FALSE}
plot(X_train, y_train, pch = 19, xlab = "x", ylab = "y", main = "Training Data")
curve(true_fn, from = -6, to = 6, add = TRUE, col = "blue", lwd = 2)
legend("topright", legend = c("Observations", "True function"),
  pch = c(19, NA), lty = c(NA, 1), col = c("black", "blue"), lwd = c(NA, 2))
```

The `predict_gp` function implements the posterior predictive formulas by Cholesky-
factorizing $\mathbf{K}_y = \mathbf{L} \mathbf{L}^\top$ and applying `nv_triangular_solve`
twice (a forward solve against $\mathbf{L}$ and a back solve against $\mathbf{L}^\top$),
rather than explicitly inverting the kernel matrix.

```{r}
chol_solve <- function(L, b) {
  # solves K_y x = b where K_y = L L^T (L lower triangular)
  y <- nv_triangular_solve(L, b,
    left_side = TRUE, lower = TRUE,
    unit_diagonal = FALSE, transpose_a = FALSE
  )
  nv_triangular_solve(L, y,
    left_side = TRUE, lower = TRUE,
    unit_diagonal = FALSE, transpose_a = TRUE
  )
}

predict_gp <- jit(function(kernel, X_train, y_train, X_test, lengthscale, signal_var, noise_var) {
  n <- shape(X_train)[1L]

  K <- kernel(X_train, X_train, lengthscale, signal_var)
  K_y <- K + noise_var * nv_eye(n, dtype = dtype(K))

  K_s <- kernel(X_test, X_train, lengthscale, signal_var)
  K_ss <- kernel(X_test, X_test, lengthscale, signal_var)

  L <- nv_chol(K_y, lower = TRUE)

  # Predictive mean: K_s %*% K_y^{-1} %*% y
  alpha <- chol_solve(L, y_train)
  mu <- K_s %*% alpha

  # Predictive covariance: K_ss - K_s %*% K_y^{-1} %*% K_s^T
  v <- chol_solve(L, t(K_s))
  cov <- K_ss - K_s %*% v

  list(mu = mu, cov = cov)
}, static = "kernel")
```

We apply this with $\ell = 0.5$, $\sigma_f^2 = 2$, and $\sigma^2 = 0.04$:

```{r}
X_t <- nv_matrix(X_train, ncol = 1)
y_t <- nv_matrix(y_train, ncol = 1)
X_test_t <- nv_matrix(X_test, ncol = 1)

pred <- predict_gp(
  rbf_kernel_matrix, X_t, y_t, X_test_t,
  lengthscale = nv_scalar(0.5),
  signal_var = nv_scalar(2),
  noise_var = nv_scalar(0.04)
)
```

```{r, fig.width=7, fig.height=5, echo=FALSE}
mu_pred <- as.vector(pred$mu)
var_pred <- diag(as_array(pred$cov))
sd_pred <- sqrt(pmax(var_pred, 0))
plot(X_test, mu_pred, type = "l", lwd = 2, col = "red",
  ylim = range(c(mu_pred - 2 * sd_pred, mu_pred + 2 * sd_pred, y_train)),
  xlab = "x", ylab = "y",
  main = "GP prediction with hand-picked hyperparameters")
polygon(
  c(X_test, rev(X_test)),
  c(mu_pred + 2 * sd_pred, rev(mu_pred - 2 * sd_pred)),
  col = rgb(1, 0, 0, 0.15), border = NA
)
lines(X_test, mu_pred, lwd = 2, col = "red")
points(X_train, y_train, pch = 19, cex = 0.8)
curve(true_fn, from = -6, to = 6, add = TRUE, col = "blue", lwd = 2, lty = 2)
legend("topright",
  legend = c("GP mean", "95% interval", "Observations", "True function"),
  lty = c(1, NA, NA, 2), lwd = c(2, NA, NA, 2), pch = c(NA, 15, 19, NA),
  col = c("red", rgb(1, 0, 0, 0.3), "black", "blue"))
```

Next, we will focus on learning these parameters by maximizing the marginal likelihood.

## Marginal Likelihood

With $\pmb{\theta} = (\ell, \sigma_f^2, \sigma^2)$, we can write down the log marginal likelihood as follows:

$$p(\mathbf{y} \mid \mathbf{X}, \pmb{\theta}) = \int p(\mathbf{y} \mid \mathbf{f}, \mathbf{X}) \, p(\mathbf{f} \mid \mathbf{X}, \pmb{\theta}) \, d\mathbf{f}$$

Because both the likelihood $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 \mathbf{I})$
and the prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{0}, \mathbf{K})$ are Gaussian, this integral has a
closed-form solution:

$$\log p(\mathbf{y} \mid \mathbf{X}, \pmb{\theta})
= \underbrace{-\tfrac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}}_{\text{data fit}}
  \;\underbrace{- \tfrac{1}{2} \log |\mathbf{K}_y|}_{\text{complexity penalty}}
  \;\underbrace{- \tfrac{n}{2} \log 2\pi}_{\text{constant}}$$

The three terms have interpretable roles, which can be understood by considering how they
behave as the lengthscale $\ell$ increases (i.e. the model becomes smoother and less flexible):

- The **data fit** $-\frac{1}{2} \mathbf{y}^\top \mathbf{K}_y^{-1} \mathbf{y}$ measures how well
  the model explains the observations. A small $\ell$ allows the GP to interpolate the data
  closely, giving a good fit; a large $\ell$ produces an overly smooth function that fits the
  data poorly.
- The **complexity penalty** $-\frac{1}{2} \log |\mathbf{K}_y|$ depends only on the covariance
  function and increases with the lengthscale, because a smoother model is less complex (higher
  penalty value means less penalization). A small $\ell$ leads to a flexible model with a
  large $|\mathbf{K}_y|$ and thus a heavy penalty.
- The **constant** $-\frac{n}{2} \log 2\pi$ is a normalization term independent of
  $\pmb{\theta}$.

Below, we implement the negative log marginal likelihood but leave out the constant term because we don't need it for optimization.

```{r}
neg_log_marginal_likelihood <- function(kernel, X, y, lengthscale, signal_var, noise_var) {
  n <- shape(y)[1L]

  K <- kernel(X, X, lengthscale, signal_var)
  eye <- nv_eye(n, dtype = dtype(K))
  K_y <- K + noise_var * eye

  L <- nv_chol(K_y, lower = TRUE)

  # alpha = K_y^{-1} y, computed via L L^T x = y
  z <- nv_triangular_solve(L, y,
    left_side = TRUE, lower = TRUE,
    unit_diagonal = FALSE, transpose_a = FALSE
  )
  alpha <- nv_triangular_solve(L, z,
    left_side = TRUE, lower = TRUE,
    unit_diagonal = FALSE, transpose_a = TRUE
  )

  # Data fit term: 0.5 * y^T %*% alpha
  data_fit <- 0.5 * nv_reduce_sum(y * alpha, dims = c(1L, 2L))

  # Log determinant: sum(log(diag(L)))
  diag_L <- nv_reduce_sum(L * eye, dims = 2L)
  log_det <- nv_reduce_sum(log(diag_L), dims = 1L)

  data_fit + log_det
}
```

### Optimization

We optimize the hyperparameters using gradient descent. Since $\ell$, $\sigma_f^2$, and
$\sigma^2$ must be positive, we optimize on the log scale to allow unconstrained updates.
`anvl` computes the gradients via `gradient()` and the entire training loop is
JIT-compiled with `nv_while`.

```{r}
params <- c("log_lengthscale", "log_signal_var", "log_noise_var")
nll_grad <- gradient(\(X, y, log_lengthscale, log_signal_var, log_noise_var) {
  neg_log_marginal_likelihood(
    rbf_kernel_matrix, X, y,
    exp(log_lengthscale), exp(log_signal_var), exp(log_noise_var)
  )
}, wrt = params)

train_gp <- jit(function(X, y, lengthscale, signal_var, noise_var,
  n_steps, learning_rate) {
  result <- nv_while(
    list(
      log_lengthscale = log(lengthscale),
      log_signal_var = log(signal_var),
      log_noise_var = log(noise_var),
      step = 0L
    ),
    \(log_lengthscale, log_signal_var, log_noise_var, step) step < n_steps,
    \(log_lengthscale, log_signal_var, log_noise_var, step) {
      grads <- nll_grad(X, y, log_lengthscale, log_signal_var, log_noise_var)
      list(
        log_lengthscale = log_lengthscale - learning_rate * grads$log_lengthscale,
        log_signal_var = log_signal_var - learning_rate * grads$log_signal_var,
        log_noise_var = log_noise_var - learning_rate * grads$log_noise_var,
        step = step + 1L
      )
    }
  )
  list(
    lengthscale = exp(result$log_lengthscale),
    signal_var = exp(result$log_signal_var),
    noise_var = exp(result$log_noise_var)
  )
})
```

```{r}
result <- train_gp(
  X_t, y_t,
  lengthscale = nv_scalar(1),
  signal_var = nv_scalar(1),
  noise_var = nv_scalar(exp(-1)),
  n_steps = nv_scalar(500L),
  learning_rate = nv_scalar(0.01)
)

result
```

### Results

Now we can use the learned hyperparameters for prediction just like we have done above and visualize the results.

```{r, fig.width=7, fig.height=5, echo=FALSE}
pred <- predict_gp(
  rbf_kernel_matrix, X_t, y_t, X_test_t,
  result$lengthscale, result$signal_var, result$noise_var
)

mu <- as.vector(pred$mu)
var <- diag(as_array(pred$cov))
sd <- sqrt(pmax(var, 0))
plot(X_test, mu, type = "l", lwd = 2, col = "red",
  ylim = range(c(mu - 2 * sd, mu + 2 * sd, y_train)),
  xlab = "x", ylab = "y", main = "GP with learned hyperparameters")
polygon(
  c(X_test, rev(X_test)),
  c(mu + 2 * sd, rev(mu - 2 * sd)),
  col = rgb(1, 0, 0, 0.15), border = NA
)
lines(X_test, mu, lwd = 2, col = "red")
points(X_train, y_train, pch = 19)
curve(true_fn, from = -6, to = 6, add = TRUE, col = "blue", lwd = 2, lty = 2)
legend("topright",
  legend = c("GP mean", "95% interval", "Observations", "True function"),
  lty = c(1, NA, NA, 2), lwd = c(2, NA, NA, 2), pch = c(NA, 15, 19, NA),
  col = c("red", rgb(1, 0, 0, 0.3), "black", "blue"))
```

Compared to the hand-picked hyperparameters, the learned ones produce a much better fit
with tighter uncertainty bands near the training data and wider bands where data is sparse.

## Convergence with More Data

As the number of observations increases, the GP posterior should converge to the true function
and the uncertainty should shrink. We demonstrate this by re-fitting the hyperparameters
for each dataset size. Since we only sample training points within $[-5, 5]$, the
uncertainty grows beyond that range where no data is available.

```{r, fig.width=7, fig.height=8, echo=FALSE}
set.seed(123)
n_values <- c(10, 20, 40, 80)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

for (n in n_values) {
  X_dense <- sort(runif(n, -5, 5))
  y_dense <- true_fn(X_dense) + rnorm(n, sd = noise_sd)

  X_dense_t <- nv_matrix(X_dense, ncol = 1)
  y_dense_t <- nv_matrix(y_dense, ncol = 1)

  res_dense <- train_gp(
    X_dense_t, y_dense_t,
    lengthscale = nv_scalar(1),
    signal_var = nv_scalar(1),
    noise_var = nv_scalar(exp(-1)),
    n_steps = nv_scalar(500L),
    learning_rate = nv_scalar(0.01)
  )

  pred_dense <- predict_gp(
    rbf_kernel_matrix, X_dense_t, y_dense_t, X_test_t,
    res_dense$lengthscale, res_dense$signal_var, res_dense$noise_var
  )

  mu_d <- as.vector(pred_dense$mu)
  var_d <- diag(as_array(pred_dense$cov))
  sd_d <- sqrt(pmax(var_d, 0))

  plot(X_test, mu_d, type = "l", lwd = 2, col = "red",
    ylim = c(-4, 5), xlab = "x", ylab = "y",
    main = sprintf("n = %d", n))
  polygon(
    c(X_test, rev(X_test)),
    c(mu_d + 2 * sd_d, rev(mu_d - 2 * sd_d)),
    col = rgb(1, 0, 0, 0.15), border = NA
  )
  lines(X_test, mu_d, lwd = 2, col = "red")
  curve(true_fn, from = -6, to = 6, add = TRUE, col = "blue", lwd = 2, lty = 2)
}
```

With more observations the GP mean closely tracks $f(x)$ and the uncertainty bands become
negligibly thin, confirming convergence to the true function.

## Further Reading

For a more in-depth treatment of Gaussian Processes, see the
[Introduction to Machine Learning lecture on Gaussian Processes](https://slds-lmu.github.io/i2ml/chapters/19_gaussian_processes/).