---
title: "Metropolis-Hastings"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Metropolis-Hastings}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

In this vignette, we show how to implement a random-walk Metropolis-Hastings sampler using {anvl} [@hastings1970monte].
Metropolis-Hastings is a well known MCMC algorithm that generates samples from a target distribution by proposing moves and accepting or rejecting them based on the density ratio.

## The Banana Distribution

We sample from a "banana-shaped" distribution [@haario1999adaptive].
This 2-dimensional distribution consists of two Gaussian densities, where the second is dependent on the first:

$$p(\theta_1, \theta_2) = \underbrace{\frac{1}{\sqrt{200\pi}} \exp\!\left(-\frac{\theta_1^2}{200}\right)}_{p(\theta_1)} \;\cdot\; \underbrace{\frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\bigl(\theta_2 - b\theta_1^2 + 100b\bigr)^2}{2}\right)}_{p(\theta_2 \mid \theta_1)}$$

While it is trivial to sample from this distribution directly, it serves as a useful test case for our Metropolis-Hastings sampler implementation.
The distribution has known analytical moments which we will use later to verify the correctness of our implementation.
We set the hyperparameter $b$ to 0.01.


```{r}
library(anvl)
b_param <- 0.01
```

```{r, echo = FALSE, fig.width = 7, fig.height = 5}
log_p <- function(x1, x2) {
  -x1^2 / 200 - (x2 - b_param * x1^2 + 100 * b_param)^2 / 2
}

x1_grid <- seq(-30, 30, length.out = 200)
x2_grid <- seq(-10, 15, length.out = 200)
z <- outer(x1_grid, x2_grid, log_p)

par(mar = c(4, 4, 2, 1))
contour(
  x1_grid, x2_grid, exp(z), nlevels = 7,
  xlab = expression(theta[1]), ylab = expression(theta[2]),
  main = "Banana Distribution", col = "grey40"
)
```

## Metropolis-Hastings Algorithm

The random-walk Metropolis-Hastings algorithm proceeds as follows:

1. Propose $\theta' = \theta + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, s^2 I)$
2. Compute the acceptance ratio $\alpha = p(\theta') / p(\theta)$
3. Accept $\theta'$ with probability $\min(1, \alpha)$

In practice, we work with log-densities rather than densities directly.
Density values can become extremely small in high-dimensional or concentrated distributions, causing floating-point underflow.
By operating in log-space, the acceptance ratio $p(\theta') / p(\theta)$ becomes a simple difference $\log p(\theta') - \log p(\theta)$, which is numerically stable.
Furthermore, because the algorithm only ever evaluates *ratios* of densities, any normalizing constant cancels out in the difference.
This means we only need to know the log-density up to an additive constant.

The log-density of the banana distribution (up to a constant) is:

$$\log p(\theta_1, \theta_2) = -\frac{\theta_1^2}{200} - \frac{\bigl(\theta_2 - b\theta_1^2 + 100b\bigr)^2}{2}$$

We implement this in {anvl} and convert the parameter to an `f64` array for maximal precision.

```{r}
b_t <- nv_scalar(b_param, dtype = "f64")

log_density <- function(theta, b) {
  theta1 <- theta[1]
  theta2 <- theta[2]
  -theta1^2 / 200 - (theta2 - b * theta1^2 + 100 * b)^2 / 2
}
```

There are a few things to note in the code below:

a) the accept/reject decision uses the primitive `nv_if()` and not a native R conditional.
b) we sample a scalar uniform random number by setting `shape = integer()`
c) the RNG state is passed around explicitly.


```{r}
mh_step <- function(theta, rng_state, b, proposal_sd) {
  rng_out <- nv_rnorm(shape(theta), rng_state, dtype = "f64")
  rng_state <- rng_out[[1L]]
  proposal <- theta + proposal_sd * rng_out[[2L]]

  log_current <- log_density(theta, b)
  log_proposed <- log_density(proposal, b)
  log_accept <- log_proposed - log_current

  rng_out <- nv_runif(integer(), rng_state, dtype = "f64")
  rng_state <- rng_out[[1L]]
  u <- rng_out[[2L]]

  accept <- log(u) < log_accept
  new_theta <- nv_if(accept, \() proposal, \() theta)
  list(theta = new_theta, rng_state = rng_state)
}
```

Because subsequent samples are autocorrelated, we thin the output by keeping every nth sample.
We implement this loop via `nv_while()` below, avoiding R loop overhead.

```{r}
mh_sample <- jit(function(theta, rng_state, b, proposal_sd, thin) {
  result <- nv_while(
    init = list(theta = theta, rng_state = rng_state, i = nv_scalar(0L)),
    cond = \(theta, rng_state, i) i < thin,
    body = \(theta, rng_state, i) {
      c(mh_step(theta, rng_state, b, proposal_sd), list(i = i + 1L))
    }
  )[1:2]
})
```

## Running the Sampler

We specify the required parameters and run the sampler.
Again, note that we are passing around the RNG state explicitly.

```{r}
n_samples <- 20000L
n_warmup <- 5000L
thin <- nv_scalar(200L)

theta <- nv_array(c(0, 0), dtype = "f64")
rng_state <- nv_rng_state(42L)
proposal_sd <- nv_scalar(3, dtype = "f64")

result <- mh_sample(theta, rng_state, b_t, proposal_sd, thin)
theta <- result$theta
rng_state <- result$rng_state

for (i in seq_len(n_warmup)) {
  result <- mh_sample(theta, rng_state, b_t, proposal_sd, thin)
  theta <- result$theta
  rng_state <- result$rng_state
}

thetas <- vector("list", n_samples)

for (i in seq_len(n_samples)) {
  result <- mh_sample(theta, rng_state, b_t, proposal_sd, thin)
  theta <- result$theta
  rng_state <- result$rng_state
  thetas[[i]] <- theta
}

samples <- do.call(rbind, lapply(thetas, function(x) as.vector(x)))
colnames(samples) <- c("theta1", "theta2")
```

Next, we compare the results with the analytical moments to see whether everything worked as expected.

## Inspecting the Results

The autocorrelation function (ACF) of the thinned samples shows that successive draws are nearly independent:

```{r, echo = FALSE, fig.width = 7, fig.height = 4}
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
acf(samples[, 1], main = expression(theta[1]), lag.max = 40)
acf(samples[, 2], main = expression(theta[2]), lag.max = 40)
```

The banana distribution has known analytical moments.
Since $\theta_1 \sim \mathcal{N}(0, 100)$ and $\theta_2 \mid \theta_1 \sim \mathcal{N}(b\theta_1^2 - 100b, 1)$:

- $\mathbb{E}[\theta_1] = 0$, $\text{SD}[\theta_1] = 10$
- $\mathbb{E}[\theta_2] = 0$, $\text{SD}[\theta_2] = \sqrt{2 \cdot 10^4 b^2 + 1}$

```{r, echo = FALSE}
sample_mean <- colMeans(samples)
sample_sd <- apply(samples, 2, sd)

true_mean <- c(0, 0)
true_sd <- c(10, sqrt(2e4 * b_param^2 + 1))

comparison <- data.frame(
  Parameter = c("theta1", "theta2"),
  MH_Mean = sample_mean,
  True_Mean = true_mean,
  MH_SD = sample_sd,
  True_SD = true_sd,
  row.names = NULL
)
comparison
```

The Metropolis-Hastings estimates closely match the analytical moments,
confirming the correctness of our implementation.

We can visualize the samples against the true density contours:

```{r, echo = FALSE, fig.width = 7, fig.height = 5}
par(mar = c(4, 4, 2, 1))
contour(
  x1_grid, x2_grid, exp(z), nlevels = 15,
  xlab = expression(theta[1]), ylab = expression(theta[2]),
  main = "MH Samples vs True Density", col = "grey60"
)
points(
  samples[, 1], samples[, 2],
  pch = 16, cex = 0.3, col = adjustcolor("steelblue", 0.4)
)
```

And overlay the analytical marginal densities on histograms of the samples.

```{r, echo = FALSE, fig.width = 7, fig.height = 4}
marginal_theta2 <- Vectorize(function(t2) {
  integrate(function(t1) {
    dnorm(t2, mean = b_param * t1^2 - 100 * b_param, sd = 1) *
      dnorm(t1, mean = 0, sd = 10)
  }, lower = -30, upper = 30)$value
})

par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

hist(
  samples[, 1], breaks = 40, probability = TRUE,
  main = expression(theta[1]),
  xlab = expression(theta[1]), col = "lightblue", border = "white"
)
curve(
  dnorm(x, mean = 0, sd = 10),
  add = TRUE, col = "red", lwd = 2
)
legend("topright", "Analytical", col = "red", lwd = 2, bty = "n", cex = 0.8)

hist(
  samples[, 2], breaks = 40, probability = TRUE,
  main = expression(theta[2]),
  xlab = expression(theta[2]), col = "lightblue", border = "white"
)
curve(marginal_theta2, add = TRUE, col = "red", lwd = 2, n = 200)
legend("topright", "Analytical", col = "red", lwd = 2, bty = "n", cex = 0.8)
```