--- title: "Logistic Regression" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Logistic Regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- In this vignette, we will implement logistic regression from scratch. ## Problem Logistic regression models the probability that an observation belongs to a particular class. Given a feature matrix $X$ and a weight vector $\beta$, the predicted probability is: $$P(y = 1 | X) = \sigma(X \beta + \alpha)$$ where $\sigma$ is the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$ and $\alpha$ is the bias term. In practice, logistic regression is typically fitted using Iteratively Reweighted Least Squares (IRLS), which is a form of Newton's method. Here, we will use a simple gradient descept algorithm for didactic purposes. ## Data We'll use the classic Titanic dataset from base R, which contains survival data for the most famous shipwreck in history. The `Titanic` object is a contingency table, so we first expand it into individual observations. ```{r} library(anvl) set.seed(42) # Expand contingency table to individual observations titanic_df <- as.data.frame(Titanic) titanic <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4] rownames(titanic) <- NULL # Create design matrix with dummy variables X <- model.matrix(~ Class + Sex + Age, data = titanic)[, -1] y <- as.integer(titanic$Survived == "Yes") # Standardize features for better gradient descent convergence X <- scale(X) n <- nrow(X) p <- ncol(X) ``` The dataset contains `r n` observations and `r p` predictor variables. Our goal is to predict survival (`Survived`) based on three categorical features: - **Class**: Passenger class (1st, 2nd, 3rd, or Crew) - encoded as dummy variables with 1st class as the reference category - **Sex**: Male or Female - encoded with Male as the reference category - **Age**: Child or Adult - encoded with Child as the reference category ```{r} summary(titanic) ``` The general survival rate was `r sprintf("%.2f", mean(y) * 100)`%. Now we convert the data to `AnvlArray`s. ```{r} X_tensor <- nv_array(X, dtype = "f32") y_tensor <- nv_array(y, dtype = "f32", shape = c(n, 1L)) ``` ## Model The logistic regression model consists of computing the linear combination of features and then applying the logistic function. ```{r} predict_proba <- function(X, beta, alpha) { logits <- X %*% beta + alpha nv_logistic(logits) } ``` For binary classification, we use the binary cross-entropy loss: $$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$ We need to be careful with numerical stability when computing the logarithm of probabilities close to 0 or 1. We'll add a small epsilon to avoid taking the log of exactly 0. ```{r} binary_cross_entropy <- function(y_true, y_pred) { eps <- 1e-7 y_pred_clipped <- nv_clamp(eps, y_pred, 1 - eps) loss <- -(y_true * log(y_pred_clipped) + (1 - y_true) * log(1 - y_pred_clipped)) mean(loss) } ``` We combine the prediction and loss computation into a single function. ```{r} model_loss <- function(X, y, beta, alpha) { y_pred <- predict_proba(X, beta, alpha) binary_cross_entropy(y, y_pred) } ``` Using {anvl}'s automatic differentiation, we can obtain the gradients of the loss with respect to the model parameters. ```{r} model_loss_grad <- gradient(model_loss, wrt = c("beta", "alpha")) ``` ## Training We will implement the training loop using `nv_while()`. This keeps the entire training loop within a single compiled function, which is more efficient than repeatedly calling a JIT-compiled function from R, especially for small models. ```{r} fit_logreg <- jit(function(X, y, beta, alpha, n_epochs, lr) { output <- nv_while( list(beta = beta, alpha = alpha, epoch = nv_scalar(0L)), \(beta, alpha, epoch) epoch < n_epochs, \(beta, alpha, epoch) { grads <- model_loss_grad(X, y, beta, alpha) list( beta = beta - lr * grads$beta, alpha = alpha - lr * grads$alpha, epoch = epoch + 1L ) } ) list(beta = output$beta, alpha = output$alpha) }) ``` We initialize the parameters and train the model with a single function call. ```{r} beta_init <- nv_array(rnorm(p), dtype = "f32", shape = c(p, 1L)) alpha_init <- nv_scalar(0, dtype = "f32") result <- fit_logreg( X_tensor, y_tensor, beta_init, alpha_init, nv_scalar(50000L), nv_scalar(0.1) ) result ``` Let's verify our implementation by comparing with R's built-in `glm()`. ```{r} glm_fit <- glm(y ~ X, family = binomial) ``` Now let's compare the coefficients: ```{r, echo = FALSE} anvl_coef <- c(as_array(result$alpha), as.vector(result$beta)) glm_coef <- coef(glm_fit) # Clean up parameter names (remove "X" prefix from glm output) param_names <- gsub("^X", "", names(glm_coef)) comparison <- data.frame( Parameter = param_names, anvl = anvl_coef, glm = glm_coef, row.names = NULL ) comparison ``` The estimates are very close, confirming that our gradient descent implementation converges to the same solution as `glm()`.