---
title: "Efficiency"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Efficiency}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(anvl)
```

This vignette summarizes various important considerations in order to write efficient programs using {anvl}.
Some of the topics are covered in more depth elsewhere, but here we gather everything in one place, albeit sometimes only briefly.

- [**Eager vs. JIT**](#eager-vs-jit) -- comparison of the two execution modes.
- [**CUDA**](#cuda) -- running on a GPU instead of CPU (Linux x86 / WSL2 only).
- [**Data types**](#data-types) -- using `f32` (the default) instead of `f64`.
- [**BLAS / LAPACK**](#blas-lapack) -- linking R against a fast LAPACK to speed up linear algebra on CPU.
- [**Compilation cost**](#compilation-cost) -- padding inputs so more calls hit the same cache entry.
- [**Asynchronous execution**](#asynchronous-execution) -- how to keep the device (e.g. a GPU) busy while R prepares the next call.
- [**Memory**](#memory) -- avoiding unnecessary copies via donation and jit-internal optimization.

## Eager vs. JIT

In eager mode we don't compile the whole program but compile many tiny XLA programs.
This is more convenient to debug as you can inspect intermediate results and step through the program with the R debugger.
This comes at a performance cost, however.
Specifically:

1. More per-primitive call overhead (such as launching a kernel on the GPU).
   When `jit()`ting the whole function, you only have the per-call overhead once.
2. The compiler has less context for optimization, as it compiles many small programs individually instead of one large program.

To get the best performance, you will therefore usually want to `jit()` the whole function.
There might be some exceptions, as compiling one large function can sometimes be more costly than compiling several smaller ones, even though splitting it up gives the compiler less room for joint optimization.

## CUDA

Provided your computation is large enough to be worth shipping over to the GPU, running on a CUDA GPU is the single biggest speed-up {anvl} offers -- typically 10x to 100x over CPU on linear algebra and large elementwise work.

GPU support currently only works on Linux (amd64/x86-64) or via WSL2 on Windows.
See the [GPU Installation](installation.html#gpu-installation) section of the installation vignette for setup; once that's done, any `AnvlArray` constructor and `jit()` accept `device = "cuda"`:

```{r, eval = FALSE}
library(anvl)
x <- nv_array(1:8, dtype = "f32", device = "cuda")
y <- nv_array(9:16, dtype = "f32", device = "cuda")
x + y

# Or pin a jitted function to a specific GPU:
f_cuda <- jit(function(x, y) x + y, device = "cuda:0")
```

A few things to keep in mind when moving to GPU:

- **Copying data between R and the GPU is slow.** Each transfer has a fixed overhead plus a per-byte cost. Once you've moved an array onto the GPU, keep it there; avoid `as_array()` (which copies the value back into R) inside loops you want to run fast (see [Asynchronous execution](#asynchronous-execution) below).
- **Each call to the GPU has more overhead than a CPU call.** Small operations on small arrays can actually be *slower* on GPU than on CPU because the per-call overhead dominates the actual work.

### Writing GPU-friendly code

GPUs get their speed from running thousands of operations in parallel, so they reward code that does a lot of work per call on large arrays.
A few rules of thumb:

- **Avoid R-level loops over array elements or rows.** Each iteration launches a separate kernel with non-trivial overhead, and the kernels run one after another instead of in parallel. Express the work as a single vectorized operation on the whole array whenever possible (e.g. `x * y` or `nv_reduce_sum(x)` instead of a `for` loop).
- **Prefer batched operations.** Operations like `nv_matmul()` accept a full batch of matrices at once. Stacking many small inputs into one larger array and processing them in a single call is typically much faster than calling the same operation many times. This mirrors plain R, where vectorized code beats a `for` loop -- except that on a GPU the win comes from running the batch entries in parallel rather than from avoiding interpreter overhead.
- **Sometimes the algorithm itself has to be rewritten.** Inherently sequential formulations -- e.g. a running update that depends on the previous step -- can leave most of the GPU idle. Reformulating the computation to expose parallelism can be dramatically faster, even when the parallel version does more total work.

## Data types

The data type of an `AnvlArray` (`f32`, `f64`, `i32`, `i64`, ...) affects both how much memory it uses and how fast operations on it run.
The default is to use `f32` for R `double`s and `i32` for R `integer`s.
Currently, we don't support other floating-point data types such as `f16` or `bf16`.
There is of course a trade-off between efficiency and precision here.
Speedups from using `f32` over `f64` also depend on the specific hardware.
Commonly, `f32` is the default data type for floating-point operations in GPU-accelerated frameworks, so unless you have a specific reason, we recommend sticking with the defaults.

## BLAS / LAPACK

A handful of linear-algebra primitives -- `nv_svd()`, `nv_eigh()`, `nv_qr()`, `nv_lu()` -- use R's BLAS / LAPACK on the CPU backend, so the performance of these operations is determined by how R was built.

The default R install on many platforms ships with a reference BLAS / LAPACK that is single-threaded and considerably slower than tuned alternatives such as OpenBLAS, Intel MKL, or Apple's Accelerate framework.
The CUDA backend is unaffected.

## Compilation cost

Compiling an {anvl} function can take anywhere from milliseconds to seconds (or even minutes) depending on the size of the function.
The compilation cache reuses the compiled function across calls with matching input types -- see [The compilation cache](jit.html#the-compilation-cache) in the JIT Deep Dive for the details.
What follows is the complementary trick of reshaping inputs so that more calls land on the same cache entry.

### Padding inputs to avoid recompilation

Every new combination of input shapes triggers a fresh recompile.
If the input shapes vary often, this can considerably degrade performance and you might spend more time compiling than computing.

A common workaround is to *round each varying dimension up to a fixed bucket size and pad* the data to that size.
All calls then share one compiled function, at the cost of doing some wasted work on the padded positions.

For example, suppose we want to sum a vector whose length varies between calls:

```{r}
sum_jit <- jit(function(x) {
  cat("compiling for length ", shape(x), "\n", sep = "")
  nv_reduce_sum(x, dims = 1L)
})

for (n in c(7, 11, 13, 17)) {
  sum_jit(nv_array(seq_len(n), dtype = "f32"))
}
```

Each new length triggers a recompile.
By padding each input up to the next multiple of 32 with zeros, all four calls share one compiled function:

```{r}
pad_to_multiple <- function(x, m) {
  n <- length(x)
  c(x, rep(0, ceiling(n / m) * m - n))
}

for (n in c(7, 11, 13, 17)) {
  sum_jit(nv_array(pad_to_multiple(seq_len(n), 32), dtype = "f32"))
}
```

Here, we only compile once for input size 32 and then reuse the compiled executable.
One common strategy is to split up the varying lengths into buckets of size `2^n` to balance recompilation with wasted computation.
The example above only works because the padded zeros are the *neutral value* of the performed computation (addition).
For computations where the pad value is not neutral -- e.g. the mean, where the padded positions would skew the average -- the padded entries have to be masked out explicitly.
See the [Static Shape Restriction](static_shapes.html) vignette for the masking patterns and a table of neutral values for common reductions.

## Asynchronous execution

One surprising fact about {anvl} is that calls into compiled functions (as long as we have a cache hit) return almost immediately.
In R terms, we would say what is actually returned is a promise.
The actual computation keeps running in the background while the R interpreter can do something else.
This is especially important when working with a GPU, because we want to use both CPU and GPU simultaneously.
But it also matters on CPU as XLA runs many threads in parallel, while R itself is single-threaded.
Note that you need to be aware of this when benchmarking your {anvl} functions.

The natural question is how we avoid reading wrong results when functions always return immediately, before the computation has finished.
This works out because {anvl} will await the computation whenever it actually needs to read the data.
This happens, for example, when we print the array or convert it back to R via `as_array()`.
However, some operations such as accessing `dtype()`, `shape()` or `device()` do not require awaiting the result.

One classic mistake that can degrade GPU performance is to always force this synchronization.
For example, in a typical ML training loop, we have some batch preparation that runs on the CPU and then some expensive computation that runs on the GPU.
Ideally, the CPU and the GPU run simultaneously.
This is what will happen when you write code like the one below.
Because the GPU call returns immediately, the R loop will continue preparing the next batch while the GPU computation from the previous iteration is still running.
In the next iteration we will call into the `step()` function again and the computation will be scheduled to run on the GPU.
Note that the GPU computation from the previous `step()` call might not yet be done at this point.
This is exactly the situation we want to be in, as we don't want the GPU to idle.

```{r, eval = FALSE}
for (i in seq_len(n_steps)) {
  batch   <- prepare_batch(i)                # CPU-work
  weights <- step(batch$X, weights, batch$y) # GPU-work
}
```

One way to make this slower is to always print the result after each `step()`:

```{r, eval = FALSE}
# Bad: blocks every step to print the loss.
for (i in seq_len(n_steps)) {
  batch <- prepare_batch(i)                # CPU-work
  weights <- step(batch$X, weights, batch$y) # GPU-work
  print(loss(batch$X, weights, batch$y))
}
```

Because `print()` requires the actual data, we now have to wait for the GPU to finish its computation so the result can be sent back to the CPU.
But this means that while the next batch is being prepared, the GPU will be idle.
Solutions include logging less often, or printing the loss from the previous iteration.

## Memory

Value semantics and the fact that subset-assignment always copies were introduced in the [Get Started vignette](anvl.html#the-anvlarray).
This section covers the one extra lever `jit()` gives you: telling XLA it can reuse an input buffer for the output.

### Donation

Inside a jitted function XLA avoids unnecessary copies on its own.
Across the jit boundary, though, it cannot tell whether the R caller still needs the input arrays, so by default it has to leave them intact.
If you know a particular input will not be used after the call, you can pass its name to `donate` so XLA is free to reuse its memory for the output:

```{r}
update <- jit(function(x, delta) x + delta, donate = "x")

x <- nv_array(c(1, 2, 3), dtype = "f32")
x <- update(x, nv_array(c(0.1, 0.1, 0.1), dtype = "f32"))
x
```

One common scenario for this pattern is weight updates in training loops.

### Eager-mode subset-assignment always copies

In plain R, `y[i] <- val` may happen in place when `y` has only one reference.
Unfortunately, this is not possible in {anvl} and evaluating such calls in eager mode (outside of `jit()`) will always produce a copy.
This is because we are essentially calling into a `jit()`ted function and we can't always assume it's okay to donate.
In the future, we might add an option to do this in place.