This vignette summarizes various important considerations in order to write efficient programs using {anvl}. Some of the topics are covered in more depth elsewhere, but here we gather everything in one place, albeit sometimes only briefly.
f32 (the default) instead of f64.In eager mode we don’t compile the whole program but compile many tiny XLA programs. This is more convenient to debug as you can inspect intermediate results and step through the program with the R debugger. This comes at a performance cost, however. Specifically:
jit()ting the whole function, you only have the
per-call overhead once.To get the best performance, you will therefore usually want to
jit() the whole function. There might be some exceptions,
as compiling one large function can sometimes be more costly than
compiling several smaller ones, even though splitting it up gives the
compiler less room for joint optimization.
Provided your computation is large enough to be worth shipping over to the GPU, running on a CUDA GPU is the single biggest speed-up {anvl} offers – typically 10x to 100x over CPU on linear algebra and large elementwise work.
GPU support currently only works on Linux (amd64/x86-64) or via WSL2
on Windows. See the GPU
Installation section of the installation vignette for setup; once
that’s done, any AnvlArray constructor and
jit() accept device = "cuda":
library(anvl)
x <- nv_array(1:8, dtype = "f32", device = "cuda")
y <- nv_array(9:16, dtype = "f32", device = "cuda")
x + y
# Or pin a jitted function to a specific GPU:
f_cuda <- jit(function(x, y) x + y, device = "cuda:0")A few things to keep in mind when moving to GPU:
as_array() (which
copies the value back into R) inside loops you want to run fast (see Asynchronous execution below).GPUs get their speed from running thousands of operations in parallel, so they reward code that does a lot of work per call on large arrays. A few rules of thumb:
x * y or nv_reduce_sum(x)
instead of a for loop).nv_matmul() accept a full batch of matrices at once.
Stacking many small inputs into one larger array and processing them in
a single call is typically much faster than calling the same operation
many times. This mirrors plain R, where vectorized code beats a
for loop – except that on a GPU the win comes from running
the batch entries in parallel rather than from avoiding interpreter
overhead.The data type of an AnvlArray (f32,
f64, i32, i64, …) affects both
how much memory it uses and how fast operations on it run. The default
is to use f32 for R doubles and
i32 for R integers. Currently, we don’t
support other floating-point data types such as f16 or
bf16. There is of course a trade-off between efficiency and
precision here. Speedups from using f32 over
f64 also depend on the specific hardware. Commonly,
f32 is the default data type for floating-point operations
in GPU-accelerated frameworks, so unless you have a specific reason, we
recommend sticking with the defaults.
A handful of linear-algebra primitives – nv_svd(),
nv_eigh(), nv_qr(), nv_lu() – use
R’s BLAS / LAPACK on the CPU backend, so the performance of these
operations is determined by how R was built.
The default R install on many platforms ships with a reference BLAS / LAPACK that is single-threaded and considerably slower than tuned alternatives such as OpenBLAS, Intel MKL, or Apple’s Accelerate framework. The CUDA backend is unaffected.
Compiling an {anvl} function can take anywhere from milliseconds to seconds (or even minutes) depending on the size of the function. The compilation cache reuses the compiled function across calls with matching input types – see The compilation cache in the JIT Deep Dive for the details. What follows is the complementary trick of reshaping inputs so that more calls land on the same cache entry.
Every new combination of input shapes triggers a fresh recompile. If the input shapes vary often, this can considerably degrade performance and you might spend more time compiling than computing.
A common workaround is to round each varying dimension up to a fixed bucket size and pad the data to that size. All calls then share one compiled function, at the cost of doing some wasted work on the padded positions.
For example, suppose we want to sum a vector whose length varies between calls:
sum_jit <- jit(function(x) {
cat("compiling for length ", shape(x), "\n", sep = "")
nv_reduce_sum(x, dims = 1L)
})
for (n in c(7, 11, 13, 17)) {
sum_jit(nv_array(seq_len(n), dtype = "f32"))
}
#> compiling for length 7
#> compiling for length 11
#> compiling for length 13
#> compiling for length 17Each new length triggers a recompile. By padding each input up to the next multiple of 32 with zeros, all four calls share one compiled function:
pad_to_multiple <- function(x, m) {
n <- length(x)
c(x, rep(0, ceiling(n / m) * m - n))
}
for (n in c(7, 11, 13, 17)) {
sum_jit(nv_array(pad_to_multiple(seq_len(n), 32), dtype = "f32"))
}
#> compiling for length 32Here, we only compile once for input size 32 and then reuse the
compiled executable. One common strategy is to split up the varying
lengths into buckets of size 2^n to balance recompilation
with wasted computation. The example above only works because the padded
zeros are the neutral value of the performed computation
(addition). For computations where the pad value is not neutral –
e.g. the mean, where the padded positions would skew the average – the
padded entries have to be masked out explicitly. See the Static Shape Restriction vignette for the
masking patterns and a table of neutral values for common
reductions.
One surprising fact about {anvl} is that calls into compiled functions (as long as we have a cache hit) return almost immediately. In R terms, we would say what is actually returned is a promise. The actual computation keeps running in the background while the R interpreter can do something else. This is especially important when working with a GPU, because we want to use both CPU and GPU simultaneously. But it also matters on CPU as XLA runs many threads in parallel, while R itself is single-threaded. Note that you need to be aware of this when benchmarking your {anvl} functions.
The natural question is how we avoid reading wrong results when
functions always return immediately, before the computation has
finished. This works out because {anvl} will await the computation
whenever it actually needs to read the data. This happens, for example,
when we print the array or convert it back to R via
as_array(). However, some operations such as accessing
dtype(), shape() or device() do
not require awaiting the result.
One classic mistake that can degrade GPU performance is to always
force this synchronization. For example, in a typical ML training loop,
we have some batch preparation that runs on the CPU and then some
expensive computation that runs on the GPU. Ideally, the CPU and the GPU
run simultaneously. This is what will happen when you write code like
the one below. Because the GPU call returns immediately, the R loop will
continue preparing the next batch while the GPU computation from the
previous iteration is still running. In the next iteration we will call
into the step() function again and the computation will be
scheduled to run on the GPU. Note that the GPU computation from the
previous step() call might not yet be done at this point.
This is exactly the situation we want to be in, as we don’t want the GPU
to idle.
for (i in seq_len(n_steps)) {
batch <- prepare_batch(i) # CPU-work
weights <- step(batch$X, weights, batch$y) # GPU-work
}One way to make this slower is to always print the result after each
step():
# Bad: blocks every step to print the loss.
for (i in seq_len(n_steps)) {
batch <- prepare_batch(i) # CPU-work
weights <- step(batch$X, weights, batch$y) # GPU-work
print(loss(batch$X, weights, batch$y))
}Because print() requires the actual data, we now have to
wait for the GPU to finish its computation so the result can be sent
back to the CPU. But this means that while the next batch is being
prepared, the GPU will be idle. Solutions include logging less often, or
printing the loss from the previous iteration.
Value semantics and the fact that subset-assignment always copies
were introduced in the Get Started
vignette. This section covers the one extra lever jit()
gives you: telling XLA it can reuse an input buffer for the output.
Inside a jitted function XLA avoids unnecessary copies on its own.
Across the jit boundary, though, it cannot tell whether the R caller
still needs the input arrays, so by default it has to leave them intact.
If you know a particular input will not be used after the call, you can
pass its name to donate so XLA is free to reuse its memory
for the output:
update <- jit(function(x, delta) x + delta, donate = "x")
x <- nv_array(c(1, 2, 3), dtype = "f32")
x <- update(x, nv_array(c(0.1, 0.1, 0.1), dtype = "f32"))
x
#> AnvlArray
#> 1.1000
#> 2.1000
#> 3.1000
#> [ CPUf32{3} ]One common scenario for this pattern is weight updates in training loops.
In plain R, y[i] <- val may happen in place when
y has only one reference. Unfortunately, this is not
possible in {anvl} and evaluating such calls in eager mode (outside of
jit()) will always produce a copy. This is because we are
essentially calling into a jit()ted function and we can’t
always assume it’s okay to donate. In the future, we might add an option
to do this in place.