<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-04-20T23:17:06+00:00</updated><id>/feed.xml</id><title type="html">Brian Lawrence</title><subtitle>doing stuff with computers</subtitle><entry><title type="html">Optimizing matrix multiplication</title><link href="/opt/" rel="alternate" type="text/html" title="Optimizing matrix multiplication" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>/opt</id><content type="html" xml:base="/opt/"><![CDATA[<h1 id="optimizing-matrix-multiplication-on-an-rtx-3050">Optimizing matrix multiplication on an RTX 3050</h1>

<p>I was working on <a href="https://github.com/brian-lawrence-math/mytorch-draft">mytorch</a>, a mock tensor library for GPU, 
and I decided to spend some time optimizing the matrix multiplication function.
After all, matrix multiplication is the most expensive operation in deep learning,
and optimizing batched matrix multiplication is something of a rite of passage in performance computing.</p>

<p>If you’re writing a production tensor library like PyTorch, 
your code needs to be fast in many different settings:</p>
<ul>
  <li>on CPUs and GPUs of various architectures</li>
  <li>with tensors (batches of matrices) of various sizes</li>
  <li>with matrices in row- or column-major order</li>
  <li>with matrices that might not be contiguous in memory.</li>
</ul>

<p>That’s a lot of engineering.  I decided to start by optimizing a single task:
I want to see how fast I can perform a batched matrix
multiplication on two 32-bit float tensors of shape (100, 1000, 1000), 
in row-major order, on an RTX 3050 GPU.</p>

<h2 id="useful-reading">Useful reading</h2>

<p>I drew inspiration from a couple of terrific
<a href="https://www.aleksagordic.com/blog/matmul">blog</a>
<a href="https://siboehm.com/articles/22/CUDA-MMM">posts</a>.</p>

<h2 id="some-benchmarks">Some benchmarks</h2>

<p>CUDA provides a highly optimized library for the Basic Linear Algebra Subprograms (BLAS), called cuBLAS.
Batched matrix multiplication is one of the standard BLAS operations.
Using cuBLAS, our computation takes 68-72 ms.
(As always, the benchmark varies slightly from run to run, and somewhat more from day to day.)</p>

<p>My first attempt, with no optimizations, takes 840 ms, about 12 times slower than cuBLAS.</p>

<p>As of this writing, I’ve sped the calculation up to 72 ms,
achieving parity (or very close to it) with cuBLAS.
I’m working on further improvements.</p>

<h2 id="baseline-analysis">Baseline analysis</h2>

<p>The RTX 3050 has 18 streaming multiprocessors with 128 cores each, for a total of 2304 cores.
It runs at a boost clock speed of 1.47 GHz, for a total of ~3.39T cycle-cores per second.
At two floating-point operations per clock cycle,
the chip can achieve 6.77 TFLOPS.</p>

<p>The matrix multiplication requires 2*100*1000*1000*1000 = 200 billion floating-point operations.
Assuming 6.77 TFLOPS (and zero overhead) gives a time of 30 ms to perform the 
batched matrix multiplication.
I’ll take 30 ms as the theoretical best time.
Of course, this doesn’t take into account the time required for memory access
or any other computational overhead.</p>

<p>In my experiments, cuBLAS takes just over twice this estimated theoretical best time.</p>

<h2 id="a-simple-first-attempt">A simple first attempt</h2>

<p>I put together a simple CUDA C++ implementation of batched matrix multiplication on pytorch-style tensors,
with arbitrary dimension, shape and stride.</p>

<p>I assigned each thread to compute a single entry of the output tensor
(so our example will spin up 100 million threads),
and I packed 1024 threads per block, the maximum value (for a total of about 98,000 blocks).</p>

<p>The calculation took ~840 ms: 28 times the theoretical best time, and 12 times slower than cuBLAS.</p>

<p>Looks like we have some optimizing to do.</p>

<h2 id="first-thoughts">First thoughts</h2>

<p>The kernel is probably memory-bound, not compute-bound.
It requires a total of 200 billion floating-point operations,
but 800 GB of memory reads.
(At each step in the matrix-multiplication loop,
the kernel reads one float from each matrix, and does one multiplication and one addition:
two FLOPs and 8 bytes.)
Memory reads are much more expensive than compute:
the chip can perform about 6.77 TFLOPS, but memory bandwidth is only about 168 GB/s.
Caching will make the effective memory throughput somewhat better,
but I still expect memory to be the bottleneck.</p>

<p>The natural idea is to try thread coarsening or tiling.
I want to load some input data once and compute on it many times.
If one thread is responsible for computing a small box of matrix entries, rather than just one,
that thread can use its memory access more efficiently.
And by using shared memory
(low-latency memory which is shared among all threads in a block)
I can arrange for several threads to collaborate using data that is only loaded (from slow global memory) once.</p>

<p>But first I want to pick some low-hanging fruit.</p>

<h2 id="cutting-down-on-shape-and-stride-calculations">Cutting down on shape-and-stride calculations</h2>

<p>The kernel includes logic to handle arbitrary shapes and strides.
I think this logic is imposing a lot of unnecessary cost.</p>

<p>To start with, the shape-and-stride calculation is being done in the kernel:
each of my 100 million threads is reading the shapes and strides of the input tensors
and computing the index of the one entry it needs to access.
That’s a lot of repeated calculation.
Worse yet, the shapes and strides are stored in vectors
whose length (the dimension of the tensor) is unknown at compile time.
This means that the vectors are stored in global memory, resulting in
a lot of unnecessary memory access.
(Actually, since these values are accessed so often, 
they are probably stored in a low-level cache…)</p>

<p>Most matrix multiplication in practice works on batched matrices with a simple structure:
the matrices are contiguous in memory, in row-major order;
and the “matrix dimensions” are the two dimensions with the smallest strides.
So, I’ll try to optimize a simple case first:
a batched matrix product of two three-dimensional (batch, row, col) tensors.</p>

<p>To that end, I wrote a new kernel <code class="language-plaintext highlighter-rouge">matmul_3d()</code>
that assumes its inputs are contiguous three-dimensional tensors,
and accepts the shape directly as a function argument.
The result: from 840 ms down to 800 ms.</p>

<p>OK, let’s try to get more economies of scale.</p>

<h2 id="profiling">Profiling</h2>
<p>Nvidia provides a powerful profiler, ncu.
The profiler shows all sorts of metrics, including memory throughput, cache (L1 and L2) throughput, compute throughput, occupancy and workload statistics…
It even offers suggestions for optimization.</p>

<p>In my experience, at this stage, 
it’s best to focus on writing efficient code.
The profiler is a great tool later on, when I have specific questions about resource usage.</p>

<p>The profiler’s first suggestion is that I change the number of threads per block 
to increase occupancy.
That’s not the biggest priority at this point – and indeed,
if I make the change the profiler suggests, performance gets worse.</p>

<p>So: I’m going to ignore ncu’s advice and get back to writing solid code.
The big bottleneck is memory access, so that’s where I’m going to focus.
But the profiler will come back later on.</p>

<h2 id="improving-memory-efficiency">Improving memory efficiency</h2>

<p>To start with, I’m going to make two improvements to the kernel.</p>
<ul>
  <li>Load inputs into shared memory, in batches, and</li>
  <li>make each thread responsible for more than one output entry.</li>
</ul>

<p>I’m going to make configurable parameters for:</p>
<ul>
  <li>TM and TN – these determine how many output values each thread will calculate;</li>
  <li>TPBM and TPBN – these determine how many threads are in a block; and</li>
  <li>BK – the chunk size of the multiplication loop.</li>
</ul>

<p>This last parameter needs some explanation.
Each block of threads is responsible for computing a (TM*TPBM) by (TN*TPBN) submatrix
of the output matrix.
To do this it will need to access some number of full rows of the first input,
and some number of full columns of the second;
then it will loop over the columns of the first input (and the rows of the second).</p>

<p>There might not be enough room in shared memory for all the rows and columns that need to be loaded.
So, instead of being loaded in full, they will be loaded in blocks of BK.
In other words: matrix multiplication involves a summation over the intermediate dimension;
we will break that summation into chunks of size BK,
and compute partial sums one chunk at a time.</p>

<h2 id="improving-memory-access-patterns">Improving memory access patterns</h2>

<p>Global memory is stored in DRAM; shared memory (and the L1 cache) is stored in SRAM.
DRAM reads memory in consecutive 32-byte chunks; if I don’t use all 32 bytes, I’m wasting bandwidth.
SRAM is organized into 32 banks, each 4 bytes wide (so, for example, bank 0 is responsible for byte addresses 0, 1, 2, 3 modulo 128).
In a single read, SRAM can serve one 4-byte word from each of its 32 banks, independently.
So we want each thread in a warp to read from a different bank.
If multiple threads request different words from the same bank, the result is a “bank conflict”:
the SRAM has to perform multiple physical reads before the result can be returned.</p>

<p>In both situations, a good pattern is for the 32 threads in a warp to access consecutive floats in memory:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">int</span> <span class="n">idx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="n">data</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span> <span class="p">...</span> <span class="p">.</span></code></pre></figure>

<p>First, I’ll make sure the number of rows in each block of threads is 32 (at least when the matrices have &gt;= 32 rows);
this means each warp is exactly one row.</p>

<p>Now let’s plan how to arrange memory and threads.  As far as memory:</p>
<ul>
  <li>The input tensors are already laid out contiguously in row-major order; we can’t change that;</li>
  <li>The result tensor is also in row-major order; we can’t touch it either;</li>
  <li>But the “shared” tensors (copies of tiles of input tensors that reside in shared memory) can be arranged how we like.</li>
</ul>

<p>And as far as thread arrangement, we have to decide how to divide up each of these three operations among threads in the block:</p>
<ul>
  <li>Copy input tensor a into shared memory;</li>
  <li>Copy input tensor b into shared memory;</li>
  <li>Perform the “multiplication loop” to compute entries of output tensor.</li>
</ul>

<p>Let’s start with the entries of the output tensor.  Conceptually it looks something like the following.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">float</span> <span class="n">cml_sum</span> <span class="o">=</span> <span class="mf">0.0</span><span class="n">f</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">loop_idx</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">...</span> <span class="p">)</span> <span class="p">{</span>
    <span class="n">cml_sum</span> <span class="o">+=</span> <span class="n">a_shared</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">loop_idx</span><span class="p">]</span> <span class="o">*</span> <span class="n">b_shared</span><span class="p">[</span><span class="n">loop_idx</span><span class="p">][</span><span class="n">col</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">result</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">+=</span> <span class="n">cml_sum</span><span class="p">;</span></code></pre></figure>

<p>A natural choice is to have consecutive threads operate on the same “row” and consecutive “col”:
this way the global memory writes in the last line are efficient, with all 32 threads in the warp
writing to one 128-byte line of global memory (or two lines, if the alignment isn’t right).
Assuming b_shared is stored in row-major order, the shared memory reads inside the loop are good as well:
all threads read the same entry from a_shared, which is efficient (it’s called “broadcasting”),
and the 32 threads read from 32 consecutive entries of b_shared.</p>

<p>As for copying global ‘a’ and ‘b’ into shared ‘a_shared’ and ‘b_shared’: it’s the same idea.
I store ‘a_shared’ and ‘b_shared’ in row-major order, so data that is contiguous in ‘a’ 
is also contiguous in ‘a_shared’.
Then I arrange for all the threads in the block to handle consecutive floats, one float each.</p>

<h2 id="thread-local-results-in-registers">Thread-local results in registers</h2>

<p>The kernel is still suffering from global memory writes inside a tight loop:
if you look back up at the “conceptual” code, the line</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">result</span><span class="p">[</span><span class="n">row</span><span class="p">][</span><span class="n">col</span><span class="p">]</span> <span class="o">+=</span> <span class="n">cml_sum</span><span class="p">;</span></code></pre></figure>

<p>entails a read and a write to slow global DRAM.</p>

<p>Better: store per-thread results in registers until the calculation is done.
In pseudocode:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">float</span> <span class="n">tmp</span><span class="p">[</span><span class="n">TM</span><span class="p">][</span><span class="n">TN</span><span class="p">];</span>
<span class="k">for</span> <span class="p">(</span><span class="n">k</span> <span class="p">...)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">m_ctr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">m_ctr</span> <span class="o">&lt;</span> <span class="n">TM</span><span class="p">;</span> <span class="n">m_ctr</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">n_ctr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n_ctr</span> <span class="o">&lt;</span> <span class="n">TN</span><span class="p">;</span> <span class="n">n_ctr</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">tmp</span><span class="p">[</span><span class="n">m_ctr</span><span class="p">][</span><span class="n">n_ctr</span><span class="p">]</span> <span class="o">+=</span> <span class="n">A_s</span><span class="p">[</span><span class="n">m</span><span class="p">][</span><span class="n">k</span><span class="p">]</span> <span class="o">*</span> <span class="n">B_s</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="n">n</span><span class="p">];</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">for</span> <span class="p">(</span><span class="n">m_ctr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">m_ctr</span> <span class="o">&lt;</span> <span class="n">TM</span><span class="p">;</span> <span class="n">m_ctr</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">n_ctr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">n_ctr</span> <span class="o">&lt;</span> <span class="n">TN</span><span class="p">;</span> <span class="n">n_ctr</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">result</span><span class="p">[...]</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">[...];</span>
    <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<h2 id="parameter-tuning">Parameter tuning</h2>

<p>At this point I have a bunch of different parameters:</p>
<ul>
  <li>the dimensions of a tile TM, TN;</li>
  <li>the number of tiles per block (in row and column dimensions);</li>
  <li>the chunk size BK of the loop over the intermediate dimension $k$.</li>
</ul>

<p>After some manual experimentation with these parameters I find the following values.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="cp">#define TM (8)
#define TN (8)
#define TPBM (16)
#define TPBN (16)
#define BM (TM * TPBM)
#define BN (TN * TPBN)
#define BK (16)</span></code></pre></figure>

<p>My batched matrix multiplication is now at 107 ms.
That’s still about 1.5x the time the cuBLAS kernel takes, but I have one more trick up my sleeve.</p>

<h2 id="vectorized-loads">Vectorized loads</h2>

<p>GPUs offer a single instruction to read 128 bits (= 16 bytes = 4 floats) at once.
The catch is that the read address has to be 16-byte aligned.
If you promise the compiler that your address is 16-byte aligned,
the compiler will give you vectorized reads and writes.
One easy way to do this, syntactically, is with the <code class="language-plaintext highlighter-rouge">float4</code> type,
which represents 4 consecutive floats,
aligned to a multiple of 16 bytes.</p>

<p>Here’s some mock code to show how to do it: if you’re copying a bunch of floats</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="kt">float</span> <span class="o">*</span><span class="n">A</span><span class="p">,</span> <span class="o">*</span><span class="n">B</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span></code></pre></figure>

<p>you can simply cast the pointers from <code class="language-plaintext highlighter-rouge">float*</code> to <code class="language-plaintext highlighter-rouge">float4*</code>, like so:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// gentle reminder that things need to be aligned</span>
<span class="n">assert</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)</span> <span class="n">A</span> <span class="o">%</span> <span class="mi">16</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>

<span class="c1">// and that I'm not handling edge effects</span>
<span class="n">assert</span><span class="p">(</span><span class="n">N</span> <span class="o">%</span> <span class="mi">4</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">)</span> <span class="p">{</span>
    <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">=</span> <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span></code></pre></figure>

<p>In other words, we’ve replaced $N$ assignments <code class="language-plaintext highlighter-rouge">float = float</code>
with $N/4$ assignments <code class="language-plaintext highlighter-rouge">float4 = float4</code>.</p>

<p>And here’s what the change looks like in my kernel.
There’s all sorts of messy indexing going on thanks to tiling in various dimensions,
but notice how simple it is to change plain <code class="language-plaintext highlighter-rouge">float</code> loads to vectorized <code class="language-plaintext highlighter-rouge">float4</code> loads:
I really just multiply the step size by 4 and hit the pointers with <code class="language-plaintext highlighter-rouge">reinterpret_cast</code>.</p>

<p>Before:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// A_s[i, j] = A[a_row + i, k + j]</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">a_idx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">&lt;</span> <span class="n">BM</span> <span class="o">*</span> <span class="n">BK</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">+=</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">/</span> <span class="n">BK</span><span class="p">;</span>
  <span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">%</span> <span class="n">BK</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">k0</span> <span class="o">+</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">K</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">+</span> <span class="n">a_row_base</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">+</span> <span class="n">k0</span> <span class="o">+</span> <span class="n">j</span><span class="p">];</span>
  <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>After:</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// A_s[i, j] = A[a_row + i, k + j]</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">a_idx</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">&lt;</span> <span class="n">BM</span> <span class="o">*</span> <span class="n">BK</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">+=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">/</span> <span class="n">BK</span><span class="p">;</span>
  <span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">%</span> <span class="n">BK</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">k0</span> <span class="o">+</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">K</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">+</span> <span class="n">a_row_base</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">)</span> <span class="p">{</span>
    <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">])</span> <span class="o">=</span> <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">+</span> <span class="n">k0</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
  <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<p>Of course, there’s some additional boilerplate to check bounds and alignment.
Nvidia guarantees that cudaMalloc returns 256-byte aligned memory,
but we still need to worry about the dimensions of the matrix (which could be arbitrary),
and we’d better make sure we choose the parameter $BK$ to be a multiple of 4.
I’m deliberately running this initial benchmark on 1000-by-1000 matrices
so I don’t have to worry about edge effects.</p>

<p>Vectorized loads from global into shared memory bring the runtime down to 72 ms:
we’ve achieved parity with cuBLAS (on a good day).</p>

<h2 id="the-existing-kernel-in-code">The existing kernel, in code</h2>

<p>Here it is.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="n">__global__</span> <span class="kt">void</span> <span class="nf">matmul_tiled_2</span><span class="p">(</span><span class="n">ContiguousTensor3d_Device</span> <span class="n">a</span><span class="p">,</span>
                               <span class="n">ContiguousTensor3d_Device</span> <span class="n">b</span><span class="p">,</span>
                               <span class="n">ContiguousTensor3d_Device</span> <span class="n">res</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// blocks: (batch, row, col)</span>
  <span class="c1">// threads: (n_threads, 1, 1)  -- I will manage indexing myself</span>

  <span class="kt">size_t</span> <span class="n">M</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">K</span> <span class="o">=</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="n">N</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
  <span class="kt">size_t</span> <span class="n">a_row_base</span> <span class="o">=</span> <span class="n">BM</span> <span class="o">*</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">y</span><span class="p">;</span>
  <span class="kt">size_t</span> <span class="n">b_col_base</span> <span class="o">=</span> <span class="n">BN</span> <span class="o">*</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">z</span><span class="p">;</span>
  <span class="kt">size_t</span> <span class="n">batch_idx</span> <span class="o">=</span> <span class="n">blockIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
  <span class="kt">float</span> <span class="o">*</span><span class="n">A</span> <span class="o">=</span>
      <span class="n">a</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">a_row_base</span> <span class="o">*</span> <span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
  <span class="kt">float</span> <span class="o">*</span><span class="n">B</span> <span class="o">=</span> <span class="n">b</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">b_col_base</span><span class="p">;</span>
  <span class="kt">float</span> <span class="o">*</span><span class="n">RES</span> <span class="o">=</span> <span class="n">res</span><span class="p">.</span><span class="n">data</span> <span class="o">+</span> <span class="n">batch_idx</span> <span class="o">*</span> <span class="n">res</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="n">res</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span>
               <span class="n">a_row_base</span> <span class="o">*</span> <span class="n">res</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">b_col_base</span><span class="p">;</span>

  <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">A_s</span><span class="p">[</span><span class="n">BM</span> <span class="o">*</span> <span class="n">BK</span><span class="p">];</span>
  <span class="n">__shared__</span> <span class="kt">float</span> <span class="n">B_s</span><span class="p">[</span><span class="n">BK</span> <span class="o">*</span> <span class="n">BN</span><span class="p">];</span>

  <span class="c1">// store results in registers</span>
  <span class="c1">// each thread will be responsible for TM * TN entries of C...</span>
  <span class="c1">// TM rows and TN cols, the rows strided every TPBM, the cols strided every</span>
  <span class="c1">// TPBN</span>
  <span class="kt">float</span> <span class="n">tmp</span><span class="p">[</span><span class="n">TM</span> <span class="o">*</span> <span class="n">TN</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">0</span><span class="p">};</span>

  <span class="kt">size_t</span> <span class="n">this_thread_row</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">/</span> <span class="n">TPBN</span><span class="p">;</span>
  <span class="kt">size_t</span> <span class="n">this_thread_col</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span> <span class="o">%</span> <span class="n">TPBN</span><span class="p">;</span>

  <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">k0</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k0</span> <span class="o">&lt;</span> <span class="n">K</span><span class="p">;</span> <span class="n">k0</span> <span class="o">+=</span> <span class="n">BK</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// load a and b</span>
    <span class="c1">// A_s[i, j] = A[a_row + i, k + j]</span>

    <span class="c1">// values to fill: BM * BK</span>
    <span class="c1">// threads: TPB</span>
    <span class="c1">// want: blockDim.x div. by BK</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">a_idx</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">&lt;</span> <span class="n">BM</span> <span class="o">*</span> <span class="n">BK</span><span class="p">;</span> <span class="n">a_idx</span> <span class="o">+=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
      <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">/</span> <span class="n">BK</span><span class="p">;</span>
      <span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">a_idx</span> <span class="o">%</span> <span class="n">BK</span><span class="p">;</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">k0</span> <span class="o">+</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">K</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">+</span> <span class="n">a_row_base</span> <span class="o">&lt;</span> <span class="n">M</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">])</span> <span class="o">=</span> <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">A</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">+</span> <span class="n">k0</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
      <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">A_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1">// B_s[i, j] = B[k + i, b_col + j]</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">b_idx</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span> <span class="n">b_idx</span> <span class="o">&lt;</span> <span class="n">BK</span> <span class="o">*</span> <span class="n">BN</span><span class="p">;</span> <span class="n">b_idx</span> <span class="o">+=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
      <span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">b_idx</span> <span class="o">/</span> <span class="n">BN</span><span class="p">;</span>
      <span class="kt">size_t</span> <span class="n">j</span> <span class="o">=</span> <span class="n">b_idx</span> <span class="o">%</span> <span class="n">BN</span><span class="p">;</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">k0</span> <span class="o">+</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">K</span> <span class="o">&amp;&amp;</span> <span class="n">j</span> <span class="o">+</span> <span class="n">b_col_base</span> <span class="o">&lt;</span> <span class="n">N</span><span class="p">)</span> <span class="p">{</span>
        <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">B_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">j</span><span class="p">])</span> <span class="o">=</span> <span class="o">*</span><span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="n">float4</span><span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">B</span><span class="p">[(</span><span class="n">k0</span> <span class="o">+</span> <span class="n">i</span><span class="p">)</span> <span class="o">*</span> <span class="n">b</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">j</span><span class="p">]);</span>
      <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">B_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">B_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">B_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="n">B_s</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">j</span> <span class="o">+</span> <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
      <span class="p">}</span>
    <span class="p">}</span>

    <span class="n">__syncthreads</span><span class="p">();</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">k</span> <span class="o">&lt;</span> <span class="n">BK</span><span class="p">;</span> <span class="n">k</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
      <span class="kt">float</span> <span class="n">B_reg</span><span class="p">[</span><span class="n">TN</span><span class="p">];</span>

      <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">col_counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col_counter</span> <span class="o">&lt;</span> <span class="n">TN</span><span class="p">;</span> <span class="n">col_counter</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="kt">size_t</span> <span class="n">col</span> <span class="o">=</span> <span class="n">col_counter</span> <span class="o">*</span> <span class="n">TPBN</span> <span class="o">+</span> <span class="n">this_thread_col</span><span class="p">;</span>
        <span class="n">B_reg</span><span class="p">[</span><span class="n">col_counter</span><span class="p">]</span> <span class="o">=</span> <span class="n">B_s</span><span class="p">[</span><span class="n">k</span> <span class="o">*</span> <span class="n">BN</span> <span class="o">+</span> <span class="n">col</span><span class="p">];</span>
      <span class="p">}</span>

      <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">row_counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row_counter</span> <span class="o">&lt;</span> <span class="n">TM</span><span class="p">;</span> <span class="n">row_counter</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// filling in row a_row_base + row_counter * TPBM + this_thread_row of C</span>
        <span class="c1">// which is row row_counter of tmp</span>
        <span class="kt">size_t</span> <span class="n">row</span> <span class="o">=</span> <span class="n">row_counter</span> <span class="o">*</span> <span class="n">TPBM</span> <span class="o">+</span> <span class="n">this_thread_row</span><span class="p">;</span>
        <span class="kt">float</span> <span class="n">a_val</span> <span class="o">=</span> <span class="n">A_s</span><span class="p">[</span><span class="n">row</span> <span class="o">*</span> <span class="n">BK</span> <span class="o">+</span> <span class="n">k</span><span class="p">];</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">col_counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col_counter</span> <span class="o">&lt;</span> <span class="n">TN</span><span class="p">;</span> <span class="n">col_counter</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
          <span class="c1">// col b_col_base + col_counter * TPBN + this_thread_col of C</span>
          <span class="c1">// which is col col_counter of tmp</span>
          <span class="kt">float</span> <span class="n">b_val</span> <span class="o">=</span> <span class="n">B_reg</span><span class="p">[</span><span class="n">col_counter</span><span class="p">];</span>

          <span class="n">tmp</span><span class="p">[</span><span class="n">row_counter</span> <span class="o">*</span> <span class="n">TN</span> <span class="o">+</span> <span class="n">col_counter</span><span class="p">]</span> <span class="o">+=</span> <span class="n">a_val</span> <span class="o">*</span> <span class="n">b_val</span><span class="p">;</span>
        <span class="p">}</span>
      <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">__syncthreads</span><span class="p">();</span>
  <span class="p">}</span>
  <span class="c1">// now tmp[row * TN + col] goes into RES[(row * TPBM + this_thread_row) *</span>
  <span class="c1">// res.shape[2] + (col * TPBN + this_thread_col)]</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">row_counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">row_counter</span> <span class="o">&lt;</span> <span class="n">TM</span><span class="p">;</span> <span class="n">row_counter</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">col_counter</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">col_counter</span> <span class="o">&lt;</span> <span class="n">TN</span><span class="p">;</span> <span class="n">col_counter</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">RES</span><span class="p">[(</span><span class="n">row_counter</span> <span class="o">*</span> <span class="n">TPBM</span> <span class="o">+</span> <span class="n">this_thread_row</span><span class="p">)</span> <span class="o">*</span> <span class="n">res</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span>
          <span class="p">(</span><span class="n">col_counter</span> <span class="o">*</span> <span class="n">TPBN</span> <span class="o">+</span> <span class="n">this_thread_col</span><span class="p">)]</span> <span class="o">=</span>
          <span class="n">tmp</span><span class="p">[</span><span class="n">row_counter</span> <span class="o">*</span> <span class="n">TN</span> <span class="o">+</span> <span class="n">col_counter</span><span class="p">];</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre></figure>

<h2 id="more-optimizations">More optimizations</h2>

<p>This is still a work in progress.  The most promising next optimizations are…</p>
<ul>
  <li>double-buffering, and</li>
  <li>a thorough parameter sweep.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Optimizing matrix multiplication on an RTX 3050]]></summary></entry><entry><title type="html">Speed-running Integer Factorization with AI</title><link href="/quadratic-sieve/" rel="alternate" type="text/html" title="Speed-running Integer Factorization with AI" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>/quadratic-sieve</id><content type="html" xml:base="/quadratic-sieve/"><![CDATA[<p><a href="https://github.com/brian-lawrence-math/quadratic-sieve">The repo.</a></p>

<p>The <a href="https://en.wikipedia.org/wiki/Quadratic_sieve">quadratic sieve</a> is a fast integer factorization algorithm:
if a positive integer $N$ is the product of two large prime numbers \(N = pq\),
then quadratic sieve will find the two factors reasonably quickly.
“Reasonably” is the key word here: if \(N\) is an \(n\)-bit integer
(so \(N \approx 2^n\)), then the runtime of quadratic sieve is something like \(2^{\sqrt{n}}\).
This is much faster than trial division (runtime \(\sqrt{N} \approx 2^{n/2}\))
but still not even close to a polynomial-time algorithm.</p>

<p>Integer factorization is relevant because, if you can factor large integers,
you can break the <a href="https://en.wikipedia.org/wiki/RSA_cryptosystem">RSA cryptosystem</a>.
In fact, it’s precisely because of quadratic sieve and <a href="https://en.wikipedia.org/wiki/General_number_field_sieve">algorithms like it</a>
that many systems have switched over from RSA to <a href="https://en.wikipedia.org/wiki/Elliptic-curve_cryptography">elliptic curve cryptography</a>
in recent years.</p>

<p>Anyway, I’m going to implement quadratic sieve on an Nvidia RTX 3050,
with some help from OpenAI’s Codex.
I want to see what size of integer I can factor in a reasonable amount of time
(let’s say one minute or so).
This will be a good excuse for me to explore both</p>
<ul>
  <li>the challenges of adapting an interesting algorithm for parallelization, and</li>
  <li>what an AI agent can do.</li>
</ul>

<p>This is a work in progress: I’ve gone through a couple of iterations
with Codex, and I’ve already factored some pretty big numbers,
but I’m still experimenting to see just how much I can get the AI to do.</p>

<h2 id="summary-what-coding-agents-can-and-cant-do">Summary: what coding agents can and can’t do</h2>

<p>Before I get to the juicy details, here’s what I’ve learned from working with Codex.</p>

<p>Codex is very, very good at basic coding tasks.
It never forgets semicolons.  Its code always compiles.
It knows how to organize a project, 
how to start by building something simple that works and make iterative improvements from there.</p>

<p>It’s very good at reading and writing code. It quickly spots the sort of edge cases and off-by-one errors that used to cause me so much grief.</p>

<p>Codex also knows, in broad strokes, how the quadratic sieve algorithm is supposed to work.
“First you choose a factor base, then you find relations, then you solve a binary matrix.”</p>

<p>So working with Codex is a tremendous productivity boost.</p>

<p>But Codex is reluctant to do any sort of experimentation or analysis, beyond “write code and see if it works”.
Throughout the project, I had to give it two kinds of prompts:</p>
<ul>
  <li>Requests for empirical data:
Even though I told Codex I was interested in optimizing performance,
the agent was reluctant to run experiments and gather data.
For example:
    <ul>
      <li>The agent gave up on a small factorization task because it only found half the necessary number of relations.
But the target was 60 seconds, and the program only ran for about 1 second.
The solution – to double or triple the size of the search space – would have been obvious
if Codex had measured the runtime.</li>
      <li>On a longer-running program, the agent put a bunch of effort into optimizing a trivially small code path
(the is_squarefree() function).
This function took one millisecond out of a total of about 4 seconds of execution.
Again, a misdirected effort that could have been prevented by some simple timing.
In fact, it’s even worse: (1) Codex already had timing data in context, it just had to look, and 
(2) even without experimental data, it should have been clear that is_squarefree() just wasn’t the bottleneck here.</li>
    </ul>
  </li>
  <li>Theoretical analysis (asymptotic estimates, etc.).
Codex can do this sort of thing pretty well if I hold its hand,
but by default Codex would rather code than think.
    <ul>
      <li>When the search could not find enough relations, Codex suggested decreasing the factor base bound \(B\).
In fact the solution is precisely the opposite: making \(B\) bigger makes relations more plentiful.</li>
  <li>It’s important to find the right search space for relations; 
doing this right gives a huge performance boost.
It takes a bit of theoretical work to figure out that search space.
(It’s nothing terribly complicated; I explain the basics further down in this post.)
Codex can do every step of the calculation, but I couldn’t get it to do the full analysis with any degree of autonomy.</li>
    </ul>
  </li>
</ul>

<p>In summary: Codex is a terrific coder and a great productivity boost.
If I want to get more out of it, the next challenge is to get it to divide its efforts
among implementation, empirical data-gathering, and theoretical analysis.</p>

<h2 id="the-algorithm">The algorithm</h2>

<p>OK, with that summary out of the way, let’s move on to a brief overview of quadratic sieve.
I want to give you a sense of</p>
<ul>
  <li>the different parameter choices and how they affect performance, and</li>
  <li>how amenable things are to parallelization on GPU.</li>
</ul>

<p>If you like to read code, take a look at a basic <a href="https://github.com/brian-lawrence-math/quadratic-sieve/blob/main/python/qs.py">Python implementation</a>.</p>

<p>Just to fix ideas, imagine that $N = pq$ is a 100-200 bit integer (30-60 digits),
and the primes $p$ and $q$ are about the same size.
With these parameters, we’d like things to run in a few seconds on a GPU.</p>

<p>The algorithm has two main steps, plus pre- and post-processing (which should be “fast”).</p>

<ul>
  <li>Preprocessing: Choose a factor base bound $B$ (for us, maybe $1000$, $10000$, $100000$) and a search plan for the next step.</li>
  <li>Search for “relations”.  A relation is a pair $(k, a)$ such that
\[
\text{all the prime factors of } a^2 - kN    \text{ are less than } B.
\]
In other words:
\[ a^2 \equiv \text{product of small primes (mod } N\text{).} \]
The number of relations you need to find is a few more than the number of primes less than $B$
(which is about $B / \log B$).</li>
  <li>Solve a linear system: Combine some of the relations to get a handful of perfect-square relations of the form
\[ a^2 \equiv b^2 \text{(mod } N \text{).} \]
This step amounts to solving a large binary matrix.</li>
  <li>Postprocessing: For each perfect-square relation, you know that 
\[ a^2 - b^2 = (a + b) (a - b) \]
is a multiple of $N$.<br />
Compute
\[ \operatorname{gcd}(a+b, N) \]
using the Euclidean algorithm.
There is a 50% chance that this gcd will be a nontrivial factor of $N$ (either $p$ or $q$), 
in which case you are done.
If not, repeat with more perfect-square relations until you find a factor.</li>
</ul>
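<p>To make these steps concrete, here is a toy end-to-end sketch in Python (my own illustration, not code from the repo): it factors \(N = 8051\) with a tiny factor base, and it replaces the matrix-solving step with a brute-force search over single relations and pairs, which is only viable at this tiny scale.</p>

```python
# Toy quadratic sieve on N = 8051 = 83 * 97 (illustration only).
from math import gcd, isqrt

N = 8051
factor_base = [2, 3, 5, 7, 11, 13]  # primes below a small bound B

def smooth_exponents(m):
    """Exponent vector of m over the factor base, or None if m is not smooth."""
    exps = []
    for p in factor_base:
        e = 0
        while m % p == 0:
            m //= p
            e += 1
        exps.append(e)
    return exps if m == 1 else None

# Search step: look for relations a^2 - N = product of small primes
# (this toy version only uses k = 1).
relations = []
for a in range(isqrt(N) + 1, isqrt(N) + 200):
    exps = smooth_exponents(a * a - N)
    if exps is not None:
        relations.append((a, exps))

def try_square(subset):
    """Combine relations; if all exponents are even, try the gcd trick."""
    total = [sum(col) for col in zip(*(exps for _, exps in subset))]
    if any(e % 2 for e in total):
        return None
    a = 1
    for x, _ in subset:
        a = a * x % N
    b = 1
    for p, e in zip(factor_base, total):
        b = b * pow(p, e // 2, N) % N
    d = gcd(a + b, N)
    return d if 1 < d < N else None

# Solve step, brute-forced: try single relations, then pairs.
factor = None
subsets = [(r,) for r in relations]
subsets += [(r, s) for i, r in enumerate(relations) for s in relations[i + 1:]]
for subset in subsets:
    factor = try_square(subset)
    if factor:
        break

print(factor, N // factor)  # a nontrivial factorization of N
```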

<h3 id="asymptotics">Asymptotics</h3>

<p>In choosing $B$, there is a tradeoff between the search for relations and solving the linear system.
If $B$ is too small, relations will be hard to find (it’s a very rare number that is only divisible by 2, 3 or 5!).
But the larger $B$ is, the larger the matrix that needs to be solved in the solving step.</p>

<p>Here are some impressionistic asymptotics: I’m ignoring some important log factors and constants
so I can paint a clear picture in your head.
The goal is not to get precise runtime estimates (those come from experiment!)
but to give you a mental model of the tradeoffs.</p>

<p>Suppose $B$ is a $b$-bit integer, and suppose $N \approx B^k$.  In other words:
\[  B \approx 2^b \]
\[  N \approx 2^n \]
\[  n = bk.  \]</p>

<p>Then the search for relations will take something like $2^{2k}$ time – the smaller $B$ is compared to $N$, the rarer these relations are.</p>

<p>Solving the linear system will take something like $2^{3b}$ time – you’re solving a matrix of size (approximately) $B$,
and solving a matrix (at least by the naive algorithm) takes cubic time.</p>

<p>Now you can see where the $2^{\sqrt{n}}$ asymptotic comes from.  Suppose you have $2^t$ total time budget.
You want to divide it close to evenly between search and solve;
let’s say you allocate $2^t$ for each step.
(Oops!  Did I just double your budget?
Don’t worry, this is already a sloppy estimate, one more factor of 2 is no big deal.)
This means you want
\[ k = t/2 \]
\[ b = t/3 \]
so you can factor a number of up to
\[ n = bk = t^2 / 6 \]
bits.</p>
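<p>As an equally impressionistic sanity check: with a budget of \(2^{30}\) operations per phase (so \(t = 30\)), take \(k = t/2 = 15\) and \(b = t/3 = 10\).  Then \(B \approx 2^{10} \approx 1000\), and you can hope to factor an \(N\) of up to \(n = bk = 150\) bits, matching \(n = t^2/6 = 900/6 = 150\).</p>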

<p>In practice, this means:</p>
<ul>
  <li>The algorithm has two phases, search and solve, which can be run, timed, and optimized independently.</li>
  <li>Increasing $B$ makes the search phase faster and the solve phase slower.</li>
  <li>The solve phase runtime depends only on the size of $B$, not on $N$.  In practice, $B$ in the tens of thousands might take a few seconds on a CPU.</li>
</ul>

<h3 id="more-analysis-the-search-space">More analysis: the search space</h3>

<p>Like I said before, we want 
\[ a^2 - k N \]
to have a factorization into lots of small factors.
Of course, this is most likely to happen if $a^2 - kN$ is small!
So the natural thing to do is to pick some smallish integer $k$, and then search for $a$ in some interval around $\sqrt{kN}$.
In other words, we’ll look at
\[ a = x + \left \lfloor \sqrt{kN} \right \rfloor \]
where $k$ is a small positive integer, and $x$ is a small integer (positive or negative).</p>

<p>What sorts of $k$ and $x$ should we search over?
I think it’s pretty clear that we should figure out which $k$ and $x$ will make
\[ a^2 - k N \]
be the smallest, and target those first.</p>

<p>With a little bit of algebra, you can find that
\[ a^2 - k N \approx 2 x \sqrt{kN} \]
when $x$ is small.  So the obvious strategy is to pick some bound $M$ and search all pairs $k$ and $x$
for which
\[ -M &lt; 2x \sqrt{kN} &lt; M. \]</p>
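<p>In Python, such a search plan might be sketched like this (my own illustration, not the repo’s implementation; the bound M and the cutoff k_max are made-up parameters, and a real implementation would generate pairs lazily rather than sort them all up front):</p>

```python
# Enumerate (k, x) pairs ordered by the estimated size of |a^2 - kN|,
# which is roughly |2x| * sqrt(kN) for small x.  Illustration only.
from math import isqrt

def search_plan(N, M, k_max):
    """Return (k, x) pairs with |2x| * sqrt(kN) <= M, smallest estimate first."""
    pairs = []
    for k in range(1, k_max + 1):
        root = isqrt(k * N)
        x_bound = M // (2 * root)  # largest |x| allowed for this k
        for x in range(-x_bound, x_bound + 1):
            pairs.append((abs(2 * x) * root, k, x))
    pairs.sort()
    return [(k, x) for _, k, x in pairs]

# For each pair, the candidate is a = x + floor(sqrt(kN)).
plan = search_plan(N=10**9 + 7, M=10**7, k_max=5)
```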

<p>At least, I think this is obvious.  Codex doesn’t.
Codex wants to pick a single value of $k$ and then search over $x$ in an ever-growing interval.
I wanted to see if I could get Codex to do the analysis my way;
after all, the calculations I just did are well within its abilities!
It took a surprising amount of coaxing, but in the end we got a 5-10x speedup on the search step.</p>

<p>OK, enough of theory, let’s get down to the metal.</p>

<h2 id="parallelizing-the-search-and-an-interesting-tradeoff">Parallelizing the search, and an interesting tradeoff</h2>

<p>Both of the two big steps (search and solve) could potentially benefit from parallelization,
but search is the natural first target.
After all, the search is what they call “embarrassingly parallel”:
each pair $(k, x)$ either is a hit or it isn’t.
On the other hand, imagine trying to solve a matrix – think row reduction or Gauss-Jordan elimination –
in parallel.
(<a href="https://en.wikipedia.org/wiki/Gaussian_elimination">Wikipedia</a> has a nice animation.)
Maybe each thread can take responsibility for reducing a single row,
but each pivot row will need to be broadcast to the threads, one after another.
And that’s not to mention some of the more complicated <a href="https://en.wikipedia.org/wiki/Block_Wiedemann_algorithm">optimizations</a> 
that might be useful in our sparse setting.</p>

<p>So let’s focus on doing the search in parallel.</p>

<p>At first glance, this seems easy enough.
Each thread takes responsibility for a single $a^2 - k N$.
The thread does a series of trial divisions (by 2, 3, 5, 7, …),
until it reaches the bound $B$ – at which point it either accepts or rejects.</p>

<p>But it turns out this simple approach introduces a whole bunch of duplicated computation!
To see why, think about a single prime factor, like 2027.
Your algorithm ends up testing every candidate value individually for divisibility by 2027.
A much faster approach is to “sieve”:
it turns out you can do just a single divisibility test and
mark off all the $a$ that make $a^2 - k N$ a multiple of 2027.
(As usual, <a href="https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes">Wikipedia</a> has a terrific animation.
And notice that this particular optimization predates the invention of the GPU by quite some years.)</p>

<p>Now imagine you’re searching, say, one million numbers.
You can either do one million trial divisions (in parallel, one number per thread),
or you can assign a single thread responsibility for finding all multiples of 2027
(parallelizing the work one prime per thread).
If you use the sieving approach, instead of one million trial divisions,
you do just one trial division, and then you “mark” about $1000000/2027 \approx 500$ 
values.  Should be an easy win!</p>
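<p>Here is a CPU-side toy version of that sieving idea (my own illustration, with invented numbers; a real sieve would compute the square roots of \(kN\) modulo each prime with Tonelli-Shanks rather than by brute force):</p>

```python
# Mark every x in [0, width) for which (a0 + x)^2 - kN is divisible by p,
# using one modular computation per square root instead of `width` trial
# divisions.  The constants are made up for this illustration; kN is
# chosen to be 1 mod p so that square roots of kN mod p exist.
from math import isqrt

p = 2027
kN = 12344431          # stand-in for k*N; 12344431 = 6090 * 2027 + 1
a0 = isqrt(kN)         # the interval of candidates is a = a0 + x
width = 1_000_000

# square roots of kN mod p, found by brute force for simplicity
roots = [r for r in range(p) if (r * r - kN) % p == 0]

marked = [False] * width
for r in roots:
    start = (r - a0) % p            # first x with a0 + x ≡ r (mod p)
    for x in range(start, width, p):
        marked[x] = True            # (a0 + x)^2 - kN is a multiple of p

# about width/p ≈ 500 marks per root, each from one cheap stride loop
```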

<p>But…</p>

<p>Computation is fast; memory is slow.
We’ve just gone from a pure-compute regime where each thread can
(more or less) do the full computation using only its own registers
to a random-access regime where each thread is making its own individual, unpredictable
writes to global memory.</p>

<p>On a GPU (at least most Nvidia GPUs are like this), threads are organized in “warps” of 32.
All the threads in a warp execute the same instruction in lockstep,
on the same streaming multiprocessor, with access to the same low-latency shared memory.
(Actually, I’m simplifying things somewhat.
Threads are organized into “warps” of 32 and “blocks” of a larger size –
the programmer has some control over block size but 1024 is typical.
Threads in a warp execute at the same time; threads in a block share memory.)</p>

<p>Anyway, imagine we parallelize the work “one prime per thread”.
So our imaginary 2027 thread is sharing a warp (and memory etc.) with other threads responsible for different primes:
2029, 2039, 2053, and so forth.
At each iteration of the loop, each thread will compute the next multiple of its own prime,
and then issue a write request to… somewhere in this global array of 1 million items.
The write address won’t be cached, because who could have predicted it?</p>

<p>Even worse, these 32 writes will be spread widely across memory.
Writes to DRAM on a GPU come in 128-byte transactions, 
and if all 32 threads write to the same 128-byte block of memory, 
the writes can be “coalesced” into a single transaction.
But if each thread writes to multiples of a different prime, 
the chip will be forced to handle 32 different transactions of 128 bytes each.</p>
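
<p>To make the cost concrete, here is a toy Python model (my own illustration, not anything from the CUDA toolkit) that counts how many 128-byte transactions one warp’s writes would require. The primes and multiples are hypothetical.</p>

```python
# Toy model of write coalescing: count the distinct 128-byte segments
# touched by one warp's writes.  Assumes 4-byte (float/int) elements.

def transactions(byte_addresses, seg_size=128):
    """Number of 128-byte memory transactions needed to serve these writes."""
    return len({addr // seg_size for addr in byte_addresses})

# Coalesced: 32 threads write consecutive 4-byte words.
coalesced = [4 * i for i in range(32)]

# Scattered: each thread marks some multiple of its own prime
# (a few hypothetical primes; a real warp would have 32 of them).
primes = [2027, 2029, 2039, 2053]
scattered = [4 * (i + 1) * p for i, p in enumerate(primes)]

print(transactions(coalesced))   # -> 1: one transaction serves the whole warp
print(transactions(scattered))   # -> 4: one transaction per thread
```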

<p>And finally, to protect against race conditions (what if two different prime threads
try to write to the same spot at the same time?)
we’ll need to use atomic operations – introducing further inefficiency.</p>

<p>Empirically, switching from “one $x$ per thread” trial division to “one prime per thread” sieving
leads to a substantial (5-10x) slowdown in the search phase.</p>

<h3 id="a-more-efficient-solution">A more efficient solution</h3>

<p>A hybrid approach, a sort of “data-local sieving,” turns out to give the best of both worlds.</p>

<p>A block of threads collaboratively loads a number of $x$ values (1024 values works well) 
into high-bandwidth shared memory.
Once the values have been loaded, the threads will run the sieving procedure on these 1024 values.</p>

<figure class="highlight"><pre><code class="language-cpp" data-lang="cpp"><span class="c1">// load 1024 x values into shared memory</span>

<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">p_idx</span> <span class="o">=</span> <span class="n">threadIdx</span><span class="p">.</span><span class="n">x</span><span class="p">;</span> <span class="n">p_idx</span> <span class="o">&lt;</span> <span class="n">n_primes</span><span class="p">;</span> <span class="n">p_idx</span> <span class="o">+=</span> <span class="n">blockDim</span><span class="p">.</span><span class="n">x</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">p</span> <span class="o">=</span> <span class="n">primes</span><span class="p">[</span><span class="n">p_idx</span><span class="p">];</span>
    <span class="c1">// x_start = first multiple of p in this interval</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x_start</span><span class="p">;</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">x_end</span><span class="p">;</span> <span class="n">x</span> <span class="o">+=</span> <span class="n">p</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// label x as a multiple of p</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// write those 1024 x values back into global memory</span></code></pre></figure>

<p>As far as memory is concerned, this approach requires only one DRAM read and one write per $x$, which is clearly the best possible.</p>

<p>As for computation, we keep most of the benefits of sieving.
Remember: the total computation per prime $p$ is <code class="language-plaintext highlighter-rouge">num_x</code> with the naive algorithm,
versus $num_x / p$ with sieving – so sieving gives a factor of $p$ savings.
This algorithm processes 1024 values of $x$ at a cost of
\[ 1 + 1024 / p: \]
the initial calculation of <code class="language-plaintext highlighter-rouge">x_start</code> is required once, 
but the inner loop strides through the $x$ values in steps of $p$.
The total cost for all $num_x$ values is
\[ num_x \left ( \frac{1}{1024} + \frac{1}{p} \right ). \]
In words:</p>
<ul>
  <li>When $p &lt; 1024$, the algorithm is almost as compute-efficient as sieving.</li>
  <li>When $p &gt; 1024$, the algorithm is roughly 1024 times faster than the naive algorithm.
That’s pretty good.</li>
</ul>

<p>One more note: From experiments, it turns out to be most efficient
to run the naive algorithm for very small primes ($p &lt; 32$),
and then switch to this hybrid memory-local sieve for $p &gt; 32$.</p>
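
<p>For reference, here is a CPU sketch of the data-local sieve in plain single-threaded Python. The block size and the boolean-list representation are illustrative; the GPU version would distribute the loop over primes across the threads of a block.</p>

```python
# Data-local sieve, CPU sketch: process x values one block at a time,
# so each block of results is read and written once.

def sieve_block(lo, hi, primes, marked):
    """Mark every x in [lo, hi) divisible by one of the given primes."""
    for p in primes:
        # the "x_start" computation: first multiple of p at or after lo
        x_start = ((lo + p - 1) // p) * p
        for x in range(x_start, hi, p):   # stride through the block in steps of p
            marked[x - lo] = True

def sieve(num_x, primes, block_size=1024):
    out = []
    for lo in range(0, num_x, block_size):
        hi = min(lo + block_size, num_x)
        marked = [False] * (hi - lo)      # on a GPU this lives in shared memory
        sieve_block(lo, hi, primes, marked)
        out.extend(marked)
    return out
```

<p>The cost structure matches the analysis above: one <code>x_start</code> division per prime per block, then strided marking in steps of $p$.</p>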

<h2 id="whats-next">What’s next?</h2>

<p>We’ve done some pretty good optimizations on the search phase.
Next I want to see what we can do about:</p>
<ul>
  <li>the solve phase, and</li>
  <li>the preprocessing phase.</li>
</ul>

<p>The solve phase involves row-reducing a binary matrix.
It’s not the most natural candidate for parallelization –
too much memory interaction between rows – 
but I’m doing some experiments to see if I can get speedup on a GPU.</p>

<p>The preprocessing phase is also surprisingly time-consuming
(~20 sec for a 192-bit $N$).
Here again there is room to optimize by factoring out some
parallelization-friendly parts for execution on GPU.</p>]]></content><author><name></name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[The repo.]]></summary></entry><entry><title type="html">Understanding the Adam optimizer</title><link href="/adam/" rel="alternate" type="text/html" title="Understanding the Adam optimizer" /><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>/adam</id><content type="html" xml:base="/adam/"><![CDATA[<p><a href="https://github.com/brian-lawrence-math/mnist">The repo.</a></p>

<p>In this post I’ll offer my own somewhat contrarian explanation
of why the Adam optimizer works.</p>

<p>Then I’ll demonstrate my explanation with some <a href="https://github.com/brian-lawrence-math/mnist">experiments</a> on
a simple proof-of-concept optimizer I made up, called GradSign.</p>

<h1 id="the-adam-optimizer">The Adam optimizer</h1>

<p><a href="https://arxiv.org/pdf/1412.6980">Adam</a> is a widely-used optimization algorithm 
that tends to perform very well on deep learning tasks.</p>

<p>Adam is often explained as an extension of stochastic gradient descent (SGD):
sample one batch, compute the loss and its gradient,
and smooth the result out by taking an exponential moving average of the gradient.
Then there’s a step that the standard explanations sort of glide over –
something about a second moment (i.e. a variance estimate) for the gradient,
something about “adaptive choice of the learning rate” –
and then you take your step… and magically end up with a good optimizer.</p>

<p>(A quick Google search turns up plenty of explanations along these lines:
for example, <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam">here</a>, <a href="https://medium.com/@weidagang/demystifying-the-adam-optimizer-in-machine-learning-4401d162cb9e">here</a>, <a href="https://optimization.cbe.cornell.edu/index.php?title=Adam">here</a>…)</p>

<p>I’d like to offer a different take on Adam – less calculus, more statistics.
My take will suggest a different toy model of an optimizer:
not stochastic gradient descent, but an algorithm that only looks at the sign
(not the magnitude) of each gradient.
I call my optimizer GradSign.
I’ll test it out on a simple machine learning task: building an MNIST classifier.</p>

<p>Spoiler alert: No, GradSign doesn’t outperform Adam.
In my small experiment, my optimizer performs comparably to Adam,
but SGD (surprisingly) outperforms them both.</p>

<p>I hope reading this inspires you to tinker and explore.</p>

<p>This post has two parts:</p>
<ul>
  <li>My reinterpretation of Adam</li>
  <li>An explanation of the new optimizer</li>
</ul>

<p>You can see experimental results with the new optimizer on <a href="https://github.com/brian-lawrence-math/mnist">Github</a>.</p>

<h2 id="adam-reexplained">Adam reexplained</h2>

<p>The Adam optimizer works by keeping first- and second-moment estimates
(exponential moving averages) of the gradient of each parameter;
at each step, those estimates are used to determine the change to that parameter.</p>

<p>We will consider a single parameter, one of millions or billions in a large model:
Adam works on each parameter independently, so we don’t need to worry about anything else.</p>

<p>I will follow the notation from Algorithm 1 of the <a href="https://arxiv.org/pdf/1412.6980">original paper</a>.</p>

<p>The algorithm depends on three parameters, with the following suggested values:
\[  \alpha = 0.001  \]
\[  \beta_1 = 0.9  \]
\[  \beta_2 = 0.999.  \]
Here $\alpha$ is the learning rate.
(We will see that, even though $\alpha$ is called a “learning rate”,
it does not have the same units as the learning rate in SGD.)
The parameters $\beta_1$ and $\beta_2$ determine the timescales for 
the two exponential moving averages.
(With the suggested values, the first moment is averaged with a decay time of 10 iteration steps,
and the second with a decay time of 1000.)</p>

<p>Let $g_t$ be the gradient of our favorite parameter at timestep $t$.
(Remember, the gradient at each timestep depends on both the parameter values –
which are updated at each step – and the random choice of a fresh batch of data.)
The exponential moving average of the first moment (mean, $m$) and second moment (uncentered variance, $v$)
are computed recursively:
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2. \]
Finally, the parameter update is computed as
\[ - \alpha \frac{m_t}{\sqrt{v_t} + \epsilon}. \]</p>
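
<p>In code, one step of this update for a single scalar parameter looks like the following sketch (following Algorithm 1 of the paper, including the bias-correction factors, which tend to 1 after the first few steps):</p>

```python
# One Adam step for a single scalar parameter, following Algorithm 1
# of Kingma & Ba (suggested defaults: alpha=0.001, beta1=0.9, beta2=0.999).

def adam_step(m, v, g, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1);
    v_hat = v / (1 - beta2 ** t)           # these factors tend to 1 for large t
    update = -alpha * m_hat / (v_hat ** 0.5 + eps)
    return m, v, update

# With a perfectly consistent gradient, the update approaches -alpha:
m = v = 0.0
for t in range(1, 101):
    m, v, update = adam_step(m, v, 0.5, t)
```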

<p>Let us unroll the recurrences.
(We will also assume, as a mathematical convenience, that the gradients go infinitely far back in time.
In reality one has to worry about how to initialize $m_t$ and $v_t$,
but after the first few optimizer steps, our assumption will be a reasonable one.)
\[ m_t = (1 - \beta_1) \sum_{i = 0}^{\infty} \beta_1^i g_{t - i} \]
\[ v_t = (1 - \beta_2) \sum_{i = 0}^{\infty} \beta_2^i g_{t - i}^2 \]
These are exponential moving averages – weighted averages of past values of $g_t$ (or its square),
the relevance of past values decaying over time with a characteristic time scale
(a physicist might say “half-life”) depending on the $\beta$’s.</p>
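
<p>A quick numerical sanity check that unrolling the recurrence really gives the geometric-sum form (here with a finite history and zero initialization, so the sum is truncated rather than infinite):</p>

```python
# Verify that running the recurrence m_t = beta*m_{t-1} + (1-beta)*g_t
# matches the explicit geometric-sum form.
import random

random.seed(0)
beta = 0.9
gs = [random.random() for _ in range(200)]

m_rec = 0.0
for g in gs:
    m_rec = beta * m_rec + (1 - beta) * g

# Unrolled: the most recent gradient gets weight (1-beta),
# the one before it (1-beta)*beta, and so on back in time.
m_sum = (1 - beta) * sum(beta ** i * g for i, g in enumerate(reversed(gs)))

assert abs(m_rec - m_sum) < 1e-9
```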

<p>I will make three simplifications to make the analysis easier.
First, I will ignore the $\epsilon \approx 10^{-8}$ that is thrown into the denominator of
\[ \text{update} = - \alpha \frac{m_t}{\sqrt{v_t} + \epsilon} \]
for numerical stability reasons.
The $\epsilon$ is just there to make sure the algorithm does something sensible when the gradient vanishes.</p>

<p>Second, I will replace the exponential weighted average (with parameters $\beta$)
with a simple unweighted average (“moving window”) over the past $n$ terms (where $n$ is the window size).
This would be a computational disaster to implement –
we would have to store $n$ values per moment per parameter, rather than just $1$ –
but it will help us build a clean mental model.</p>

<p>Finally, I will assume $\beta_1 = \beta_2$, or in other words that the two moving averages
are computed over the same size of window $n_1 = n_2$.
Unlike the first two simplifications, assuming $\beta_1 = \beta_2$ really does change the behavior of the optimizer in a meaningful way.
I’ll come back to this later, because I want to understand this simplified model first.</p>

<p>So, let’s say $\beta_1 = \beta_2 = 0.99$, so now $m_t$ and $v_t$ are averages
over the past 100 iterations:
\[ m_t = \frac{\sum_{i=0}^{99} g_{t-i}}{100}  \]
\[ v_t = \frac{\sum_{i=0}^{99} g_{t-i}^2}{100}  \]
and
\[ \text{update} = - \alpha \frac{ \sum g_{t-i} / 100 } { \sqrt{ \sum g_{t-i}^2 / 100 } }. \]
But now that fraction is something we can understand:
it is nothing more than the <em>cosine similarity</em> between the vectors
\[ \hat{g} = (g_{t-99}, g_{t-98}, \ldots, g_{t-1}, g_t) \]
and 
\[ \hat{1} = (1, 1, \ldots, 1, 1)!  \]</p>

<p>In other words, if we call $\theta$ the angle between those two vectors 
(my apologies, this is not the same as the $\theta$ in the paper),
then the update to our parameter is simply
\[ \text{update} = - \alpha \cos \theta.  \]</p>
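
<p>This equivalence is easy to check numerically. The snippet below (a sketch with made-up gradient values) computes the windowed update and the cosine similarity directly, and also checks the scale invariance discussed next:</p>

```python
# Check: with an unweighted window, -alpha * m / sqrt(v) equals
# -alpha * cos(theta), where theta is the angle between the gradient
# history (g_1, ..., g_n) and the all-ones vector.
import math
import random

def window_update(gs, alpha=0.001):
    n = len(gs)
    m = sum(gs) / n                      # windowed first moment
    v = sum(g * g for g in gs) / n       # windowed second moment
    return -alpha * m / math.sqrt(v)

def cos_with_ones(gs):
    n = len(gs)
    norm = math.sqrt(sum(g * g for g in gs))
    return sum(gs) / (norm * math.sqrt(n))

random.seed(0)
gs = [random.gauss(0.2, 1.0) for _ in range(100)]

assert abs(window_update(gs) - (-0.001) * cos_with_ones(gs)) < 1e-9
# Invariant under rescaling all the gradients:
assert abs(window_update(gs) - window_update([10 * g for g in gs])) < 1e-9
```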

<p>We immediately see:</p>
<ul>
  <li>$\alpha$ is the largest possible update to our parameter, and</li>
  <li>the size of the update is determined by how close the different $g_t$’s are to each other, rather than the size of $g_t$.
 (The Adam paper calls this a “signal-to-noise ratio”.)</li>
</ul>

<p>In fact, the cosine similarity is invariant under scaling the $g_t$’s.
(Contrast this to classical gradient descent, 
where the step size is a product of learning rate and gradient, 
and you have to do lots of extra work to make sure gradients in different parts of the network
have the same scale.)</p>

<h2 id="thinking-statistically">Thinking statistically</h2>

<p>When we update our parameter (we’re just focusing on one parameter, remember? the rest will come along for the ride)
our goal is to make the loss decrease.
Calculus teaches us that the gradient (a first derivative) determines whether the loss will go up or down,
but let me reframe the question statistically:
<em>How confident are we that making this change will decrease the loss?</em></p>

<p>In the statistical framing, there are two sources of noise we need to worry about:</p>
<ul>
  <li>Sample randomness: each gradient is computed from only a small batch of data.</li>
  <li>The gradient landscape: the slope might be negative now, but if our step size is too large we may overshoot and end up climbing back uphill.</li>
</ul>

<p>Looking at the past $n$ steps can protect us against both types of noise!
We’re trying to evaluate a statistical hypothesis, like
“decreasing this parameter will result in a lower loss against the next randomly chosen batch”.
Clearly, a natural statistic is “on how many of the last $n$ batches was this gradient positive?”
If the gradient has consistent positive values across batches, we can expect the gradient to have 
a positive value on the next batch as well.</p>

<p>Similarly, we want to know if the loss landscape is bumpy or smooth.
We can think of the past $n$ parameter updates as a sort of random walk across this landscape
(not uniformly random of course, but governed by a complicated stochastic process).
We’re about to take another step in this random walk.
If the gradient has been consistent in the past,
we can have more confidence that our gradient will remain positive through the full length of our next step.</p>

<p>So, instead of thinking of an optimizer step as a gradient update,
I think of it as a statistical confidence test:
how confident are we that this step will decrease the loss
on the next (yet unseen) batch,
with the gradient keeping its sign at the current parameter position, the updated position, and everywhere in between?</p>

<p>If this is what’s going on with Adam, then maybe the size of the gradient doesn’t matter at all.
Maybe all that matters is: 
<em>At how many of the past $n$ timesteps was the gradient positive,
and at how many was it negative?</em></p>

<p>We’ll turn this idea into a new optimizer algorithm soon,
but first I want to wrap up a couple of loose ends.</p>

<h2 id="the-role-of-beta_1-and-beta_2">The role of $\beta_1$ and $\beta_2$</h2>

<p>Earlier on, we made the simplifying assumption that $\beta_1 = \beta_2$.
I told you that we were simplifying away something important, but I didn’t tell you what.
Now it’s time to fix that.</p>

<p>In the real world, some good values for $\beta_1$ and $\beta_2$ are 
\[  \beta_1 = 0.9  \]
\[  \beta_2 = 0.999.  \]
In other words,
the first moment estimate (the numerator) averages the gradient over the last 
10 or so timesteps,
while the second moment (the denominator) averages its square over the last 1000.</p>

<p>Normally, you might think, this won’t make a big difference.
But it makes a big difference when the gradient is <em>sparse</em>.</p>

<p>Imagine a parameter that is usually unimportant: its gradient is close to zero.
But every so often, the parameter becomes very important, and its gradient gets big.
(You might imagine that in a large, complex LLM, this one parameter is responsible for
learning one particular thing – and that one thing only rarely shows up in the training data.)</p>

<p>The role of $\beta_2$ is to remember that this gradient has a track record of sudden spikes.
When a gradient has spiked in the past, we don’t want to make updates based on small gradients.
But we also don’t want to continue making updates based on a gradient
from many steps back.
The solution is to make $\beta_2$ large (remember the spike and slow down learning for 1000 steps)
but keep $\beta_1$ small (stop making updates 10 steps after the spike).</p>
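
<p>A toy simulation (with made-up numbers) shows this suppression effect. After a single large spike in the gradient, the denominator keeps updates small long after the numerator has decayed:</p>

```python
# One parameter, no bias correction: a single gradient spike followed by a
# small steady gradient.  Because beta2 > beta1, v remembers the spike with
# a ~1000-step timescale, throttling updates driven by the small gradients.

def adam_updates(grads, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    out = []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        out.append(-alpha * m / (v ** 0.5 + eps))
    return out

updates = adam_updates([100.0] + [0.1] * 10000)
# 100 steps after the spike, the numerator has long since settled at ~0.1,
# but the denominator still remembers the spike, so updates stay small.
# Thousands of steps later, the update size recovers toward alpha.
```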

<h2 id="adam-and-memory">Adam and memory</h2>

<p>The Adam optimizer stores two floating-point values (the moment estimates $m_t$ and $v_t$)
per parameter.
While the forward and backward pass can often be computed in 16-bit precision,
the Adam optimizer state requires 32 bits for each of $m_t$ and $v_t$.
(Storing optimizer state in 16-bit precision leads to numerical instability and degrades training performance.)
In a typical training run, memory usage is dominated by per-parameter costs:
4 bytes for the (full-precision) master copy of the parameter, 
2 for the 16-bit downcasted parameter, 2 for the gradient, and 8 for the optimizer state –
so the optimizer is responsible for about half the total memory usage.</p>
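
<p>The arithmetic, spelled out:</p>

```python
# Bytes per parameter in the mixed-precision setup described above.
costs = {
    "fp32 master weight": 4,
    "fp16 weight": 2,
    "fp16 gradient": 2,
    "Adam m_t (fp32)": 4,
    "Adam v_t (fp32)": 4,
}
total = sum(costs.values())
optimizer_share = (costs["Adam m_t (fp32)"] + costs["Adam v_t (fp32)"]) / total
print(total, optimizer_share)  # 16 bytes per parameter; optimizer state is half
```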

<p>Wouldn’t it be great if we could use less?</p>

<p>Adam uses 8 bytes per parameter.  Here is an optimizer that uses just 1.</p>

<h2 id="gradsign">GradSign</h2>

<p>GradSign is a simple proof-of-concept optimizer that:</p>
<ul>
  <li>only uses the sign (+ or -) of the gradient of each parameter, and</li>
  <li>only keeps one byte of optimizer data, per parameter.</li>
</ul>

<p>The update code is as follows:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">step</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">named_params</span><span class="p">:</span>
        <span class="n">new_count</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">grad</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">grad_counts</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">-=</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">grad_counts</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">+</span> <span class="mi">4</span><span class="p">)</span> <span class="o">//</span> <span class="mi">8</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">grad_counts</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">8</span> <span class="o">*</span> <span class="n">new_count</span>

        <span class="n">p</span><span class="p">.</span><span class="n">data</span> <span class="o">-=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lr</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">grad_counts</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="o">/</span> <span class="mf">64.0</span><span class="p">)</span></code></pre></figure>

<p>The dictionary <code class="language-plaintext highlighter-rouge">self.grad_counts</code> stores, for each parameter,
a quantized exponential moving average of the sign of the gradient
over the past several timesteps.
The moving average has a characteristic timescale of 8 timesteps
(in other words, $\beta = 0.875$).
The quantized moving average is scaled to be between $-64$ and $64$, 
so that it fits within a signed 8-bit integer.</p>

<p>I chose these parameters to control quantization error in the decay term</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>self.grad_counts[n] -= (self.grad_counts[n] + 4) // 8.
</code></pre></div></div>
<p>The decay term will be nonzero as soon as the value <code class="language-plaintext highlighter-rouge">self.grad_counts</code> exceeds $\pm 4$.
Each update to the average causes the count to change by $\pm 8$, 
so the <code class="language-plaintext highlighter-rouge">grad_counts</code> parameter cannot get stuck in the no-decay region.</p>
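
<p>A pure-integer simulation of the counter (one scalar parameter, with hypothetical sign sequences) confirms that the count stays inside the signed 8-bit range and saturates under a perfectly consistent gradient:</p>

```python
# Scalar sketch of the GradSign counter update from the optimizer above.
import random

def gradsign_counter(signs):
    """Trace the quantized moving average for a sequence of gradient signs (+1/-1)."""
    c = 0
    trace = []
    for s in signs:
        c -= (c + 4) // 8   # decay toward zero (floor division, as in the optimizer)
        c += 8 * s          # push toward the current sign
        trace.append(c)
    return trace

random.seed(1)
# A consistently positive gradient saturates the counter;
# random signs keep it well inside the signed 8-bit range.
saturated = gradsign_counter([1] * 100)
noisy = gradsign_counter([random.choice([-1, 1]) for _ in range(10000)])
```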

<h2 id="experiment">Experiment</h2>

<p>The experiment code and detailed results are posted on <a href="https://github.com/brian-lawrence-math/mnist">Github</a>.</p>

<p>In summary, I run GradSign on 1000 batches of 32 samples each,
which amounts to a single pass through just over half of the dataset.
The resulting model achieves over 98% accuracy.</p>