Deadly Boring Math
The Mostest Bestestest FOSS Math Journal on the Internet[citation needed]
By Tyler Clarke in Artificial Intelligence on 2025-4-8
Editor's note: if you're reading this to get code snippets, don't bother. While writing code is obviously useful and worth covering, there's already plenty of introductory material on popular AI frameworks, and not a whole lot covering the math that makes them work in an accessible manner. Deadly Boring Math is and will remain focused solely on theoretical mathematics; if I ever write a piece on programming simple AI (which would be cool; I might), it'll go live on my personal blog.
Hello, everyone! Today's a bit of a change of pace: rather than our usual advanced calculus material, we're covering... Artificial intelligence! As this is a math blog, and not an AI blog, we're not going to cover prompt engineering or anything like that. Instead, we're looking at two critical concepts: Matrix multiplication as it relates to dense layers, and multivariable derivatives as they relate to gradient descent.
Let's start with matrix multiplication. I'm assuming you already have some idea of what matrices and vectors are; if you can't construct and analyze matrices and vectors and take dot products, you should probably read about those first. Here's the groundbreaking idea: a dense layer is just a matrix with `a` columns and `b` rows, where `a` is the length of the input vector and `b` is the length of the output vector. To see why this is a useful way to look at dense layers, consider that the inputs of a given neural network are a list of numbers with a known length, and the outputs are also a list of numbers with a known length. The matrix defines how every value in the inputs affects every value in the outputs. Essentially, then, matrix multiplication is a repeated dot product: you take `"input" cdot "row"` for every row in the matrix, and each result is an item in the output vector.
For example, if we have a matrix `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, and an input vector `[0, 1, 0]`, our output vector is `[[1, 2, 3] cdot [0, 1, 0], [4, 5, 6] cdot [0, 1, 0], [7, 8, 9] cdot [0, 1, 0]] = [2, 5, 8]`. That's a dense layer!
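(Despite the editor's note, here's a five-line numeric sanity check of that arithmetic - not framework code, just the repeated-dot-product idea in plain Python. The function names are mine, not from any library.)

```python
# Matrix-vector multiplication as a repeated dot product: each output
# entry is the dot product of one matrix row with the input vector.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply_layer(matrix, vec):
    # One dot product per row: this is all a dense layer does.
    return [dot(row, vec) for row in matrix]

layer = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(apply_layer(layer, [0, 1, 0]))  # [2, 5, 8]
```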
In practice, inputs and outputs will oftentimes have nested dimensions - meaning each value is a matrix or a vector, rather than a simple number. However, the general formulation doesn't change; you just have to do it more times. It's best to leave that to the computer, considering some dense layers will have hundreds or thousands of values (if not millions or billions).
This is very abstract, but it becomes very useful when we give the inputs and outputs meaning. For instance, if we want to do animal classification, we could have an input layer structure like `["fuzziness", "legs", "wings"]` and an output layer structure like `["dogfulness", "batfulness", "duckfulness"]`. We want to compute how dog-ey, bat-ey, and duck-ey an animal is given those properties of the animal: we do this with a dense layer matrix. There are many possible ways to architect this, but let's just say the algorithm picks which animal we're looking at based on which value in the outputs is largest: so `[15, 12, 9]` is a dog, `[-4, -128, 7]` is a duck, etc. We could perform the transformation with a layer like this: `[[2, 1, -5], [3, -10, 3], [-5, 3, 3]]`. If we have an animal that's not very fuzzy, has 0 legs, and has two wings, we can embed that animal as `[1, 0, 2]`, and then multiply it by the layer matrix to get an output of `[-8, 9, 1]`. Pretty undogful, somewhat duckful, but very batful. It's a bat!
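The same numeric check works for the bat example (the `classify` helper and the label names are just illustrative):

```python
# Multiply the property vector [fuzziness, legs, wings] by the layer
# matrix, then pick the label of the largest output value.

def classify(layer, properties, labels):
    outputs = [sum(w * p for w, p in zip(row, properties)) for row in layer]
    return outputs, labels[outputs.index(max(outputs))]

layer = [[2, 1, -5], [3, -10, 3], [-5, 3, 3]]
labels = ["dog", "bat", "duck"]
outputs, animal = classify(layer, [1, 0, 2], labels)
print(outputs, animal)  # [-8, 9, 1] bat
```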
In a real model, of course, there would be hundreds of parameters, and they would not have assigned context: the training process would produce that context based on raw data. That segues into training. We need some way to generate that dense layer based on data we already know. Enter gradient descent!
Gradient descent requires some visualization to understand. There's no easy way to visualize it in general without some background in multivariable calculus, but our example is simple enough that it should provide a decent look at the mechanics. Imagine the animal properties as a coordinate system. Every imaginable animal is a point in space. Because we only have three properties, every imaginable animal is a point in 3d space, so if you have a geometry background this should be easy enough to visualize. Because this model inputs a 3d vector and outputs a 3d vector, we have two spaces: property-space and animal-space. The entire AI here is just a function `F(x, y, z)` that takes a point in property-space and outputs a point in animal-space. We want to give this AI model a couple of labeled pairs - points in property-space that reference a known location in animal-space. For instance, the Absolute Duck label could be the labeled pair mapping, in the form input -> output, `[0, 2, 2] -> [0, 0, 1]`. A not-at-all-fuzzy animal with 2 legs and 2 wings is 100% duckful and 0% any other animal. It's easy to imagine a matrix that contains this mapping, but we want a bunch more mappings: one per animal (in practice we'd want dozens if not hundreds, showing lots of different possibilities for animals. For instance, is a 3-legged fuzzy thing with one wing closer to a bat, or a dog?). Let's just add two mappings, for simplicity: `[5, 4, 0] -> [1, 0, 0]`. A pretty-fuzzy animal with 4 legs is absolutely dogful. `[3, 0, 2] -> [0, 1, 0]`. A somewhat-fuzzy animal with 2 wings is absolutely batful.
While it's possible to use a number of techniques to approximate these pairs in a matrix, the most efficient and scalable one by far is gradient descent. The core idea of gradient descent is to efficiently optimize an error function: say we have a function `e(P, C)` that takes a Prediction and a Correct value and returns some form of distance between them (it should return 0 when they're equal); the goal is to get `e(P, C)` as low as possible for every possible pair of values. Note that Prediction is the result of running the model on a point in property-space, and C is the target result in animal-space. A naive optimizer is evolutionary training: randomly perturb the parameters, keep the perturbation if it lowers the error, and repeat.
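As a sketch, evolutionary training might look something like this (plain Python; the step count, perturbation scale, and names are illustrative assumptions, not a real framework's API):

```python
import random

# Evolutionary training: randomly perturb the parameters and keep the
# perturbation only when it lowers the error.

def error(phi, point, target):
    prediction = sum(w * x for w, x in zip(phi, point))
    return (prediction - target) ** 2

def evolve(phi, point, target, steps=1000, scale=0.1):
    best = error(phi, point, target)
    for _ in range(steps):
        candidate = [w + random.uniform(-scale, scale) for w in phi]
        e = error(candidate, point, target)
        if e < best:
            phi, best = candidate, e
    return phi

# Tune the dogfulness parameters toward the mapping [5, 4, 0] -> 1.
phi = evolve([1.0, 1.0, 1.0], [5, 4, 0], 1)
print(error(phi, [5, 4, 0], 1))  # much smaller than the starting error of 64
```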
This actually works pretty well for simple models, but there's an obvious problem as models get more complicated: randomly guessing the right answer is much harder when there are 100 parameters. Evolutionary training is obviously not going to work for a serious model. Gradient descent is much more complicated, but pays off much better. It's not terribly easy to visualize this part, but imagine that the `e`-function is a scalar field - a value at every point in space, rather like the barometric pressure, or "existence of bunnies". If `"bunnies"(x, y, z)` reports the number of bunnies at a given point in space, so `"bunnies"("your_location")` is the number of bunnies at your location (spoiler alert: the result is always "not enough"), then `e(P, C)` is the amount of error at a given point in space. Less cute than bunnies, but absolutely vital.
An interesting property of multivariable functions is the gradient. This is a vector of partial derivatives: `grad f = [frac {del f} {del x}, frac {del f} {del y}, frac {del f} {del z}]`. That won't make much sense unless you have at least some calculus knowledge (if you've only taken calculus 1: the trick is to hold all the variables that aren't being differentiated constant, so `frac {del} {del x} xy = y`). Because the gradient is a vector, we can make some useful observations: for instance, moving the inputs in the direction of the gradient vector (read: literally just adding the gradient vector to the inputs) increases the value of the function, and moving against it decreases the value. Hence, for a given learning rate `lambda` and an error function of a set of parameters `e(phi)`, gradient descent simply means that `phi_{n+1} = phi_n - lambda grad e(phi_n)`: in plain English, the next value of the parameters we're tuning is the old value minus the gradient at the old value, scaled by the learning rate. Yikes.
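To make that update rule concrete, here's a minimal sketch that estimates the gradient with finite differences and applies `phi_{n+1} = phi_n - lambda grad e(phi_n)` in a loop. All the names, the step count, and the learning rate are illustrative assumptions:

```python
# Gradient descent with a numerically estimated gradient: nudge each
# parameter slightly, measure how the error changes, and step downhill.

def numerical_gradient(f, phi, h=1e-6):
    grad = []
    for i in range(len(phi)):
        bumped = list(phi)
        bumped[i] += h
        grad.append((f(bumped) - f(phi)) / h)
    return grad

def descend(f, phi, lam=0.01, steps=100):
    for _ in range(steps):
        grad = numerical_gradient(f, phi)
        phi = [w - lam * g for w, g in zip(phi, grad)]
    return phi

# e(phi) = (5a + 4b - 1)^2, the dogfulness error from this post's example
e = lambda phi: (5 * phi[0] + 4 * phi[1] - 1) ** 2
phi = descend(e, [1.0, 1.0, 1.0])
print(e(phi))  # very close to 0
```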
Surprised to see `lambda` there? That's our training coefficient. Essentially, it's how far we're going to step.
How can this be used to tune our model? Consider our layer as not a matrix but as three row vectors that represent the parameters of each animal type. If `phi` is our parameters for dogfulness, `phi = [a, b, c]`, and `[x, y, z]` is our point in property-space, we can say that the error function is `e(phi) = ((phi cdot [x, y, z]) - d)^2`. This works out to `e = (ax + by + cz - d)^2`. Plug in the values for dogfulness and our point in property-space, and you get `e = (5a + 4b - 1)^2`. Differentiation gives us `grad e = [50a + 40b - 10, 40a + 32b - 8, 0]`.
Let's assume that `[a, b, c]` is initialized to `[1, 1, 1]`. The gradient at that point is `[80, 64, 0]`. Picking `lambda = frac 1 102.5` (approximately 1 divided by the gradient's magnitude), the update rule `phi_{n+1} = phi_n - lambda grad e(phi_n)` gives us `phi_2 = [0.22, 0.38, 1]` (note that `c` doesn't move, because its gradient component is 0). Is this closer to correct? With our original weights, the error value was `e = 64`, but now it's `e approx 2.56`. That's... definitely an improvement!
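You can verify that single step numerically in a few lines of plain Python:

```python
# One gradient descent step, by hand: phi = [1, 1, 1], gradient [80, 64, 0],
# lambda ~ 1/102.5 (one over the gradient's magnitude).

e = lambda a, b, c: (5 * a + 4 * b - 1) ** 2

lam = 1 / 102.5
grad = [80, 64, 0]
phi = [w - lam * g for w, g in zip([1, 1, 1], grad)]
print([round(w, 2) for w in phi])  # [0.22, 0.38, 1.0]
print(e(*phi))                     # about 2.6, down from 64
```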
These same principles can be applied to tune the parameters for ducks and bats. Repeated over a few steps, eventually this would reach a matrix that almost perfectly maps our training data! That's the fundamental principle of gradient descent: find a direction that it makes sense to walk in, go there, find a new direction, repeat. Gradient descent performs remarkably well even on complex problems, because it can quickly choose a better path. Different error functions and values of `lambda` can also make a difference. Naturally, this isn't the whole story: there's a lot more to AI than dense layers and gradient descent. You've probably heard of LSTM layers and the Adam optimizer - both of these use many of the same concepts as our simple dogfulness example here, but applied in a far more nuanced way.