
Three ways to think about matrix multiplication

Your mental model for matrix multiplication may be well established, but did you know there are actually 3 main ways to think about matrix multiplication? While everyone might have a favorite one, playing with each view might help you understand problems in new ways.

1. The scalar product view

This is probably the most common one, since it is at the heart of the definition of matrix multiplication.

Here, each element $(i,j)$ of the output matrix is thought of as the scalar product between row $i$ of the left matrix and column $j$ of the right matrix.

$$ \begin{bmatrix} \color{red}{0} & \color{red}{1} & \color{red}{2} \cr 3 & 4 & 5 \cr \color{lime}{6} & \color{lime}{7} & \color{lime}{8} \end{bmatrix} \begin{bmatrix} \color{red}{9} & 10 & \color{lime}{11} \cr \color{red}{12} & 13 & \color{lime}{14} \cr \color{red}{15} & 16 & \color{lime}{17} \end{bmatrix} = \begin{bmatrix} \color{red}{42} & 45 & 48 \cr 150 & 162 & 174 \cr 258 & 279 & \color{lime}{300} \end{bmatrix} $$

I learnt it this way first, but quickly moved on to alternative views, my favorites.
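
As a quick sanity check, here is a minimal NumPy sketch of this view (the function name is just for illustration), building each output element as the dot product of a left row with a right column:

```python
import numpy as np

def matmul_scalar_view(A, B):
    # Element (i, j) of the output is the scalar product of
    # row i of A with column j of B.
    out = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            out[i, j] = A[i, :] @ B[:, j]
    return out

A = np.arange(9).reshape(3, 3)      # the left matrix from the example above
B = np.arange(9, 18).reshape(3, 3)  # the right matrix from the example above
print(matmul_scalar_view(A, B))     # matches A @ B
```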

2. The row view

If you multiply a row vector by a matrix, the row vector defines a linear combination of the rows of the matrix.

$$ \begin{bmatrix} \color{red}{0} & \color{cyan}{1} & \color{lime}{2} \cr 3 & 4 & 5 \cr 6 & 7 & 8 \end{bmatrix} \begin{bmatrix} \color{red}{9} & \color{red}{10} & \color{red}{11} \cr \color{cyan}{12} & \color{cyan}{13} & \color{cyan}{14} \cr \color{lime}{15} & \color{lime}{16} & \color{lime}{17} \end{bmatrix} = \begin{bmatrix} \color{yellow}{42} & \color{yellow}{45} & \color{yellow}{48} \cr 150 & 162 & 174 \cr 258 & 279 & 300 \end{bmatrix} $$

This view, along with the column view (which is similar but acts the other way around), feels more natural when thinking about vector spaces. It also makes operations on rows, such as permutation, substitution or selection, really straightforward.
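
Here is the same product rebuilt row by row, a minimal sketch assuming the same example matrices: each row of the result is a combination of the rows of the right matrix, weighted by the corresponding row of the left matrix.

```python
import numpy as np

def matmul_row_view(A, B):
    # Row i of the result is a linear combination of the rows of B,
    # with coefficients taken from row i of A.
    out = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            out[i, :] += A[i, k] * B[k, :]
    return out

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
print(matmul_row_view(A, B))  # same result as A @ B
```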

3. The column view

Same as the row view, but now acting on the columns. If you multiply a matrix by a column vector, the column vector defines a linear combination of the columns of the matrix.

$$ \begin{bmatrix} \color{red}{0} & \color{cyan}{1} & \color{lime}{2} \cr \color{red}{3} & \color{cyan}{4} & \color{lime}{5} \cr \color{red}{6} & \color{cyan}{7} & \color{lime}{8} \end{bmatrix} \begin{bmatrix} \color{red}{9} & 10 & 11 \cr \color{cyan}{12} & 13 & 14 \cr \color{lime}{15} & 16 & 17 \end{bmatrix} = \begin{bmatrix} \color{yellow}{42} & 45 & 48 \cr \color{yellow}{150} & 162 & 174 \cr \color{yellow}{258} & 279 & 300 \end{bmatrix} $$

Notice that column operations are performed by column vectors on the right, while row operations are performed by row vectors on the left.
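
And the column view as a small sketch: each column of the result is a combination of the columns of the left matrix, weighted by the corresponding column of the right matrix.

```python
import numpy as np

def matmul_column_view(A, B):
    # Column j of the result is a linear combination of the columns of A,
    # with coefficients taken from column j of B.
    out = np.zeros((A.shape[0], B.shape[1]))
    for j in range(B.shape[1]):
        for k in range(B.shape[0]):
            out[:, j] += B[k, j] * A[:, k]
    return out

A = np.arange(9).reshape(3, 3)
B = np.arange(9, 18).reshape(3, 3)
print(matmul_column_view(A, B))  # same result as A @ B
```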

Application examples

Here are 3 examples where each view helps in modeling a problem.

  • Neural networks are best described by matrix multiplications from layer to layer. The output $\mathbf{y}$ of a simple feed-forward layer with weights $\mathbf{W}$, input $\mathbf{x}$, bias $\mathbf{b}$ and non-linearity $g$ is typically written $\mathbf{y} = g(\mathbf{W}\mathbf{x} + \mathbf{b})$. In this setup, the activations $\mathbf{y}$ are often interpreted with the scalar product view, between the input $\mathbf{x}$ and the rows of $\mathbf{W}$. But another way to view this is to consider $\mathbf{y}$ as a linear combination of the columns of $\mathbf{W}$, weighted by the entries of $\mathbf{x}$, then added to the bias $\mathbf{b}$ and passed through the non-linearity $g$. Seeing this also shows how, without $g$, a neural network is equivalent to a linear regression setup, where we optimize the weights over the whole dataset: $\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{\epsilon}$.

  • Transition matrices can be used to model discrete dynamics (like in Kalman filters), where the next state is a linear combination of the previous state. Using the Euler method, we often update the next position $p_{i+1}$ as the previous position plus a fraction of the velocity $v_{i}$. Writing the state $[p_i, v_i]^{T}$ as $\mathbf{x_i}$, the update equation is $\mathbf{x_{i+1}}=\mathbf{F}\mathbf{x_i}$. $$ \begin{bmatrix} p_{i+1} \cr v_{i+1} \end{bmatrix}= \begin{bmatrix} 1 & \delta t\cr 0 & 1 \end{bmatrix} \begin{bmatrix} p_{i} \cr v_{i} \end{bmatrix} $$ Here, the natural way to build $\mathbf{F}$ and read this equation is to describe each state variable update with a row of $\mathbf{F}$, hence using the row view.

  • Linear regression aims at finding the unknown vector $\mathbf{x}$ that satisfies the equation $\mathbf{Ax}=\mathbf{y}$. The very essence of this problem is to find the right combination of the columns of $\mathbf{A}$, so the column view is the natural one to work with (more details on this in my other post here); see the sketch just after this list.
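
To make the last point concrete, here is a minimal least-squares sketch (illustrative names, random data): the fitted vector holds the coefficients of the combination of the columns of $\mathbf{A}$ that lands closest to $\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))                  # 100 observations, 3 columns to combine
x_true = np.array([2.0, -1.0, 0.5])
y = A @ x_true + 0.01 * rng.normal(size=100)   # noisy targets

# Least squares finds the coefficients of the column combination of A
# that best approximates y.
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fit, written explicitly as a linear combination of A's columns.
y_hat = sum(x_hat[j] * A[:, j] for j in range(A.shape[1]))
print(x_hat)                                   # close to x_true
print(np.allclose(y_hat, A @ x_hat))           # True
```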

As you can see, each view can facilitate the mental model for a certain type of problem (at least it works for me), so I hope you will consider learning (or relearning) them!

Finishing meme
