A few months ago, I defended my thesis and earned the title of “doctor.” I’m excited to share the contents of my thesis defense with you in this blog post, where you can get a glimpse of the fascinating research I conducted over the past few years. The post is divided into several parts, with the level of technical detail increasing gradually.

If you’re interested in reading my full thesis, you can find it here. I’ve also made my defense slides available for download here.

My thesis focuses on low-rank tensors, but to understand them, it’s important to first discuss low-rank matrices. You can learn more about low-rank matrices in this blog post. A low-rank matrix is simply the product of two smaller matrices. For example, below we write the matrix \(A\) as the product \(A=XY^\top\).

In this case, the matrix \(X\) is of size \(m\times r\) and the matrix \(Y\) is of size \(n\times r\). This usually means that the product \(A\) is a rank-\(r\) matrix, though it could have lower rank if, for example, one of the columns of \(X\) or \(Y\) is zero.

While many matrices encountered in real-world applications are not low-rank, they can often be well approximated by low-rank matrices. Images, for example, can be represented as matrices (if we consider each color channel separately), and low-rank approximations of images can give recognizable results. In the figure below, we can see several low-rank approximations of an image, with higher ranks giving better approximations of the original image.

To determine the “best” rank-\(r\) approximation of a matrix \(A\), we can solve the following optimization problem:

\[\min_{\operatorname{rank}(B) \leq r} \|A - B\|\]

There are several ways to solve this approximation problem, but luckily in this case there is a simple closed-form solution known as the *truncated SVD*. To apply this method using numpy, we can use the following code:

```
import numpy as np

def low_rank_approx(A, r):
    U, S, Vt = np.linalg.svd(A)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
```

One disadvantage of this particular function is that it hides the fact that the output has rank \(\leq r\), since we’re just returning an \(m\times n\) matrix. However, we can fix this easily as follows:

```
def low_rank_approx(A, r):
    U, S, Vt = np.linalg.svd(A)
    X = U[:, :r] @ np.diag(S[:r])
    Y = Vt[:r, :].T
    return X, Y
```
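As a quick sanity check (re-defining the function above for completeness), the factored approximation attains the optimal Frobenius error \(\sqrt{\sum_{i>r}\sigma_i^2}\) given by the Eckart–Young theorem:

```
import numpy as np

def low_rank_approx(A, r):
    # Truncated SVD, returned in factored form A ≈ X @ Y.T
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :r] @ np.diag(S[:r])
    Y = Vt[:r, :].T
    return X, Y

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
X, Y = low_rank_approx(A, r=5)

# Eckart–Young: the best rank-5 Frobenius error is the norm of the discarded singular values
S = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - X @ Y.T)
assert np.isclose(err, np.sqrt(np.sum(S[5:]**2)))
```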

Low-rank matrices are computationally efficient because they enable fast products. If we have two large \(n\times n\) matrices, it takes \(O(n^3)\) flops to compute their product using conventional matrix multiplication algorithms. However, if we can express the matrices as low-rank products, such as \(A=X_1Y_1^\top\) and \(B=X_2Y_2^\top\), then computing their product requires only \(O(rn^2)\) flops. Even better, the product can be expressed as the product of two \(n\times r\) matrices using only \(O(r^2n)\) flops, which is potentially much less than \(O(n^3)\). Similarly, if we want to multiply a matrix with a vector, a low-rank representation can greatly reduce the computational cost.
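To make this concrete, here is a small sketch (sizes picked arbitrarily) of multiplying two low-rank matrices without ever forming an \(n\times n\) product:

```
import numpy as np

rng = np.random.default_rng(1)
n, r = 200, 5
X1, Y1 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
X2, Y2 = rng.standard_normal((n, r)), rng.standard_normal((n, r))

# With A = X1 @ Y1.T and B = X2 @ Y2.T, group the product as
# A @ B = X1 @ (Y1.T @ X2) @ Y2.T: only r x r and n x r intermediates
# appear, so the cost is O(r^2 n) instead of O(n^3).
M = Y1.T @ X2        # r x r
X_prod = X1 @ M      # n x r
# The product A @ B is now available as the low-rank pair (X_prod, Y2)
assert np.allclose(X_prod @ Y2.T, (X1 @ Y1.T) @ (X2 @ Y2.T))
```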

The decomposition of a size \(m\times n\) matrix \(A\) into \(A=XY^\top\) uses only \(r(m+n)\) parameters, compared to the \(mn\) parameters required for the full matrix. (In fact, due to the non-uniqueness of the decomposition, we only need \(r(m+n) - r^2\) parameters.) This implies that with more than \(r(m+n)\) entries of the matrix known, we can hope to infer the remaining entries of the matrix. This is called *matrix completion*, and we can achieve it by solving an optimization problem, such as:

\[\min_{\operatorname{rank}(B) \leq r} \|\mathcal P_\Omega(A) - \mathcal P_\Omega(B)\|,\]

where \(\Omega\) is the set of known entries of \(A\), and \(\mathcal P_\Omega\) is the projection that sets all entries of a matrix not in \(\Omega\) to zero.

Matrix completion is illustrated in two examples of reconstructing an image as a rank 100 matrix, where 2.7% of the pixels were removed. The effectiveness of matrix completion depends on the distribution of the unknown pixels. When the unknown pixels are spread out, matrix completion works well, as seen in the first example. However, when the unknown pixels are clustered in specific regions, matrix completion does not work well, as seen in the second example.

There are various methods to solve the matrix completion problem, and one simple technique is alternating least squares optimization, which I also discussed in this blog post. This approach optimizes the matrices \(X\) and \(Y\) alternately, given the decomposition \(B=XY^\top\). Another interesting approach is to solve a slightly different optimization problem which turns out to be *convex*, and which can thus be solved using the machinery of convex optimization. Another effective method is Riemannian gradient descent, which is also useful for low-rank tensors. The idea is to treat the set of low-rank matrices as a Riemannian manifold and then use gradient descent methods. The gradient is projected onto the tangent space of the manifold, and a step is taken in the projected direction, which keeps us close to the constraint set. The projection back onto the manifold is usually cheap to compute, and can be combined with the step into a single operation known as a *retraction*. The challenge of Riemannian gradient descent is to find a retraction that is both cheap to compute and effective for optimizing the objective.
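As an illustration of the alternating idea – a minimal sketch on a fully observed matrix rather than the completion setting, with a hypothetical helper name – each half-step fixes one factor and solves an ordinary least-squares problem for the other:

```
import numpy as np

def als_low_rank(A, r, iters=200):
    # Alternately minimize ||A - X @ Y.T||_F over X (with Y fixed) and
    # over Y (with X fixed); each half-step is a linear least-squares solve.
    m, n = A.shape
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((n, r))
    for _ in range(iters):
        # X-step: solve Y @ X.T ≈ A.T for X
        X = np.linalg.lstsq(Y, A.T, rcond=None)[0].T
        # Y-step: solve X @ Y.T ≈ A for Y
        Y = np.linalg.lstsq(X, A, rcond=None)[0].T
    return X, Y

# On a fully observed matrix, ALS converges to the best rank-r approximation
A = np.random.default_rng(1).standard_normal((40, 30))
X, Y = als_low_rank(A, r=5)
S = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - X @ Y.T), np.sqrt(np.sum(S[5:]**2)), rtol=1e-2)
```

In the completion setting, the same alternation applies, but each least-squares problem only involves the known entries \(\Omega\).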

Let’s now move on to the basics of tensors. A tensor is a multi-dimensional array, with a vector being an *order 1 tensor* and a matrix being an *order 2 tensor*. An order 3 tensor can be thought of as a collection of matrices or an array whose entries can be represented by a cube of values. Unfortunately, this geometric way of thinking about tensors becomes hard to visualize at higher orders, though in principle it still works.

Some examples of tensors include:

- Order 1 (vector): *audio signals, stock prices*
- Order 2 (matrix): *grayscale images, Excel spreadsheets*
- Order 3: *color images, B&W videos, Minecraft maps, MRI scans*
- Order 4: *color video, fMRI scans*

Recall that a matrix is low-rank if it is the product of two smaller matrices, that is, \(A=XY^\top\). Unfortunately, this notation doesn’t generalize well to tensors. Instead, we can write down each entry of \(A\) as a sum: \(A[i,j] = \sum_{\ell=1}^rX[i,\ell]Y[j,\ell]\)

Similarly, we could write an order 3 tensor as a product of 3 matrices as follows:

\[A[i, j, k] = \sum_{\ell=1}^r X[i,\ell]Y[j,\ell]Z[k,\ell]\]

However, if we’re dealing with more complicated tensors of higher order, this kind of notation can quickly become unwieldy. One way to get around this is to use a diagrammatic notation, where tensors are represented by boxes with one leg (edge) for each of the tensor’s indices. Connecting two boxes via one of these legs denotes summation over the associated index. For example, matrix multiplication is denoted as follows:

To make it clearer which legs can be contracted together, it’s helpful to label them with the dimension of the associated index; it is only possible to sum over an index belonging to two different tensors if they have the same dimension.

We can for example use the following diagram to depict the low-rank order 3 tensor described above:

In this case, we sum over the same index for three different matrices, so we connect the three legs together in the diagram. This resulting low-rank tensor is called a “CP tensor”, where “CP” stands for “canonical polyadic”. This tensor format can be easily generalized to higher order tensors. For an order-d tensor, we can represent it as:

\[A[i_1,i_2,\dots,i_d] = \sum_{\ell=1}^r X_1[i_1,\ell]X_2[i_2,\ell]\cdots X_d[i_d,\ell]\]

The CP tensor format is a natural generalization of low-rank matrices and is straightforward to formulate. However, finding a good approximation of a given tensor as a CP tensor of a specific rank can be difficult, unlike matrices where we can use the truncated SVD to solve this problem. To overcome this limitation, we can use a slightly more complex tensor format known as the “*tensor train format*” (TT).
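As a concrete aside, the order-3 CP formula translates directly into `np.einsum` (the sizes here are arbitrary):

```
import numpy as np

rng = np.random.default_rng(4)
n1, n2, n3, r = 4, 5, 6, 3
X = rng.standard_normal((n1, r))
Y = rng.standard_normal((n2, r))
Z = rng.standard_normal((n3, r))

# A[i,j,k] = sum_l X[i,l] Y[j,l] Z[k,l]: the shared index l is summed over
A = np.einsum('il,jl,kl->ijk', X, Y, Z)

# Check one entry against the explicit sum
i, j, k = 1, 2, 3
assert np.isclose(A[i, j, k], sum(X[i, l] * Y[j, l] * Z[k, l] for l in range(r)))
```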

Let’s consider the formula for a low-rank matrix decomposition once again: \(A=C_1C_2\)

We can express this as follows:

\[A[i_1, i_2] = \sum_{\ell=1}^r C_1[i_1,\ell]C_2[\ell,i_2]\]

Alternatively, we can rewrite this as the product of row \(i_1\) of the first matrix and column \(i_2\) of the second matrix, like so: \(A[i_1,i_2] = C_1[i_1,:]C_2[:,i_2]\). This can be visualized as follows:

To extend this to an order-3 tensor, we can represent \(A[i_1,i_2,i_3]\) as the product of 3 vectors, which is known as the CP tensor format. Alternatively, we can express \(A[i_1,i_2,i_3]\) as a vector-matrix-vector product, like so:

\[\begin{align} A[i_1,i_2,i_3] &= C_1[i_1,:]C_2[:,i_2,:]C_3[:,i_3]\\ &= \sum_{\ell_1=1}^{r_1}\sum_{\ell_2=1}^{r_2} C_1[i_1,\ell_1]C_2[\ell_1,i_2,\ell_2]C_3[\ell_2,i_3] \end{align}\]This can be represented visually as shown below:

Extending this to an arbitrary order is straightforward. For example, for an order-4 tensor, we would write each entry of the tensor as a vector-matrix-matrix-vector product, like so:

\[\begin{align} A[i_1,i_2,i_3,i_4] &= C_1[i_1,:]C_2[:,i_2,:]C_3[:,i_3,:]C_4[:,i_4]\\ &= \sum_{\ell_1=1}^{r_1}\sum_{\ell_2=1}^{r_2} \sum_{\ell_3=1}^{r_3} C_1[i_1,\ell_1]C_2[\ell_1,i_2,\ell_2]C_3[\ell_2,i_3,\ell_3] C_4[\ell_3,i_4]. \end{align}\]

This can be depicted like this:

Let’s translate the formula back into diagrammatic notation. We want to represent an order 4 tensor as a box with four legs, expressed as the product of a matrix, two order 3 tensors, and another matrix. This is the resulting diagram:

An arbitrary tensor train can be denoted as follows:

This notation helps us understand why this decomposition is called a tensor train. Each box (an order 2 or order 3 tensor) represents a “carriage” in the train. We can translate the above diagram into a train shape, like this:

Although I am not skilled at drawing, we can use Stable Diffusion to create a more aesthetically pleasing depiction:

From what we have seen so far it is not obvious what makes the tensor train decomposition such a useful tool. Although these properties are not unique to the tensor train decomposition, here are some reasons why it is a good decomposition for many applications.

**Computing entries is fast:** Computing an arbitrary entry \(A[i_1,\dots,i_d]\) is very fast, requiring just a few matrix-vector products. These operations can be efficiently done in parallel using a GPU as well.
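For instance, here is a sketch of entry evaluation, assuming the cores are stored as order-3 arrays with dummy boundary ranks of size 1 (a common convention, though not the only one):

```
import numpy as np

def tt_entry(cores, idx):
    # cores[mu] has shape (r_mu, n_mu, r_{mu+1}) with r_0 = r_d = 1.
    # Fixing the index i_mu slices each core to a matrix; the entry is the
    # product of these small matrices (a handful of matrix-vector products).
    v = cores[0][:, idx[0], :]          # shape (1, r_1)
    for core, i in zip(cores[1:], idx[1:]):
        v = v @ core[:, i, :]
    return v[0, 0]

rng = np.random.default_rng(0)
shapes = [(1, 4, 3), (3, 5, 3), (3, 6, 3), (3, 7, 1)]
cores = [rng.standard_normal(s) for s in shapes]

# Compare against the dense tensor formed by contracting all cores
dense = np.einsum('aib,bjc,ckd,dle->ijkl', *cores)
assert np.isclose(tt_entry(cores, (1, 2, 3, 4)), dense[1, 2, 3, 4])
```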

**Easy to implement:** Most algorithms involving tensor trains are not difficult to implement, which makes them easier to adopt. A similar tensor decomposition known as the hierarchical Tucker decomposition is much trickier to use in practical code, which is likely why it is less popular than tensor trains despite being theoretically superior for many purposes.

**Dimensional scaling:** If we keep the ranks \(r = r_1=\dots=r_{d-1}\) of an order-\(d\) tensor train fixed, then the amount of data required to store and manipulate a tensor train only scales linearly with the order of the tensor. A dense tensor format would scale exponentially with the tensor order and quickly become unmanageable, so this is an important property. Another way to phrase this is that tensor trains *do not suffer from the curse of dimensionality.*
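A quick back-of-the-envelope comparison with illustrative sizes:

```
# Storage for an order-d tensor with mode sizes n and TT-ranks r.
n, r = 10, 5

def dense_params(d):
    return n ** d

def tt_params(d):
    # d cores of size r x n x r (the two boundary cores are only n x r,
    # so this slightly overcounts)
    return d * n * r ** 2

assert dense_params(12) == 10 ** 12   # a trillion entries: hopeless to store
assert tt_params(12) == 3000          # linear in d vs exponential
```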

**Orthogonality and rounding:** Tensor trains can be *orthogonalized* with respect to any mode. They can also be *rounded*, i.e. we can lower all the ranks of the tensor train. These two operations are extremely useful for many algorithms and have a reasonable computational cost of \(O(r^3nd)\) flops, and are also very simple to implement.

**Nice Riemannian structure:** The tensor trains of a fixed maximum rank form a Riemannian manifold. The tangent space, and orthogonal projections onto this tangent space, are relatively easy to work with and compute. The manifold is also topologically closed, which means that optimization problems on this manifold are well-posed. These properties allow for some very efficient Riemannian optimization algorithms.

With the understanding that low-rank tensors can effectively represent discretized functions, I will demonstrate how tensor trains can be utilized to create a unique type of machine learning estimator. To avoid redundancy, I will provide a condensed summary of the topic, and I invite you to read my more detailed blog post on this subject if you would like to learn more.

Let’s consider a function \(f(x,y)\colon I^2\to \mathbb R\) and plot its values on a square. For instance, we can use the following function:

\[f(x,y) = 3\cos(10(x^2 + y^2/2)) -\sin(20(2x-y))/2\]

Note that grayscale images can be represented as matrices, so if we use \(m\times n\) pixels to plot the function, we get an \(m\times n\) matrix. Surprisingly, this matrix always has rank 4, irrespective of its size. We illustrate the rows of the matrices \(X\) and \(Y\) of the low-rank decomposition \(A=XY^\top\) below. Notice that increasing the matrix size doesn’t visibly alter the low-rank decomposition.
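We can verify the rank-4 claim numerically by sampling \(f\) on grids of different sizes (the unit square and the tolerance below are illustrative choices):

```
import numpy as np

def f(x, y):
    return 3 * np.cos(10 * (x**2 + y**2 / 2)) - np.sin(20 * (2 * x - y)) / 2

for m, n in [(50, 60), (200, 300)]:
    x = np.linspace(0, 1, m)[:, None]   # column of x-samples
    y = np.linspace(0, 1, n)[None, :]   # row of y-samples
    A = f(x, y)                         # broadcasting gives the m x n matrix
    # Up to floating-point error, the sampled matrix has rank 4 regardless of size
    assert np.linalg.matrix_rank(A, tol=1e-8) == 4
```

The reason is that both the cosine and the sine term split, via angle-addition formulas, into two products of a function of \(x\) with a function of \(y\).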

This suggests that low-rank matrices can potentially capture complex 2D functions. We can extend this to higher dimensions by using low-rank tensors to represent intricate functions.

We can use low-rank tensors to parametrize complicated functions using relatively few parameters, which makes them suitable as a supervised learning model. Suppose we have a few samples \(y_j = \hat f(x_j)\) with \(j=1,\dots, N\) for \(x_j\in\mathbb R^{d}\), where \(\hat f\) is an unknown function. Let \(f_A\) be the discretized function obtained from a tensor or matrix \(A\). We can formulate supervised learning as the following least-squares problem:

\[\min_{A} \sum_{j=1}^N (f_A(x_j) - y_j)^2\]

Each data point \(x_j\) corresponds to an entry \(A[i_1(x_j),\dots,i_d(x_j)]\), which allows us to rephrase the least-squares problem as a matrix/tensor completion problem.

Let’s see this in action for the 2D/matrix case to gain some intuition. First, let’s generate some random points in a square and sample the function \(f(x,y)\) defined above. On the left, we see a scatterplot of the random value samples, and next, we see what this looks like as a discretized function/matrix.

If we now apply matrix completion to this, we get the following. First, we see the completed matrix using a rank-8 matrix, and then the matrices \(X, Y\) of the decomposition \(A=XY^\top\).

What we have so far is already a usable supervised learning model; we plug in data, and as output, it can make reasonably accurate predictions for data points it hasn’t seen so far. However, the data used to train this model is uniformly distributed across the domain. Real data is rarely like that, and if we repeat the same experiment with less uniformly distributed data, the result is less impressive:

How can we get around this? Well, if the data is not uniform, then why should we use a uniform discretization? For technical reasons, the discretization is still required to be a grid, but we can adjust the spacing of the grid points to better match the data. If we do this, we get something like this:

While the final function (3rd plot) may look odd, it does achieve two important things. First, it makes the matrix-completion problem easier because we start with a matrix where a larger percentage of entries are known. And secondly, the resulting function is accurate in the vicinity of the data points. So long as the distribution of the training data is reasonably similar to the test data, this means that the model is accurate on most test data points. The model is potentially not very accurate in some regions, but this may simply not matter in practice.

Now, let’s dive into the high-dimensional case and examine how tensor trains can be employed for supervised learning. In contrast to low-rank matrices, we will utilize tensor trains to parameterize the discretized functions. This involves solving an optimization problem of the form:

\[\min_{A\in \mathscr M}\sum_{j=1}^N\left(A[i_1(\mathbf x_j),\dots, i_d(\mathbf x_j)]-y_j\right)^2,\tag{$\star$}\]

where \(\mathscr M\) denotes the manifold of all tensor trains with a given maximum rank. To tackle this optimization problem effectively, we can use the Riemannian structure of the tensor train manifold. This approach results in an optimization algorithm similar to gradient descent (with line search) but utilizing Riemannian gradients instead.

Unfortunately, the problem \((\star)\) is very non-linear and the objective has many local minima. As a result, any gradient-based method will only produce good results if it has a good initialization. The ideal initialization is a tensor that describes a discretized function with low training/test loss. Fortunately, we can easily obtain such tensors by training a different machine learning model (such as a random forest or neural network) on the same data and then discretizing the resulting function. However, this gives us a dense tensor, which is impractical to store in memory. We can still compute any particular entry of this tensor cheaply, equivalent to evaluating the model at a point. Using a technique called TT-cross, we can efficiently obtain a tensor-train approximation of the discretized machine learning model, which we can then use as initialization for the optimization routine.

But why go through all this trouble? Why not just use the initialization model instead of the tensor train? The answer lies in the speed and size of the resulting model. The tensor train model is much smaller and faster than the model it is based on, and accessing any entry in a tensor train is extremely fast and easy to parallelize. Moreover, low-rank tensor trains can still parameterize complicated functions.

To summarize the potential advantages of TTML, consider the following three graphs (for more details, please refer to my thesis):

Based on the results shown in the graphs, it is clear that the TTML model has a significant advantage over other models in terms of size and speed. It is much smaller and faster than other models while maintaining a similar level of test error. However, it is important to note that the performance of the TTML model may depend on the specific dataset used, and in many practical machine learning problems, its test error may not be as impressive as in the experiment shown. That being said, if speed is a crucial factor in a particular application, the TTML model can be a very competitive option.

As we have seen above, the singular value decomposition (SVD) can be used to find the best low-rank approximation of any matrix. Unfortunately, the SVD is rather expensive to compute, costing \(O(mn^2)\) flops for an \(m\times n\) matrix. Moreover, while SVD can also be used to compute good low-rank TT approximations of any tensor, the cost of the SVD can become prohibitively expensive in this context. Therefore, we need a faster way to compute low-rank matrix approximations.

In my blog post I discussed some iterative methods to compute low-rank approximations using only matrix-vector products. However, there are even faster non-iterative methods that are based on multiplying the matrix of interest with a random matrix.

Specifically, if \(A\) is a rank-\(\hat r\) matrix of size \(m\times n\) and \(X\) is a random matrix of size \(n\times r\) with \(r>\hat r\), then it turns out that the product \(AX\) almost always has the same range as \(A\). This is because multiplying by \(X\) like this doesn’t change the rank of \(A\) unless \(X\) is chosen adversarially. Since we assume \(X\) is chosen randomly, this almost never happens – and here ‘almost never’ is meant in the mathematical sense, i.e., with probability zero. As a result, we have the identity \(\mathcal P_{AX}A =A\), where \(\mathcal P_{AX}\) denotes the orthogonal projection onto the range of \(AX\). This projection can be computed using the QR decomposition of \(AX\), or it can be seen simply as the matrix whose columns form an orthonormal basis of the range of \(AX\).

If \(A\) has rank greater than \(r\), however, then \(\mathcal P_{AX}A\neq A\). Nevertheless, we might hope that these two matrices are close, i.e. we may hope that (with probability 1) we have

\[\|\mathcal P_{AX}A - A\| \leq C\|A_{\hat r}-A\|,\]

for some constant \(C\) that depends only on \(\hat r\) and the dimensions of the problem. Recall that here \(A_{\hat r}\) denotes the best rank-\(\hat r\) approximation of \(A\) (which we can compute using the SVD). It turns out that this is true, and it gives a very simple algorithm for computing a low-rank approximation of a matrix using only \(O(mnr)\) flops – a huge gain if \(r\) is much smaller than the size of the matrix. This is known as the Halko-Martinsson-Tropp (HMT) method, and can be implemented in Python like this:

```
def hmt_approximation(A, r):
    m, n = A.shape
    X = np.random.normal(size=(n, r))
    AX = A @ X
    Q, _ = np.linalg.qr(AX)
    return Q, Q.T @ A
```

Since \(Q(Q^\top A) = \mathcal P_{AX}A\), this gives a low-rank approximation. It can also be used to obtain an approximate truncated SVD with a minor modification: if we take the SVD \(U\Sigma V^\top = Q^\top A\), then \((QU)\Sigma V^\top\) is an approximate truncated SVD of \(A\). In Python we could implement this like this:

```
def hmt_truncated_svd(A, r):
    Q, QtA = hmt_approximation(A, r)
    U, S, Vt = np.linalg.svd(QtA, full_matrices=False)
    return Q @ U, S, Vt
```

The HMT method, while efficient, has some drawbacks compared to other randomized methods. For instance, it cannot compute a low-rank decomposition of the sum \(A+B\) of two matrices *in parallel* since the QR decomposition of \((A+B)X\) requires the computation of \((A+B)X\) first. Additionally, if a low-rank approximation \(Q(Q^\top A)\) has already been computed and a small change \(B\) is made to \(A\) to obtain \(A' = A + B\), it is not possible to compute an approximation of the same rank for \(A'\) without redoing most of the work.

The issue arises from the fact that computing a QR decomposition of the product \(AX\) is nonlinear and expensive. One way to address this is by introducing a second random matrix \(Y\) of size \(m\times r\) and computing a decomposition of \(Y^\top AX\). This matrix has a much smaller size of \(r\times r\), allowing for efficient computations if \(r\) is small. Furthermore, computing \(Y^\top(A+B)X\) can be performed entirely in parallel. This results in a low-rank decomposition of the form:

\[A\approx AX(Y^\top AX)^\dagger Y^\top A,\]

where \(\dagger\) denotes the *pseudo-inverse*. This is a generalization of the matrix inverse to non-invertible or rectangular matrices, and it can be computed using the SVD of a matrix. If \(A=U\Sigma V^\top\), then \(A^\dagger = V\Sigma^{-1} U^\top\), where to compute \(\Sigma^{-1}\) we set \(\Sigma^{-1}[i,i] = 1/\Sigma[i,i]\) for each nonzero diagonal entry and leave the zero entries as they are.
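This recipe can be written out directly in a few lines (a sketch mirroring what `np.linalg.pinv` computes):

```
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    S_inv = np.zeros_like(S)
    nonzero = S > tol * S[0]           # zero singular values are left untouched
    S_inv[nonzero] = 1.0 / S[nonzero]
    return Vt.T @ np.diag(S_inv) @ U.T

A = np.random.default_rng(6).standard_normal((8, 5))
assert np.allclose(pinv_via_svd(A), np.linalg.pinv(A))
```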

The randomized decomposition we discussed is known by different names and has appeared in slightly different forms many times in the literature. In my thesis, I refer to it as the “generalized Nyström” (GN) method. Like the HMT method, it is “quasi-optimal”, which means that it satisfies:

\[\|AX(Y^\top AX)^\dagger Y^\top A - A\|\leq C\|A_{\hat r}-A\|.\]

However, there are two technical caveats that need to be discussed. The first is that we need to choose \(X\) and \(Y\) to be of different sizes; that is, we must have \(X\) of size \(n \times r_R\) and \(Y\) of size \(m \times r_L\) with \(r_L \neq r_R\). This is because otherwise \((Y^\top AX)^\dagger\) can have strange behavior. For example, the expected spectral norm \(\mathbb{E} \|(Y^\top AX)^\dagger\|_2\) is infinite if \(r_L=r_R\).

The second caveat is that explicitly computing \((Y^\top AX)^\dagger\) and then multiplying it by \(AX\) and \(Y^\top A\) can lead to numerical instability. However, a product of the form \(A^\dagger B\) is equivalent to the solution of a linear problem of the form \(AX=B\). As a result, we could implement this method in Python as follows:

```
def generalized_nystrom(A, rank_left, rank_right):
    m, n = A.shape
    X = np.random.normal(size=(n, rank_right))
    Y = np.random.normal(size=(m, rank_left))
    AX = A @ X
    YtA = Y.T @ A
    YtAX = Y.T @ AX
    # Solve (YtAX) R = YtA in the least-squares sense instead of
    # explicitly forming the pseudo-inverse
    R, *_ = np.linalg.lstsq(YtAX, YtA, rcond=None)
    L = AX
    return L, R
```

Note that this method applies the pseudo-inverse implicitly by solving a least-squares problem, which is more stable and efficient than explicitly computing the pseudo-inverse.
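As a quick sanity check (with illustrative sizes and oversampling, re-deriving the factors as described above), the GN error should lie within a modest factor of the best low-rank error:

```
import numpy as np

def generalized_nystrom(A, rank_left, rank_right):
    m, n = A.shape
    rng = np.random.default_rng(7)
    X = rng.standard_normal((n, rank_right))
    Y = rng.standard_normal((m, rank_left))
    AX, YtA = A @ X, Y.T @ A
    # A ≈ AX @ R with R = (Y^T A X)^† Y^T A, via a least-squares solve
    R, *_ = np.linalg.lstsq(Y.T @ AX, YtA, rcond=None)
    return AX, R

# Build a test matrix with rapidly decaying singular values
rng = np.random.default_rng(8)
U, _ = np.linalg.qr(rng.standard_normal((100, 100)))
V, _ = np.linalg.qr(rng.standard_normal((80, 80)))
S = 2.0 ** -np.arange(80)
A = U[:, :80] @ np.diag(S) @ V.T

L, R = generalized_nystrom(A, rank_left=15, rank_right=10)
err = np.linalg.norm(A - L @ R)
best = np.linalg.norm(S[10:])       # best rank-10 error in the Frobenius norm
assert err < 100 * best             # quasi-optimality, with a generous constant
```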

Next, we will see how to generalize the GN method to a method for tensor trains. Unfortunately this will get a little technical. Recall that for the GN decomposition we had a decomposition of form

\[A \approx AX(Y^\top AX)^\dagger Y^\top A.\]

To generalize the GN method to the tensor case, we can take a matricization of the tensor \(\mathcal{T}\) with respect to a mode \(\mu\), which gives a matrix \(\mathcal{T}^{\leq \mu}\) of size \((n_1 \cdots n_\mu) \times (n_{\mu+1} \cdots n_d)\). We can then multiply this matrix on the left and right with random matrices \(Y_\mu\) and \(X_\mu\) of size \((n_1 \cdots n_\mu) \times r_L\) and \((n_{\mu+1} \cdots n_d) \times r_R\), respectively, to obtain the product

\[\Omega_\mu := Y_\mu^\top \mathcal T^{\leq \mu} X_\mu.\]

We call this product a *‘sketch’*, and this particular sketch corresponds to the product \(Y^\top AX\) in the matrix case. We visualize the computation of this sketch below:

If we think of a matrix as a tensor with only two modes (legs), then the products \(AX\) and \(Y^\top A\) correspond to multiplying the tensor by matrices such that ‘one mode is left alone’. From this perspective, we can generalize these products to the sketch

\[\Psi_\mu := (Y_{\mu-1}\otimes I_{n_\mu})^\top \mathcal T^{\leq \mu} X_\mu,\]

where \(Y_{\mu-1}\) is now a matrix of size \((n_1\cdots n_{\mu-1})\times r_L\), also depicted below:

By extension, we can define \(Y_0=X_d=1\), and then the definition of \(\Psi_\mu\) reduces to \(\Psi_1=AX\) and \(\Psi_2=Y^\top A\) in the matrix case. We can therefore rewrite the GN method as

\[AX(Y^\top AX)^\dagger Y^\top A = \Psi_1\Omega_1^\dagger \Psi_2.\]

More generally, we can chain the sketches \(\Omega_\mu\) and \(\Psi_\mu\) together to form a tensor network of the following form:

With only minor work, we can turn this tensor network into a tensor train. It turns out that this defines a very useful approximation method for tensor trains. However, it may not be immediately clear why this method gives an approximation to the original tensor. To gain some insight, we can rewrite this decomposition into a different form. We can then see that this approximation boils down to successively applying a series of projections to the original tensor. However, the proof of this fact, as well as the error analysis, is outside the scope of this blog post.

We call this decomposition the *streaming tensor train approximation*, and it has several nice properties. First of all, as its name suggests, it is a *streaming method*. This means that if we have a tensor that decomposes as \(\mathcal T = \mathcal T_1+\mathcal T_2\), then we can compute the approximation for \(\mathcal T_1\) and \(\mathcal T_2\) completely independently, and only spend a small amount of effort at the end of the procedure to combine the results. This is because all the sketches \(\Omega_\mu\) and \(\Psi_\mu\) are linear in the input tensor, and the final step of computing a tensor train from these sketches is very cheap (and in fact even optional).

The decomposition is also *quasi-optimal*. This means that the error of this approximation will (with high probability) lie within a constant factor of the error of the best possible approximation. Unlike the matrix case, however, it is not possible in general to compute the best possible approximation itself in a reasonable time.

The cost of computing this decomposition varies depending on the type of tensor being used. It is easy to derive the cost for the case where \(\mathcal T\) is a ‘dense’ tensor, i.e. just a multidimensional array. However, it rarely makes sense to apply a method like this to such a tensor; usually, we apply it to tensors that are far too big to even store in memory. Instead, we assume that the tensor already has some structure. For example, \(\mathcal T\) could be a CP tensor, a sparse tensor, or even a tensor train itself. For each type of tensor, we can then derive a fast way to compute this decomposition, especially if we allow the matrices \(X_\mu\) and \(Y_\mu\) to be structured tensors. The exact implementation is a little technical to discuss here, but suffice it to say the method is quite fast in practice.

In addition to this decomposition, we also invented a second kind of decomposition (called *OTTS*) that is somewhat of a hybrid generalization of the GN and HMT methods. It is no longer a streaming method, but in certain cases it is significantly more accurate than the previous method, and it can be applied in almost all of the same cases. Finally, there is also a generalization (called *TT-HMT*) of the HMT method to tensor trains, which has existed in the literature for a few years; it works in most of the same situations but is likewise not a streaming method.

Below we compare these three methods – STTA, OTTS and TT-HMT – to a generalization of the truncated SVD (TT-SVD). The latter method is generally expensive to compute but has a very good approximation error, making it an excellent benchmark.

In the plot above we have taken a 10x10x10x10x10 CP tensor and computed TT approximations of different ranks. What we see is that all methods have similar behavior, and are ordered from best to worst approximation as TT-SVD > OTTS > TT-HMT > STTA. This order is something we observe in general across many different experiments, and also in terms of theoretical approximation error. Furthermore, even though these last three methods are all randomized, the variation in approximation error is relatively small, especially for larger ranks. Next, we consider another experiment below:

Here we compare the scaling of the error of the different approximation methods as the order of the tensor increases. The tensor that we’re approximating in this case is always a tensor train with fast decay in its singular values, and the approximation error is always relative to the TT-SVD method; for such a tensor it is possible to compute the TT-SVD approximation in a reasonable time. While all methods have similar error scaling, we see that OTTS is closest in performance to TT-SVD. We can thus conclude that all these methods have their merits: we can use STTA if we work with a stream of data, OTTS if we want the best approximation (especially for very big tensors), and TT-HMT if we care more about speed than quality.

This sums up, on a high level, what I did in my thesis. There are plenty of things I could still talk about and plenty of details left out, but this blog post is already quite long and technical. If you’re interested to learn more, you’re welcome to read my thesis or send me a message.

My PhD was a long and fun ride, and I’m now looking back on it with nothing but fondness. I enjoyed my time in Geneva, and I’m going to miss some aspects of PhD life. I have now started a ‘real’ job, and the things I’m working on have very little overlap with the contents of my thesis. However, I hope I will be able to discuss some of the cool things I’m working on now, as well as some new personal projects.

Before we dive into the details of our new type of machine learning model, let’s sit back for a moment and think: *what is machine learning in the first place?* Machine learning is all about *learning from data*. More specifically, in *supervised machine learning* we are given some *data points* \(X = (x_1,\dots,x_N)\), all lying in \(\mathbb R^d\), together with *labels* \(y=(y_1,\dots,y_N)\) which are just numbers. We then want to find some function \(f\colon \mathbb R^d\to \mathbb R\) such that \(f(x_i)\approx y_i\) for all \(i\), and such that \(f\) *generalizes well to new data*. Or rather, we want to minimize a loss function, for example the least-squares loss

\[\sum_{i=1}^N (f(x_i) - y_i)^2.\]

This is obviously an ill-posed problem, and there are two main issues with it:

- What *kind* of functions \(f\) are we allowed to choose?
- What does it mean to *generalize* well on new data?

The first issue has no general solution. We *choose* some class of functions, usually depending on some set
of parameters \(\theta\). For example, if we want to fit a quadratic function to our data we only look at
quadratic functions

\[f_\theta(x) = a + bx + cx^2,\]

and our set of parameters is \(\theta=\{a,b,c\}\). Then we minimize the loss over this set of parameters, i.e. we solve the minimization problem:

\[\min_{a,b,c} \sum_{i=1}^N (a+ bx_i+cx_i^2-y_i)^2.\]There are many parametric families \(f_\theta\) of functions we can choose from, and many different ways to
solve the corresponding minimization problem. For example, we can choose \(f_\theta\) to be neural networks
*with some specified layer sizes*, or a random forest with a fixed number of trees and fixed maximum tree
depth. Note that we should strictly speaking always specify hyperparameters like the size of the layers of a
neural network, since those hyperparameters determine what kind of parameters \(\theta\) we are going to
optimize. That is, hyperparameters affect the parametric family of functions that we are going to optimize.
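For the quadratic family, the minimization over \((a,b,c)\) is a linear least-squares problem, so it can be solved directly. A minimal sketch with synthetic data (all names and values here are illustrative):

```python
import numpy as np

# Synthetic data roughly following a quadratic (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.01, size=100)

# The loss is quadratic in (a, b, c), so we can solve it as a linear
# least-squares problem with design matrix [1, x, x^2]
V = np.stack([np.ones_like(x), x, x**2], axis=1)
(a, b, c), *_ = np.linalg.lstsq(V, y, rcond=None)
print(a, b, c)  # close to 1, 2, 3
```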

The second issue, generalization, is typically solved through *cross-validation*. If we want to know whether
the function \(f_\theta\) we learned generalizes well to new data points, we should just keep part of the data
“hidden” during the training (the *test data*). After training we then evaluate our trained function on this
hidden data, and we record the loss function on this test data to obtain the *test loss*. The test loss is
then a good measure of how well the function can generalize to new data, and it is very useful if we want to
compare several different functions trained on the same data. Typically we also use a third set of data, the
*validation* dataset, for example for optimizing hyperparameters; see my blog post on the topic.

Keeping the general problem of machine learning in mind, let’s consider a particular class of parametric
functions: *discretized functions on a grid*. To understand this class of functions, we first look at the 1D
case. Let’s take the interval \([0,1]\), and chop it up into \(n\) equal pieces:

A discretized function is then one that *takes a constant value on each subinterval*. For example, below is a
discretized version of a sine function:

```
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

DEFAULT_FIGSIZE = (10, 6)
plt.figure(figsize=DEFAULT_FIGSIZE)

num_intervals = 10
num_plotpoints = 1000
x = np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints)


def f(x):
    return np.sin(x * 2 * np.pi)


plt.plot(x, f(x), label="original function")
plt.plot(
    x,
    f((np.floor(x * num_intervals) + 0.5) / num_intervals),
    label="discretized function",
)
plt.legend();
```

Note that if we divide the interval into \(n\) pieces, then we need \(n\) parameters to describe the discretized function \(f_\theta\).
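Evaluating such a function is just a table lookup: find which subinterval \(x\) lands in and return that parameter. A small sketch (the function name and parameter values are illustrative):

```python
import numpy as np

# A discretized function on [0, 1] is a lookup table:
# one parameter per subinterval (theta values here are illustrative)
theta = np.arange(10, dtype=float)

def f_discretized(x, theta):
    # Map x in [0, 1) to the index of its subinterval, clipping x = 1.0
    i = np.minimum((x * len(theta)).astype(int), len(theta) - 1)
    return theta[i]

print(f_discretized(np.array([0.05, 0.55, 0.999]), theta))  # [0. 5. 9.]
```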

In the 2D case we instead divide the square \([0,1]^2\) into a grid, and demand that a discretized function is *constant on each grid cell*. If we use \(n\) grid cells for each axis, this gives us \(n^2\) parameters. Let’s see what a discretized function looks like in a 3D plot:

```
fig = plt.figure(figsize=DEFAULT_FIGSIZE)
num_plotpoints = 200
num_intervals = 5


def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)


X_plotpoints, Y_plotpoints = np.meshgrid(
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
)

# Smooth plot
Z_smooth = f(X_plotpoints, Y_plotpoints)
ax = fig.add_subplot(121, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_smooth, cmap="inferno")
plt.title("original function")

# Discrete plot
X_discrete = (np.floor(X_plotpoints * num_intervals) + 0.5) / num_intervals
Y_discrete = (np.floor(Y_plotpoints * num_intervals) + 0.5) / num_intervals
Z_discrete = f(X_discrete, Y_discrete)
ax = fig.add_subplot(122, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_discrete, cmap="inferno")
plt.title("discretized function");
```

Before diving into higher-dimensional versions of discretized functions, let’s think about how we would solve the learning problem. As mentioned, we have \(n^2\) parameters, and we can encode these using an \(n\times n\) matrix \(\Theta\). We are doing supervised machine learning, so we have data points \(((x_1,y_1),\dots,(x_N,y_N))\) and corresponding labels \((z_1,\dots,z_N)\). Each data point \((x_i,y_i)\) corresponds to some entry \((j,k)\) in the matrix \(\Theta\); this is simply determined by the specific grid cell the data point happens to fall in.

If the points \(((x_{i_1},y_{i_1}),\dots,(x_{i_m},y_{i_m}))\) all fall into the grid cell \((j,k)\), then we can define \(\Theta[j,k]\) simply as the mean value of the labels of these points:

\[\Theta[j,k] = \frac{1}{m} \sum_{a=1}^m z_{i_a}\]But what do we do if we have no training data corresponding to some entry \(\Theta[j,k]\)? Then the only thing
we can do is make an educated guess based on the entries of the matrix we *do* know. This is the *matrix
completion problem*; we are presented with a matrix with some known entries, and we are tasked to find good
values for the unknown entries. We described this problem in some detail in the previous blog
post.
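As a small sketch (all names illustrative), we can build \(\Theta\) from cell means and see exactly which entries matrix completion would have to fill in:

```python
import numpy as np

# Illustrative data in [0, 1]^2 with labels z
n, N = 8, 100
rng = np.random.default_rng(1)
x, y = rng.uniform(size=(2, N))
z = x + 2 * y

# Grid-cell index (j, k) of every data point
j = np.minimum((x * n).astype(int), n - 1)
k = np.minimum((y * n).astype(int), n - 1)

# Accumulate label sums and counts per cell, then divide to get cell means
sums = np.zeros((n, n))
counts = np.zeros((n, n))
np.add.at(sums, (j, k), z)
np.add.at(counts, (j, k), 1)
Theta = np.divide(sums, counts, out=np.full((n, n), np.nan), where=counts > 0)
# Cells with no data stay NaN: exactly the entries matrix completion must fill in
```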

The main takeaway is this: to solve the matrix completion problem, we need to assume that the matrix has some extra structure. We typically assume that the matrix is of low rank \(r\), that is, we can write \(\Theta\) as a product \(\Theta=A B\) where \(A,B\) are of size \(n\times r\) and \(r\times n\) respectively. Intuitively, this is a useful assumption because now we only have to learn \(2nr\) parameters instead of \(n^2\). If \(r\) is much smaller than \(n\), then this is a clear gain.

From the perspective of machine learning, this changes the class of functions we are considering. Instead of
*all* discretized functions on our \(n\times n\) grid inside \([0,1]^2\), we now consider only those functions
described by a matrix \(\Theta=AB\) that has rank at most \(r\). This also changes the parameters; instead of
\(n^2\) parameters, we now only have \(2nr\) parameters describing the two matrices \(A,B\).

Real data is often not uniform, so unless we use a very coarse grid, some entries of \(\Theta[j,k]\) are always going to be unknown. For example below we show some more realistic data, with the same function as before plus some noise. The color indicates the value of the function \(f\) we’re trying to learn.

```
num_intervals = 8
N = 50


# A function to make somewhat realistic looking 2D data
def non_uniform_data(N):
    np.random.seed(179)
    X = np.random.uniform(size=N)
    X = (X + 0.5) ** 2
    X = np.mod(X ** 5 + 0.2, 1)
    Y = np.random.uniform(size=N)
    Y = (Y + 0.5) ** 3
    Y = np.sin(Y * 0.2 * np.pi + 1) + 1
    Y = np.mod(Y + 0.6, 1)
    X = np.mod(X + 3 * Y + 0.5, 1)
    Y = np.mod(0.3 * X + 1.3 * Y + 0.5, 1)
    X = X ** 2 + 0.4
    X = np.mod(X, 1)
    Y = Y ** 2 + 0.5
    Y = np.mod(Y + X + 0.4, 1)
    return X, Y


# The function we want to model
def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)


X_train, Y_train = non_uniform_data(N)
X_test, Y_test = non_uniform_data(N)
Z_train = f(X_train, Y_train) + np.random.normal(size=X_train.shape) * 0.2
Z_test = f(X_test, Y_test) + np.random.normal(size=X_test.shape) * 0.2

plt.figure(figsize=(7, 6))
plt.scatter(X_train, Y_train, c=Z_train, s=50, cmap="inferno", zorder=3)
plt.colorbar()

# Plot a grid
X_grid = np.linspace(1 / num_intervals, 1, num_intervals)
Y_grid = np.linspace(1 / num_intervals, 1, num_intervals)
plt.xlim(0, 1)
plt.ylim(0, 1)
for perc in X_grid:
    plt.axvline(perc, c="gray")
for perc in Y_grid:
    plt.axhline(perc, c="gray")
```

We plotted an 8x8 grid on top of the data. We can see that in some grid squares we have a lot of data points, whereas in other squares there’s no data at all. Let’s try to fit a discretized function described by an 8x8 matrix of rank 3 to this data. We can do this using the ttml package I developed.

```
from ttml.tensor_train import TensorTrain
from ttml.tt_rlinesearch import TTLS

rank = 3

# Indices of the matrix Theta for each data point
idx_train = np.stack(
    [np.searchsorted(X_grid, X_train), np.searchsorted(Y_grid, Y_train)], axis=1
)
idx_test = np.stack(
    [np.searchsorted(X_grid, X_test), np.searchsorted(Y_grid, Y_test)], axis=1
)

# Initialize random rank 3 matrix
np.random.seed(179)
low_rank_matrix = TensorTrain.random((num_intervals, num_intervals), rank)

# Optimize the matrix using iterative method
optimizer = TTLS(low_rank_matrix, Z_train, idx_train)
train_losses = []
test_losses = []
for i in range(50):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=Z_test, idx=idx_test)
    test_losses.append(test_loss)

plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
```

```
Final training loss: 0.0252
Final test loss: 0.0424
```

Above we see how the train and test loss develop during training. At first, both decrease rapidly. Then both start to decrease much more slowly, with the training loss below the test loss. This means the model overfits on the training data, but that is not necessarily a problem; the question is how much it overfits compared to other models. To see how good this model is, let’s compare it to a random forest.

```
from sklearn.ensemble import RandomForestRegressor
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(np.stack([X_train, Y_train], axis=1), Z_train)
Z_pred = forest.predict(np.stack([X_test, Y_test], axis=1))
test_loss = np.mean((Z_pred - Z_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
```

```
Random forest test loss: 0.0369
```

We see that the random forest is a little better than the discretized function. And in fact, most standard machine learning estimators will beat a discretized function like this. This is essentially because the discretized function is very simple, and more complicated estimators can do a better job describing the data.

Does this mean that we should stop caring about discretized functions? No: test loss is not the only criterion we should use to compare different estimators. Discretized functions like these have two big advantages:

- They use very few parameters compared to many common machine learning estimators.
- Making new predictions is *very* fast. Much faster, in fact, than with most other machine learning estimators.

This makes them excellent candidates for low-memory applications. For example, we may want to implement a machine learning model for a very cheap consumer device. If we don’t need extreme accuracy, and we pre-train the model on a more powerful device, discretized functions can be a very attractive option.

The generalization to \(d\)-dimensions is now straightforward; we take a \(d\)-dimensional grid on \([0,1]^d\), with
\(n\) subdivisions in each axis. Then we specify the value of our function \(f_\Theta\) on each of the \(n^d\) grid
cells. These \(n^d\) values form a *tensor* \(\Theta\), i.e. a \(d\)-dimensional array. We access the entries of
\(\Theta\) with a \(d\)-tuple of indices \(\Theta[i_1,i_2,\dots,i_d]\).

This suffers from the same problems as in the 2D case: the tensor \(\Theta\) is really big, and during training we would need at least one data point for each entry of the tensor. But the situation is even worse; even storing \(\Theta\) can be prohibitively expensive. For example, if \(d=10\) and \(n=20\), then we would need about 82 TB just to store the tensor! In fact, \(n=20\) grid points in each direction is not even that much, so in practice we might need a much bigger tensor still.
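We can sanity-check this back-of-the-envelope number, assuming 64-bit floats:

```python
# Full tensor storage: n^d entries at 8 bytes each (64-bit floats)
d, n = 10, 20
full_bytes = n**d * 8
print(f"{full_bytes / 1e12:.0f} TB")  # 82 TB
```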

In the 2D case we solved this problem by storing the matrix as the product of two smaller matrices. There this doesn’t actually save much memory, and we mainly did it so that we could solve the matrix completion problem; that is, so that we could actually fit the discretized function to data. In higher dimensions, however, storing the tensor in the right way can save immense amounts of space.

In the 2D case, we stored the matrix in low-rank form: as a product of two smaller matrices. But what is the
correct analogue of ‘low rank’ for tensors? Unfortunately (or fortunately), there are many answers to this
question. There are many ‘low-rank tensor formats’, all with very different properties. We will be focusing on
*tensor trains*. A tensor train decomposition of an \(n_1\times n_2\times \dots \times n_d\) tensor \(\Theta\)
consists of a set of \(d\) *cores* \(C_k\) of shape \(r_{k-1}\times n_k \times r_k\), where \((r_1,\dots,r_{d-1})\)
are the *ranks* of the tensor train. Using these cores we can then express the entries of \(\Theta\) using the
following formula:

\[\Theta[i_1,i_2,\dots,i_d] = C_1[1,i_1,:]\,C_2[:,i_2,:]\cdots C_d[:,i_d,1].\]

This may look intimidating, but the idea is actually quite simple. We should think of the core \(C_{k}\) as a
*collection* of \(n_k\) matrices \((C_k[1],\dots,C_k[n_k])\), each of shape \(r_{k-1}\times r_k\). The index \(i_k\)
then *selects* which of these matrices to use. The first and last cores are special: by convention
\(r_0=r_d=1\), which means that \(C_1\) is a collection of \(1\times r_1\) matrices, i.e. (row) vectors. Similarly,
\(C_d\) is a collection of \(r_{d-1}\times 1\) matrices, i.e. (column) vectors. Thus each entry of \(\Theta\) is
determined by a product like this:

row vector * matrix * matrix * … * matrix * column vector

The result is a number, since a row/column vector times a matrix is a row/column vector, and the product of a row and column vector is just a number. In fact, if we think about it, this is exactly how a low-rank matrix decomposition works as well. If we write a matrix \(\Theta = AB\), then

\[\Theta[i,j]=\sum_k A[i,k] B[k,j] = A[i,:]\cdot B[:,j].\]Here \(A[i,:]\) is a *row* of \(A\), and \(B[:,j]\) is a *column* of \(B\). In other words, \(A\)
is just a collection of row vectors, and \(B\) is just a collection of column vectors. Then to obtain an entry
\(\Theta[i,j]\), we select the \(i\text{th}\) row of \(A\) and the \(j\text{th}\) column of \(B\) and take the product.

In summary, a tensor train is a way to cheaply store large tensors. Assuming all ranks \((r_1,\dots,r_{d-1})\) are the same, a tensor train requires \(O(dr^2n)\) entries to store a tensor with \(O(n^d)\) entries; a huge gain if \(d\) and \(n\) are big. For context, if \(d=10\), \(n=20\), and \(r=10\) then instead of 82 TB we just need 131 KB to store the tensor; that’s about 9 orders of magnitude cheaper! Furthermore, computing entries of this tensor is cheap; it’s just a couple matrix-vector products.
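To make the storage count and the entry evaluation concrete, here is a small sketch (the helper name `tt_entry` is illustrative):

```python
import numpy as np

# A random tensor train: d cores of shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1
d, n, r = 10, 20, 10
rng = np.random.default_rng(2)
ranks = [1] + [r] * (d - 1) + [1]
cores = [rng.normal(size=(ranks[k], n, ranks[k + 1])) for k in range(d)]

def tt_entry(cores, idx):
    # row vector * matrix * ... * matrix * column vector
    v = cores[0][:, idx[0], :]           # shape (1, r_1)
    for C, i in zip(cores[1:], idx[1:]):
        v = v @ C[:, i, :]               # shape (1, r_k)
    return v[0, 0]

total_params = sum(C.size for C in cores)
print(total_params)  # 16400 entries instead of 20**10
value = tt_entry(cores, [0] * d)
```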

There is obviously a catch to this. Just like not every matrix is low-rank, not every tensor can be
represented by a low-rank tensor train. The point, however, is that tensor trains can efficiently represent
many tensors that we *do* care about. In particular, they are good at representing the tensors required for
discretized functions.

How can we learn a discretized function \([0,1]^d\to \mathbb R\) represented by a tensor train? Like in the
matrix case, many entries of the tensor are unobserved, and we have to *complete* these entries based on
the entries that we *can* estimate. In my post on matrix completion we have seen that even
the matrix case is tricky, and there are many algorithms to solve the problem. One thing these algorithms have
in common is that they are iterative algorithms minimizing some loss function. Let’s derive such an algorithm
for *tensor train completion*.

First of all, what is the loss function we want to minimize during training? It’s simply the least squares loss:

\[L(\Theta) = \sum_{j=1}^N(f_\Theta(x_j) - y_j)^2\]Each data point \(x_j\in [0,1]^d\) fits into some grid cell given by index \((i_1[j],i_2[j],\dots,i_d[j])\), so using the definition of the tensor train the loss \(L(\Theta)\) becomes

\[\begin{align*} L(\Theta) &= \sum_{j=1}^N (\Theta[i_1[j],i_2[j],\dots,i_d[j]] - y_j)^2\\ &= \sum_{j=1}^N(C_1[1,i_1[j],:]C_2[:,i_2[j],:]\cdots C_d[:,i_d[j],1] - y_j)^2 \end{align*}\]A straightforward approach to minimizing \(L(\Theta)\) is to just use *gradient descent*. We could compute the
derivatives with respect to each of the cores \(C_i\) and just update the cores using this derivative. This is,
however, very slow. There are two reasons for this, but they are a bit subtle:

- *There is a lot of curvature.* In gradient descent, the size of the step we can safely take depends on how big the *second derivatives* of the function are (its *curvature*). The derivative of a function is its *best linear approximation*, and gradient descent works faster when this linear approximation is a good approximation of the function. In our case, the function we are trying to optimize is *very non-linear*, so any linear approximation is going to be very bad. We are therefore forced to take really tiny steps during gradient descent, and convergence is going to be very slow.
- *There are a lot of symmetries.* For example, we can replace \(C_i\) and \(C_{i+1}\) with \(C_i A\) and \(A^{-1}C_{i+1}\) for any invertible matrix \(A\) without changing the tensor \(\Theta\). Gradient descent ‘doesn’t know’ about these symmetries, and keeps updating \(\Theta\) in directions that don’t affect \(L(\Theta)\).

To efficiently optimize \(L(\Theta)\), we can’t just use gradient descent as-is, and we are forced to take a
different route. While \(L(\Theta)\) is very non-linear as a function of the tensor train cores \(C_i\), it is only
quadratic in the *entries* of \(\Theta\), and we can easily compute its derivative:

\[\nabla_{\Theta}L(\Theta) = 2\sum_{j=1}^N \left(\Theta[i_1[j],i_2[j],\dots,i_d[j]] - y_j\right) E(i_1[j],i_2[j],\dots,i_d[j]),\]

where \(E(i_1,i_2,\dots,i_d)\) denotes a sparse tensor that’s zero in all entries *except* \((i_1,\dots,i_d)\)
where it takes value \(1\). In other words, \(\nabla_{\Theta}L(\Theta)\) is a *sparse tensor* that is both simple
and cheap to compute; it just requires sampling at most \(N\) entries of \(\Theta\). For gradient descent we would
then update \(\Theta\) by \(\Theta-\alpha \nabla_{\Theta}L(\Theta)\) with \(\alpha\) the stepsize. Unfortunately,
this expression is not a tensor train. However, we can try to *approximate*
\(\Theta-\alpha \nabla_{\Theta}L(\Theta)\) by a tensor train of the same rank as \(\Theta\).
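As a sketch of the sparse gradient (all names illustrative; the dense array `Theta` stands in for evaluating the tensor train at the observed indices):

```python
import numpy as np

# Illustrative setup: observed indices, labels, and current tensor values
shape = (4, 4, 4)
rng = np.random.default_rng(3)
idx = rng.integers(0, 4, size=(30, 3))  # (i_1[j], ..., i_d[j]) per data point
y = rng.normal(size=30)
Theta = rng.normal(size=shape)  # stand-in for the current tensor-train iterate

# Residuals at the observed entries only...
residuals = Theta[tuple(idx.T)] - y
# ...scattered back into a gradient tensor that is zero everywhere else
grad = np.zeros(shape)
np.add.at(grad, tuple(idx.T), 2 * residuals)
```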

Recall that we can approximate a matrix \(A\) by a rank \(r\) matrix by using the *truncated SVD* of \(A\). In fact
this is the best-possible approximation of \(A\) by a rank \(\leq r\) matrix. There is a similar procedure for
tensor trains; we can approximate a tensor \(\Theta\) by a rank \((r_1,\dots,r_{d-1})\) tensor train using the
TT-SVD procedure. While this is not the *best* approximation of \(\Theta\) by such a tensor train, it is
*‘quasi-optimal’* and pretty good in practice. The details of the TT-SVD procedure are a little involved, so
let’s leave it as a black box. We now have the following iterative procedure for optimizing \(L(\Theta)\):

1. Compute the sparse gradient \(\nabla_{\Theta}L(\Theta)\) by evaluating \(\Theta\) at the observed entries.
2. Take the gradient step \(\Theta-\alpha \nabla_{\Theta}L(\Theta)\) for some stepsize \(\alpha\).
3. Approximate the result by a tensor train of the same rank as \(\Theta\) using TT-SVD, and repeat.
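Out of curiosity, we can peek inside the black box. A bare-bones TT-SVD sketch (no error control, and assuming the full tensor fits in memory, unlike in our completion setting) looks like this:

```python
import numpy as np

def tt_svd(T, ranks):
    """Approximate a full tensor T by a tensor train with the given ranks.

    ranks has d-1 entries; returns d cores of shape (r_{k-1}, n_k, r_k)
    with r_0 = r_d = 1.
    """
    shape, d = T.shape, T.ndim
    cores, r_prev = [], 1
    C = T.reshape(r_prev * shape[0], -1)
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(ranks[k], len(S))
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        # Carry the remainder along and unfold for the next step
        C = (S[:r, None] * Vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

# Sanity check on an exactly rank-(1,1) tensor: outer product of three vectors
a, b, c = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 6.0)
T = np.einsum('i,j,k->ijk', a, b, c)
cores = tt_svd(T, [1, 1])
T_rec = np.einsum('aib,bjc,ckd->ijk', *cores)
print(np.allclose(T, T_rec))  # True
```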

If you’re familiar with optimizing neural networks, you might notice that this procedure could work very well
with *stochastic gradient descent*. Indeed \(\nabla_{\Theta}L(\Theta)\) is a sum over all the data points, so we
can just pick a subset of data points (a minibatch) to obtain a stochastic gradient. The reason we would want
to do this is that we have so many data points that the cost of each step is dominated by computing the
gradient. In our situation, however, this is not true: the cost is dominated by the TT-SVD procedure. We
therefore stick to more classical gradient descent methods. In particular, the function \(L(\Theta)\) can be
optimized well with conjugate gradient descent using Armijo backtracking line search.

Let’s now see all of this in practice. Let’s train a discretized function \(f_\Theta\) represented by a tensor
train on some data using the technique described above. We will do this on a real dataset: the airfoil
self-noise dataset. This NASA dataset contains
experimental data about the self-noise of airfoils in a wind tunnel, originally used to optimize wing shapes.
We can do the fitting and optimization using my `ttml` package. Let’s use a rank 5 tensor train with 10 grid
points for each feature.

```
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Load the data
airfoil_data = pd.read_csv(
    "airfoil_self_noise.dat", sep="\t", header=None
).to_numpy()
y = airfoil_data[:, 5]
X = airfoil_data[:, :5]
N, d = X.shape
print(f"Dataset has {N=} samples and {d=} features.")

# Do train-test split, and scale data to interval [0,1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=179
)
scaler = MinMaxScaler(clip=True)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define grid, and find associated indices for each data point
num_intervals = 10
grids = [np.linspace(1 / num_intervals, 1, num_intervals) for _ in range(d)]
tensor_shape = tuple(len(grid) for grid in grids)
idx_train = np.stack(
    [np.searchsorted(grid, X_train[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)
idx_test = np.stack(
    [np.searchsorted(grid, X_test[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)

# Initialize the tensor train
np.random.seed(179)
rank = 5
tensor_train = TensorTrain.random(tensor_shape, rank)

# Optimize the tensor train using iterative method
optimizer = TTLS(tensor_train, y_train, idx_train)
train_losses = []
test_losses = []
for i in range(100):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=y_test, idx=idx_test)
    test_losses.append(test_loss)

plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
```

```
Dataset has N=1503 samples and d=5 features.
Final training loss: 15.3521
Final test loss: 54.4698
```

We see a similar training profile to the matrix completion case. Let’s see now how this estimator compares to a random forest trained on the same data:

```
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
```

```
Random forest test loss: 3.2568
```

The random forest has a loss of around `3.3`, but the discretized function has a loss of around `54.5`! That gap in performance is completely unacceptable. We could try to improve it by increasing the number of grid points, and by tweaking the rank of the tensor train. However, it will still come nowhere close to the performance of a random forest, even with its default parameters. Even the *training error* of the discretized function is much worse than the *test error* of the random forest.

**Why is it so bad?** *Bad initialization!*

Recall that a gradient descent method converges to a *local* minimum of the function. Usually we hope that whatever local minimum we converge to is ‘good’. Indeed for neural networks we see that, especially if we use a lot of parameters, most local minima found by stochastic gradient descent are quite good, and give a low train *and* test error. This is not true for our discretized function. We converge to local minima that have both bad train and test error.

**The solution?** *Better initialization!*

Instead of initializing the tensor trains *randomly*, we can learn from other machine learning estimators. We
fit our favorite machine learning estimator (e.g. a neural network) to the training data. This gives a function
\(g\colon [0,1]^d\to \mathbb R\). This function is defined for *any* input, not just for the training/test data
points. Therefore we can try to first fit our discretized function \(f_\Theta\) to match \(g\), i.e. we solve the
following minimization problem:

One way to solve this minimization problem is by first (randomly) sampling a lot of new data points
\((x_1,\dots,x_N)\in [0,1]^d\) and then fitting \(f_\Theta\) to these data points with labels
\((g(x_1),\dots,g(x_N))\). This is essentially *data augmentation*, and can drastically increase the *number* of
data points available for training. With more training data, the function \(f_\Theta\) will indeed converge to a
better local minimum.

While data augmentation does improve performance, we can do better. We don’t need to *randomly* sample data
points \((x_1,\dots,x_N)\in[0,1]^d\). Instead we can *choose* good points to sample; points that give us the
most information on how to efficiently update the tensor train. This is essentially the idea behind the
*tensor train cross approximation* algorithm, or TT-Cross for short. Using TT-Cross we can quickly and
efficiently get a good approximation to the minimization problem \(\min_\Theta \|f_\Theta - g\|^2\).

We could stop here. If \(g\) models our data really well, and \(f_\Theta\) approximates \(g\) really well, then we
should be happy. Like the matrix completion model, discretized functions based on tensor trains are *fast* and
are *memory efficient*. Therefore we can make an approximation of \(g\) that uses less memory and can make
faster predictions! However, the model \(g\) really should be used for *initialization* only. Usually \(f_\Theta\)
actually doesn’t do a great job of approximating \(g\), but if we first approximate \(g\), and *then* use a
gradient descent algorithm to improve \(f_\Theta\) even further, we end up with something much more competitive.

Let’s see this in action. This is actually much easier than what we did before, because I wrote the `ttml` package specifically for this use case.

```
from ttml.ttml import TTMLRegressor

# Use random forest as base estimator
forest = RandomForestRegressor()

# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None)
tt.fit(X_train, y_train, X_val=X_test, y_val=y_test)
y_pred = tt.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss: {test_loss:.4f}")

# The forest is fit on the same data during fitting of tt,
# so let's also report how well the forest does
y_pred_forest = forest.predict(X_test)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss: {test_loss_forest:.4f}")

# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
```

```
TTML test loss: 2.8970
Random forest test loss: 3.2568
```

We see that using a random forest for initialization gives a huge improvement to both training and test loss. In fact, the final test loss is better than that of the random forest itself! On top of that, this estimator doesn’t use many parameters:

```
print(f"TT uses {tt.ttml_.num_params} parameters")
```

```
TT uses 1356 parameters
```

Let’s compare that to the random forest. If we look under the hood, the scikit-learn implementation of random forests stores 8 parameters per node in each tree in the forest. This is inefficient, and you really only *need* 2 parameters per node, so let’s use that.

```
num_params_forest = sum(
    len(tree.tree_.__getstate__()["nodes"]) * 2 for tree in forest.estimators_
)
print(f"Forest uses {num_params_forest} parameters")
```

```
Forest uses 303180 parameters
```

That’s 1356 parameters vs. more than 300,000 parameters! What about my claim about prediction speed? Let’s compare the time it takes both estimators to predict 1 million samples. We do this by simply repeating the training data until we have 1 million samples.

```
from time import perf_counter_ns

target_num = int(1e6)
n_copies = int(target_num // len(X_train)) + 1
X_one_million = np.repeat(X_train, n_copies, axis=0)[:target_num]
print(f"{X_one_million.shape=}")

time_before = perf_counter_ns()
tt.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by TT: {time_taken:.0f}ms")

time_before = perf_counter_ns()
forest.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by Forest: {time_taken:.0f}ms")
```

```
X_one_million.shape=(1000000, 5)
Time taken by TT: 430ms
Time taken by Forest: 2328ms
```

While not by orders of magnitude, we see that the tensor train model is faster. You might be thinking that this is just because the tensor train has fewer parameters, but this is not the case. Even if we use a very high-rank tensor train with high-dimensional data, it is still going to be fast. The speed scales really well, and will beat most conventional machine learning estimators.

With good initialization, the model based on discretized functions performs really well. On our test dataset the
model is fast, uses few parameters, and beats a random forest in test loss (in fact, it is *the best
estimator* I have found so far for this problem). This is great! I should publish a paper at NeurIPS and get a
job at Google! Well… let’s not get ahead of ourselves. It performs well on *this particular dataset*, yes,
but how does it fare on other data?

As we shall see, it doesn’t do all that well actually. The airfoil self-noise dataset is a very particular dataset on which this algorithm excels. The model seems to perform well on data that can be described by a somewhat smooth function, and doesn’t deal well with the noisy and stochastic nature of most data we encounter in the real world. As an example let’s repeat the experiment, but let’s first add some noise:

```
from ttml.ttml import TTMLRegressor

X_noise_std = 1e-6
X_train_noisy = X_train + np.random.normal(0, X_noise_std, size=X_train.shape)
X_test_noisy = X_test + np.random.normal(scale=X_noise_std, size=X_test.shape)

# Use random forest as base estimator
forest = RandomForestRegressor()

# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None, opt_steps=50)
tt.fit(X_train_noisy, y_train, X_val=X_test_noisy, y_val=y_test)
y_pred = tt.predict(X_test_noisy)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss (noisy): {test_loss:.4f}")

# The forest is fit on the same data during fitting of tt,
# so let's also report how well the forest does
y_pred_forest = forest.predict(X_test_noisy)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss (noisy): {test_loss_forest:.4f}")

# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend();
```

```
TTML test loss (noisy): 7.1980
Random forest test loss (noisy): 5.1036
```

Even a tiny bit of noise in the training data can severely degrade the model. We see that it starts to overfit a lot. This is because my algorithm tries to automatically find a ‘good’ discretization of the data, not just a uniform discretization as we have discussed in our 2D example (i.e. equally spacing all the grid cells). Some of the variables in this dataset are however categorical, and a small amount of noise makes it much more difficult to automatically detect a good way to discretize them.

The model has a lot of hyperparameters we won’t go into now, and playing with them does help with overfitting. Furthermore, the noisy data we show here is perhaps not very realistic. However, the fact remains that the model (at least the way it’s currently implemented) is not very robust to noise. In particular, the model is very sensitive to the discretization of the feature space used.

Right now we don’t have anything better than simple heuristics for finding discretizations of the feature space. Since the loss function depends in a very discontinuous way on the discretization, optimizing the discretization is difficult. Perhaps we can use an algorithm to adaptively split and merge thresholds used in the discretization, or use some kind of clustering algorithm for discretization. I have tried things along those lines, but getting it to work well is difficult. I think that with more study the problem of finding a good discretization can be solved, but it’s not easy.
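As a purely illustrative sketch (this is not the heuristic used in my implementation), one simple way to discretize a single feature is to place thresholds at quantiles of the data, so that each cell contains roughly the same number of samples:

```python
import numpy as np

def quantile_thresholds(x, n_bins):
    """Thresholds that put roughly the same number of samples in each bin."""
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(x, qs)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
thresholds = quantile_thresholds(x, 4)
bins = np.searchsorted(thresholds, x)  # bin index per sample, in {0, 1, 2, 3}
```

A heuristic like this is robust to additive noise, but it completely ignores the labels, which is exactly why finding a discretization that is good *for the loss* is the hard part.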

We looked at discretized functions and their use in supervised machine learning. In higher dimensions discretized functions are parametrized by tensors, which we can represent efficiently using tensor trains. The tensor train can be optimized directly on the data to produce a potentially useful machine learning model. It is both very fast, and doesn’t use many parameters. In order to initialize it well, we can first fit an auxiliary machine learning model on the same data, and then sample predictions from that model to effectively increase the amount of training data. This model performs really well on some datasets, but in general it is not very robust to noise. As a result, without further improvements, the model will only be useful in a select number of cases. On the other hand, I really think that the model does have a lot of potential, once some of its drawbacks are fixed.

The linear least-squares problem is one of the most common minimization problems we encounter. It takes the following form:

\[\min_x \|Ax-b\|^2\]Here \(A\) is an \(n\times n\) matrix, and \(x,b\in\mathbb R^{n}\) are vectors. If \(A\) is invertible, then this
problem has a simple, unique solution: \(x = A^{-1}b\). However, there are two big reasons why we should *almost never*
use \(A^{-1}\) to solve the least-squares problem in practice:

- It is expensive to compute \(A^{-1}\).
- This solution is numerically unstable.

Assuming \(A\) doesn’t have any useful structure, the first point is not that bad. Solving the least-squares problem in
a smart way costs \(O(n^3)\), and doing it using matrix-inversion also costs \(O(n^3)\), just with a larger hidden
constant. The real killer is the instability. To see this in action, let’s take a matrix that is *almost
singular*, and see what happens when we solve the least-squares problem.

```
import numpy as np
np.random.seed(179)
n = 20
# Create almost singular matrix
A = np.eye(n)
A[0, 0] = 1e-20
A = A @ np.random.normal(size=A.shape)
# Random vector b
b = A @ np.random.normal(size=(n,)) + 1e-3 * np.random.normal(size=n)
# Solve least-squares with inverse
A_inv = np.linalg.inv(A)
x = A_inv @ b
error = np.linalg.norm(A @ x - b) ** 2
print(f"error for matrix inversion method: {error:.4e}")
# Solve least-squares with dedicated routine
x = np.linalg.lstsq(A, b, rcond=None)[0]
error = np.linalg.norm(A @ x - b) ** 2
print(f"error for dedicated method: {error:.4e}")
```

```
error for matrix inversion method: 3.6223e+02
error for dedicated method: 2.8275e-08
```

In this case we took a 20x20 matrix \(A\) with ones on the diagonal, except for one entry where it has value `1e-20`, and then we shuffled everything around by multiplying by a random matrix. The entries of \(A\) are not so big, but the entries of \(A^{-1}\) will be *gigantic*. As a result, the solution obtained as \(x=A^{-1}b\) does not satisfy \(Ax=b\) in practice. The solution found by the `np.linalg.lstsq` routine is much better.

The reason that the inverse-matrix method fails badly in this case can be summarized using the *condition
number* \(\kappa(A)\). It expresses, in the worst case, how much a small change in \(b\) gets amplified in the
solution \(x=A^{-1}b\). The condition number thus gives a notion of how much numerical errors get amplified
when we solve the linear system. We can compute it as the ratio between the largest and smallest singular
values of the matrix \(A\):

\[\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}\]In the case above the condition number is really big:

```
np.linalg.cond(A)
```

```
1.1807555508404976e+16
```

Large condition numbers mean that *any* numerical method is going to struggle to give a good solution, but for
numerically unstable methods the problem is a lot worse.
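As a quick sanity check (on a fresh random matrix, not the near-singular \(A\) above), the value reported by `np.linalg.cond` is indeed the ratio of the largest to the smallest singular value:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 30))
# np.linalg.svd returns the singular values in descending order
sigma = np.linalg.svd(M, compute_uv=False)
print(np.isclose(sigma[0] / sigma[-1], np.linalg.cond(M)))  # → True
```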

While the numerical stability of algorithms is a fascinating topic, it is not what we came here for today.
Instead, let’s revisit the first reason why using matrix inversion for solving linear problems is bad. I
mentioned that matrix inversion and better alternatives take \(O(n^3)\) to solve the least squares problem
\(\min_x\|Ax-b\|^2\), *if there is no extra structure on* \(A\) *that we can exploit*.

What if there *is* such structure? For example, what if \(A\) is a huge sparse matrix? The Netflix
dataset we considered in this blog post, for instance, has size 480,189 x 17,769. Putting aside the
fact that it is not square, inverting matrices of that size is infeasible. Moreover, the inverse of a sparse
matrix isn’t necessarily sparse, so we would lose that valuable structure as well.
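To illustrate that last point with a toy example (not the Netflix matrix): even a very sparse matrix typically has a dense inverse. Here a bidiagonal matrix with 11 non-zero entries has an inverse whose entire upper triangle is non-zero:

```python
import numpy as np

# An upper bidiagonal matrix: 2n - 1 = 11 non-zero entries
n = 6
A = np.eye(n) + 0.5 * np.eye(n, k=1)
A_inv = np.linalg.inv(A)
# The inverse has a fully dense upper triangle: n(n+1)/2 = 21 non-zero entries
print(np.count_nonzero(A), np.count_nonzero(np.abs(A_inv) > 1e-12))  # → 11 21
```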

Another example arose in my first post on deconvolution. There we tried to solve the linear problem

\[\min_x \|k * x -y\|^2\]where \(k * x\) denotes *convolution*. Convolution is a linear operation, but requires only \(O(n\log n)\) to
compute, whereas writing it out as a matrix would require \(n\times n\) entries, which can quickly become too
large.
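To give a sense of why convolution is so cheap: by the convolution theorem, circular convolution can be computed with the FFT in \(O(n\log n)\), instead of the \(O(n^2)\) of the direct sum. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
k = rng.normal(size=n)
x = rng.normal(size=n)
# Circular convolution via the FFT: O(n log n)
conv_fft = np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)))
# Direct O(n^2) circular convolution for comparison
conv_direct = np.array(
    [sum(k[j] * x[(i - j) % n] for j in range(n)) for i in range(n)]
)
print(np.allclose(conv_fft, conv_direct))  # → True
```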

In situations like this, we have no choice but to devise an algorithm that makes use of the structure of \(A\).
What the two situations above have in common is that storing \(A\) as a dense matrix is expensive, but computing
matrix-vector products \(Ax\) is cheap. The algorithm we are going to come up with is going to be *iterative*;
we start with some initial guess \(x_0\), and then improve it until we find a solution of the desired accuracy.

We don’t have much to work with; we have a vector \(x_0\) and the ability to compute matrix-vector products.
Crucially, we assumed our matrix \(A\) is *square*. This means that \(x_0\) and \(Ax_0\) have the same shape, and
therefore we can also compute \(A^2x_0\), or in fact \(A^rx_0\) for any \(r\). The idea is then to try to express
the solution to the least-squares problem as a linear combination of the vectors

\[x_0,\ Ax_0,\ A^2x_0,\ \dots,\ A^{m-1}x_0.\]
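Generating these vectors \(x_0, Ax_0, \dots, A^{m-1}x_0\) only requires repeated matrix-vector products; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
A = rng.normal(size=(n, n))
x0 = rng.normal(size=n)
# Build the Krylov vectors x0, A x0, ..., A^{m-1} x0 one product at a time,
# storing them as the rows of K
K = np.empty((m, n))
K[0] = x0
for j in range(1, m):
    K[j] = A @ K[j - 1]
```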

This results in a class of algorithms known as *Krylov subspace methods*. Before diving further into how they work, let’s see one in action. We take a 2500 x 2500 sparse matrix with 5000 non-zero entries (which includes the entire diagonal).

```
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt
from time import perf_counter_ns
np.random.seed(179)
n = 2500
N = n
shape = (n, n)
# Create random sparse (n, n) matrix with N non-zero entries
coords = np.random.choice(n * n, size=N, replace=False)
coords = np.unravel_index(coords, shape)
values = np.random.normal(size=N)
A_sparse = scipy.sparse.coo_matrix((values, coords), shape=shape)
A_sparse = A_sparse.tocsr()
A_sparse += scipy.sparse.eye(n)
A_dense = A_sparse.toarray()
b = np.random.normal(size=n)
b = A_sparse @ b
# Solve using np.linalg.lstsq
time_before = perf_counter_ns()
x = np.linalg.lstsq(A_dense, b, rcond=None)[0]
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_dense @ x - b) ** 2
print(f"Using dense solver: error: {error:.4e} in time {time_taken:.1f}ms")
# Solve using inverse matrix
time_before = perf_counter_ns()
x = np.linalg.inv(A_dense) @ b
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_dense @ x - b) ** 2
print(f"Using matrix inversion: error: {error:.4e} in time {time_taken:.1f}ms")
# Solve using GMRES
time_before = perf_counter_ns()
x = scipy.sparse.linalg.gmres(A_sparse, b, tol=1e-8)[0]
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_sparse @ x - b) ** 2
print(f"Using sparse solver: error: {error:.4e} in time {time_taken:.1f}ms")
```

```
Using dense solver: error: 1.4449e-25 in time 2941.5ms
Using matrix inversion: error: 2.4763e+03 in time 507.0ms
Using sparse solver: error: 2.5325e-13 in time 6.4ms
```

As we see above, the sparse solver handles this problem in a fraction of the time, and the difference only grows with larger matrices. Above we used scipy’s GMRES routine; the algorithm itself is quite simple. It constructs an orthonormal basis of the Krylov subspace \(\mathcal K_m(A,r_0)\), where \(r_0 = b - Ax_0\) is the initial residual, and then finds the best solution in this subspace by solving a small \((m+1)\times m\) linear system. Before figuring out the details, below is a simple implementation:

```
def gmres(linear_map, b, x0, n_iter):
    # Initialization
    n = x0.shape[0]
    H = np.zeros((n_iter + 1, n_iter))
    r0 = b - linear_map(x0)
    beta = np.linalg.norm(r0)
    V = np.zeros((n_iter + 1, n))
    V[0] = r0 / beta
    for j in range(n_iter):
        # Compute next Krylov vector
        w = linear_map(V[j])
        # Gram-Schmidt orthogonalization
        for i in range(j + 1):
            H[i, j] = np.dot(w, V[i])
            w -= H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        # Add new vector to basis
        V[j + 1] = w / H[j + 1, j]
    # Find best approximation in the basis V
    e1 = np.zeros(n_iter + 1)
    e1[0] = beta
    y = np.linalg.lstsq(H, e1, rcond=None)[0]
    # Convert result back to full basis and return
    x_new = x0 + V[:-1].T @ y
    return x_new

# Try out the GMRES routine
time_before = perf_counter_ns()
x0 = np.zeros(n)
linear_map = lambda x: A_sparse @ x
x = gmres(linear_map, b, x0, 50)
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_sparse @ x - b) ** 2
print(f"Using GMRES: error: {error:.4e} in time {time_taken:.1f}ms")
```

```
Using GMRES: error: 1.1039e-15 in time 12.9ms
```

This clearly works; it’s not as fast as the `scipy` implementation of the same algorithm, but we’ll do something about that soon.

Let’s take a more detailed look at what the GMRES algorithm is doing. We iteratively define an orthonormal basis \(V_m = \{v_0,v_1,\dots,v_{m-1}\}\). We start with \(v_0 = r_0 / \|r_0\|\), where \(r_0 = b-Ax_0\) is the *residual* of the initial guess \(x_0\). In each iteration we then set \(w = A v_j\), orthogonalize it against all previous basis vectors by computing \(w' = w - \sum_{i\leq j} (w^\top v_{i})v_i\), and normalize to obtain \(v_{j+1} = w'/\|w'\|\). Therefore \(V_m\) is an orthonormal basis of the Krylov subspace \(\mathcal K_m(A,r_0)\).

Once we have this basis, we want to solve the minimization problem:

\[\min_{x\in \mathcal K_m(A,r_0)} \|A(x_0+x)-b\|\]Since \(V_m\) is a basis, we can write \(x = V_m y\) for some \(y\in \mathbb R^m\). Also note that in this basis \(b-Ax_0 = r_0 = \beta v_0 = \beta V_m e_1\) where \(\beta = \|r_0\|\) and \(e_1= (1,0,\dots,0)\). This allows us to rewrite the minimization problem:

\[\min_{y\in\mathbb R^m} \|AV_my - \beta V_me_1\|\]To solve this minimization problem we need one more trick. In the algorithm we computed a matrix \(H\); its entries are precisely the coefficients of the Gram-Schmidt orthogonalization:

\[H_{ij} = v_i^\top A v_j \quad (i \leq j),\]while \(H_{j+1,j}\) is the norm of \(Av_j\) after orthogonalizing it against \(v_0,\dots,v_j\). Hence \(A v_j = \sum_{i=0}^{j+1} H_{ij}v_i\), giving the matrix equality \(AV_m = V_{m+1}H\), where \(V_{m+1}\) is the matrix with columns \(v_0,\dots,v_m\). Now we can rewrite the minimization problem even further and get

\[\min_{y\in\mathbb R^m} \|V_{m+1}(Hy - \beta e_1)\| = \min_{y\in\mathbb R^m} \|Hy - \beta e_1\|\]The minimization problem is therefore reduced to an \((m+1)\times m\) least-squares problem! The cost of solving it is \(O(m^3)\), and as long as we don’t use too many steps \(m\), this cost is very reasonable. After solving for \(y\), we get the estimate \(x = x_0 + V_m y\).
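We can verify the Arnoldi relation between \(A\), the basis vectors, and \(H\) numerically on a small random example (storing, as in the code above, the basis vectors as the rows of `V`):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 6
A = rng.normal(size=(n, n))
b = rng.normal(size=n)
V = np.zeros((m + 1, n))
H = np.zeros((m + 1, m))
V[0] = b / np.linalg.norm(b)
for j in range(m):
    w = A @ V[j]
    # Gram-Schmidt against all previous basis vectors
    for i in range(j + 1):
        H[i, j] = w @ V[i]
        w -= H[i, j] * V[i]
    H[j + 1, j] = np.linalg.norm(w)
    V[j + 1] = w / H[j + 1, j]
# The columns of A @ V[:m].T are the vectors A v_j; each equals sum_i H[i, j] v_i
print(np.allclose(A @ V[:m].T, V.T @ H))  # → True
```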

In the current implementation of GMRES we specify the number of steps in advance, which is not ideal. If we converge to the right solution in fewer steps, then we are doing unnecessary work. If we don’t get a satisfying solution after the specified number of steps, we might need to start over. This is however not a big problem; we can use the output \(x=x_0+V_my\) as the new initialization when we restart.

This gives a nice recipe for *GMRES with restarting*. We run GMRES for \(m\) steps with \(x_i\) as initialization to get a new estimate \(x_{i+1}\). We then check if \(x_{i+1}\) is good enough, if not, we repeat the GMRES procedure for another \(m\) steps.

It is possible to get a good estimate of the residual norm after *each* step of GMRES, not just every \(m\) steps. However, this is relatively technical to implement, so we will just consider the variation of GMRES with restarting.
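Putting this together, a self-contained sketch of GMRES with restarting might look as follows. The inner cycle mirrors the implementation above, and `tol` and `max_restarts` are illustrative parameters of my choosing:

```python
import numpy as np

def gmres_cycle(linear_map, b, x0, m):
    """One cycle of m GMRES (Arnoldi) steps starting from x0."""
    n = x0.shape[0]
    r0 = b - linear_map(x0)
    beta = np.linalg.norm(r0)
    V = np.zeros((m + 1, n))
    H = np.zeros((m + 1, m))
    V[0] = r0 / beta
    for j in range(m):
        w = linear_map(V[j])
        for i in range(j + 1):
            H[i, j] = w @ V[i]
            w -= H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        V[j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y = np.linalg.lstsq(H, e1, rcond=None)[0]
    return x0 + V[:-1].T @ y

def gmres_restarted(linear_map, b, m=20, tol=1e-10, max_restarts=50):
    """Restart GMRES every m steps until the relative residual is below tol."""
    x = np.zeros_like(b)
    for _ in range(max_restarts):
        x = gmres_cycle(linear_map, b, x, m)
        if np.linalg.norm(linear_map(x) - b) <= tol * np.linalg.norm(b):
            break
    return x
```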

How often should we restart? This really depends on the problem we’re trying to solve, since there is a
trade-off. More steps in between each restart will typically result in convergence in fewer steps, *but* it is
more expensive and also requires more memory. The computational cost scales as \(O(m^3)\), and the memory cost
scales linearly in \(m\) (if the matrix size \(n\) is much bigger than \(m\)). Let’s see this trade-off in action on a model problem.

Recall that the deconvolution problem is of the following form:

\[\min_x \|k * x -y\|^2\]for a fixed *kernel* \(k\) and signal \(y\). The convolution operation \(k*x\) is linear in \(x\), and we can
therefore treat this as a linear least-squares problem and solve it using GMRES. The operation \(k*x\) can be
written in matrix form as \(Kx\), where \(K\) is a matrix. For large images or signals, the matrix \(K\) can be
gigantic, and we never want to explicitly store \(K\) in memory. Fortunately, GMRES only cares about
matrix-vector products \(Kx\), making this a very good candidate to solve with GMRES.

Let’s consider the problem of sharpening (deconvolving) a 128x128 picture blurred using Gaussian blur. To make the problem more interesting, the kernel \(k\) used for deconvolution will be slightly different from the kernel used for blurring. This is inspired by the blind deconvolution problem, where we not only have to find \(x\), but also the kernel \(k\) itself.

We solve this problem with GMRES using different number of steps between restarts, and plot how the error evolves over time.

```
from matplotlib import image
from utils import random_motion_blur
from scipy.signal import convolve2d
# Define the Gaussian blur kernel
def gaussian_psf(sigma=1, N=9):
    gauss_psf = np.arange(-N // 2 + 1, N // 2 + 1)
    gauss_psf = np.exp(-(gauss_psf ** 2) / (2 * sigma ** 2))
    gauss_psf = np.einsum("i,j->ij", gauss_psf, gauss_psf)
    gauss_psf = gauss_psf / np.sum(gauss_psf)
    return gauss_psf
# Load the image and blur it
img = image.imread("imgs/vitus128.png")
gauss_psf_true = gaussian_psf(sigma=1, N=11)
gauss_psf_almost = gaussian_psf(sigma=1.05, N=11)
img_blur = convolve2d(img, gauss_psf_true, mode="same")
# Define the convolution linear map
linear_map = lambda x: convolve2d(
    x.reshape(img.shape), gauss_psf_almost, mode="same"
).reshape(-1)
# Apply GMRES for different restart frequencies and measure time taken
total_its = 2000
n_restart_list = [20, 50, 200, 500]
losses_dict = dict()
for n_restart in n_restart_list:
    time_before = perf_counter_ns()
    b = img_blur.reshape(-1)
    x0 = np.zeros_like(b)
    x = x0
    losses = []
    for _ in range(total_its // n_restart):
        x = gmres(linear_map, b, x, n_restart)
        error = np.linalg.norm(linear_map(x) - b) ** 2
        losses.append(error)
    time_taken = (perf_counter_ns() - time_before) / 1e9
    print(f"Best loss for {n_restart} restart frequency is {error:.4e} in {time_taken:.2f}s")
    losses_dict[n_restart] = losses
```

```
Best loss for 20 restart frequency is 9.3595e-16 in 11.32s
Best loss for 50 restart frequency is 2.4392e-22 in 11.71s
Best loss for 200 restart frequency is 6.3063e-28 in 17.34s
Best loss for 500 restart frequency is 6.9367e-28 in 30.50s
```

We observe that with all restart frequencies we converge to a result with very low error. The larger the
number of steps between restarts, the fewer total steps we need to converge. Remember however that the cost of
GMRES rises as \(O(m^3)\) with the number of steps \(m\) between restarts, so a larger number of steps is not
always better. For example, \(m=20\) and \(m=50\) produced almost identical runtimes, but for \(m=200\) the runtime
for 2000 total steps is already significantly bigger, and the effect is even bigger for \(m=500\). This means
that if we want to converge as fast as possible *in terms of runtime*, we’re best off with somewhere between
\(m=50\) and \(m=200\) steps between each restart.

If we do some simple profiling, we see that almost all of the time in this function is spent on the 2D convolution. Indeed, this is why the runtime does not seem to scale as \(O(m^3)\) for the values of \(m\) we tried above. It simply takes a while before the \(O(m^3)\) factor becomes dominant over the time spent on matrix-vector products.

This also means that it should be straightforward to speed up – we just need to do the convolution on a GPU. It is not as simple as that however; if we just do the convolution on GPU and the rest of the operations on CPU, then the bottleneck quickly becomes moving the data between CPU and GPU (unless we are working on a system where CPU and GPU share memory).

Fortunately the entire GMRES algorithm is not so complex, and we can use hardware acceleration by simply translating the algorithm to use a fast computational library. There are several such libraries available for Python:

- TensorFlow
- PyTorch
- DASK
- CuPy
- JAX
- Numba

In this context CuPy might be the easiest to use; its syntax is very similar to numpy. However, I would also like to make use of JIT (just-in-time) compilation, particularly since this can limit unnecessary data movement. Furthermore, which low-level CUDA functions are best to call really depends on the situation (especially for something like convolution), and JIT compilation can offer significant optimizations here.

TensorFlow, DASK and PyTorch are really focused on machine learning and neural networks, and the way we interact with these libraries might not be the best for this kind of algorithm. In fact, I tried to make an efficient GMRES implementation using these libraries, and I really struggled; I feel these libraries simply aren’t the right tool for this job.

Numba is also great, I could basically feed it the code I already wrote and it would probably compile the
function and make it several times faster *on CPU*. Unfortunately, the support for GPU is still lacking quite
a bit in Numba, and we would therefore still leave quite a bit of performance on the table.

In the end we will implement it in JAX. Like CuPy, it has an API very similar to numpy, which makes the translation easy. However, it also supports JIT compilation, meaning we can potentially get much faster functions. Without further ado, let’s implement the GMRES algorithm in JAX and see what kind of speedup we can get.

```
import jax.numpy as jnp
import jax
# Define the linear operator
img_shape = img.shape
def do_convolution(x):
    return jax.scipy.signal.convolve2d(
        x.reshape(img_shape), gauss_psf_almost, mode="same"
    ).reshape(-1)

def gmres_jax(linear_map, b, x0, n_iter):
    # Initialization
    n = x0.shape[0]
    r0 = b - linear_map(x0)
    beta = jnp.linalg.norm(r0)
    V = jnp.zeros((n_iter + 1, n))
    V = V.at[0].set(r0 / beta)
    H = jnp.zeros((n_iter + 1, n_iter))

    def loop_body(j, pair):
        """One basic step of GMRES; compute new Krylov vector and orthogonalize."""
        H, V = pair
        w = linear_map(V[j])
        h = V @ w
        v = w - (V.T) @ h
        v_norm = jnp.linalg.norm(v)
        H = H.at[:, j].set(h)
        H = H.at[j + 1, j].set(v_norm)
        V = V.at[j + 1].set(v / v_norm)
        return H, V

    # Do n_iter iterations of the basic GMRES step
    H, V = jax.lax.fori_loop(0, n_iter, loop_body, (H, V))
    # Solve the linear system in the basis V
    e1 = jnp.zeros(n_iter + 1)
    e1 = e1.at[0].set(beta)
    y = jnp.linalg.lstsq(H, e1, rcond=None)[0]
    # Convert result back to full basis and return
    x_new = x0 + V[:-1].T @ y
    return x_new
b = img_blur.reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 50
# Declare JIT compiled version of gmres_jax
gmres_jit = jax.jit(gmres_jax, static_argnums=[0, 3])
print("Compiling function:")
%time x = gmres_jit(do_convolution, b, x0, n_restart).block_until_ready()
print("\nProfiling functions. numpy version:")
%timeit x = gmres(linear_map, b, x0, n_restart)
print("\nProfiling functions. JAX version:")
%timeit x = gmres_jit(do_convolution, b, x0, n_restart).block_until_ready()
```

```
Compiling function:
CPU times: user 1.94 s, sys: 578 ms, total: 2.51 s
Wall time: 2.01 s
Profiling functions. numpy version:
263 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Profiling functions. JAX version:
9.16 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

With the JAX version running on my GPU, we get a 30x speedup! Not bad, if you ask me. If we run the same code on CPU, we still get a 4x speedup. This means that the version compiled by JAX is already faster in its own right.

The code above may look a bit strange, and there are definitely some things that might need some explanation.
First of all, note that the first time we call `gmres_jit` it takes much longer than the subsequent calls.
This is because the function is JIT (just-in-time) compiled. On the first call, JAX runs through the entire
function and makes a big graph of all the operations that need to be done; it then optimizes (simplifies) this
graph, and compiles it to create a very fast function. This compilation step obviously takes some time, but
the great thing is that we only need to do it once.
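A minimal standalone illustration of this compile-once behavior (assuming JAX is installed):

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    return jnp.sum(x * x)

x = jnp.arange(4.0)
f(x)  # first call: traces the function and compiles it (slow)
f(x)  # later calls with the same input shapes reuse the compiled code (fast)
print(float(f(x)))  # → 14.0
```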

Note the way we create the function `gmres_jit`:

```
gmres_jit = jax.jit(gmres_jax, static_argnums=[0, 3])
```

Here we tell JAX that if the first or the fourth argument changes, the function needs to be recompiled. This is because both these arguments are static Python objects (the first is a function, the fourth is the number of iterations), whereas the other two arguments are arrays.

The shapes of the arrays `V` and `H` depend on the last argument `n_iter`. However, the compiler needs to know the shape of these arrays *at compile time*. Therefore, we need to recompile the function every time `n_iter` changes. The same is true for the `linear_map` argument; the shape of the vector `w` depends on `linear_map` in principle.

Next, consider the fact that there is no more `for` loop in the code; it is instead replaced by

```
H, V = jax.lax.fori_loop(0, n_iter, loop_body, (H, V))
```

We could in fact use a for loop here as well, and it would give an identical result, but it would take much
longer to compile. The reason for this is that, as mentioned, JAX runs through the entire function and makes a
graph of all the operations that need to be done. If we leave in the for loop, then each iteration of the loop
would add more and more operations to the graph (the loop is ‘unrolled’), making a really big graph. By using
`jax.lax.fori_loop` we can skip this, and end up with a much smaller graph to be compiled.
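As a tiny standalone example of `jax.lax.fori_loop` (again assuming JAX is installed), the loop body is traced only once, regardless of the number of iterations:

```python
import jax
import jax.numpy as jnp

def body(i, acc):
    return acc + i

# Computes 0 + 1 + ... + 9 without unrolling ten additions into the graph
total = jax.lax.fori_loop(0, 10, body, jnp.array(0))
print(int(total))  # → 45
```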

One disadvantage of this approach is that the size of all arrays needs to be known at compile time. In the
original algorithm we did not compute `(V.T) @ h`, but rather `(V[:j+1].T) @ h`. Now we can’t do that, because
the size of `V[:j+1]` is not known at compile time. The end result is the same, because at iteration `j` we
have `V[j+1:] = 0`. This does mean that over all the iterations of `j` we end up doing about double the work
for this particular operation. However, because the operation is so much faster on a GPU, this is not a big
problem.

As we can see, writing code for GPUs requires a bit more thought than writing code for CPUs. Sometimes we even end up with less efficient code, but this can be entirely offset by the improved speed of the GPU.

We see above that GMRES provides a very fast and accurate solution to the deconvolution problem. This has a lot to do with the fact that the convolution matrix is very well-conditioned. We can see this by looking at the singular values of this matrix. The convolution matrix for a 128x128 image is a bit too big to work with, but we can see what happens for 32x32 images.

```
import scipy.linalg

N = 11
psf = gaussian_psf(sigma=1, N=N)
img_shape = (32, 32)
def create_conv_mat(psf, img_shape):
    tot_dim = np.prod(img_shape)

    def apply_psf(signal):
        signal = signal.reshape(img_shape)
        return convolve2d(signal, psf, mode="same").reshape(-1)

    conv_mat = np.zeros((tot_dim, tot_dim))
    for i in range(tot_dim):
        signal = np.zeros(tot_dim)
        signal[i] = 1
        conv_mat[i] = apply_psf(signal)
    return conv_mat
conv_mat = create_conv_mat(psf, img_shape)
svdvals = scipy.linalg.svdvals(conv_mat)
plt.plot(svdvals)
plt.yscale('log')
cond_num = svdvals[0]/svdvals[-1]
plt.title(f"Singular values. Condition number: {cond_num:.0f}")
```

As we can see, the condition number is only 4409, which makes the matrix very well-conditioned. Moreover, the singular values decay somewhat gradually. What’s more, the convolution matrix is actually symmetric and positive definite. This makes the linear system relatively easy to solve, and explains why it works so well.

This is because the kernel we use – the Gaussian kernel – is itself symmetric. For a non-symmetric kernel, the situation is more complicated. Below we show what happens for a non-symmetric kernel, the same type as we used before in the blind deconvolution series of blog posts.

```
from utils import random_motion_blur
N = 11
psf_gaussian = gaussian_psf(sigma=1, N=N)
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
img_shape = (32, 32)
# plot the kernels
plt.figure(figsize=(8, 4.5))
plt.subplot(1, 2, 1)
plt.imshow(psf_gaussian)
plt.title("Gaussian kernel")
plt.subplot(1, 2, 2)
plt.imshow(psf)
plt.title("Non-symmetric kernel")
plt.show()
# study the convolution matrix
conv_mat = create_conv_mat(psf, img_shape)
eigs = scipy.linalg.eigvals(conv_mat)
plt.title(f"Eigenvalues")
plt.ylabel("Imaginary part")
plt.xlabel("Real part")
plt.scatter(np.real(eigs), np.imag(eigs), marker=".")
```

We see that the eigenvalues of this convolution matrix are distributed *around* zero. The convolution matrix for the Gaussian kernel is symmetric and positive definite; all its eigenvalues are positive real numbers. GMRES works really well when almost all eigenvalues lie in an ellipse *not containing zero*. That is clearly not the case here, and indeed we will see that GMRES doesn’t work well for this particular problem.
(Note that we now switch to 256x256 images instead of 128x128, since our new implementation of GMRES is much faster.)

```
img = image.imread("imgs/vitus256.png")
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
img_blur = convolve2d(img, psf, mode="same")
img_shape = img.shape
def do_convolution(x):
    res = jax.scipy.signal.convolve2d(
        x.reshape(img_shape), psf, mode="same"
    ).reshape(-1)
    return res
b = img_blur.reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 1000
n_its = 10
losses = []
for _ in range(n_its):
    x = gmres_jit(do_convolution, b, x, n_restart)
    error = np.linalg.norm(do_convolution(x) - b) ** 2
    losses.append(error)
```

Not only does it take many more iterations to converge, the final result is unsatisfactory at best. Clearly, without further modifications, the GMRES method doesn’t work well for deconvolution with non-symmetric kernels.

As mentioned, GMRES works best when the eigenvalues of the matrix \(A\) are in an ellipse not including zero, which is not the case for our convolution matrix. There is fortunately a very simple solution to this: instead of solving the linear least-squares problem

\[\min_x \|Ax - b\|_2^2\]we solve the linear least-squares problem

\[\min_x \|A^\top A x - A^\top b\|^2\]This has the same solution, but the eigenvalues of \(A^\top A\) are better behaved: a matrix of this form is always symmetric positive semi-definite, so all its eigenvalues are real and non-negative. Provided \(A\) is non-singular, they therefore all fit inside an ellipse that doesn’t include zero, and we will get much better convergence with GMRES. In general, we could multiply by any matrix \(B\) to obtain the linear least-squares problem

\[\min_x \|BAx-Bb\|^2\]If we choose \(B\) such that the spectrum (eigenvalues) of \(BA\) is nicer, then we can improve the convergence of GMRES. This trick is called *preconditioning*. Choosing a good *preconditioner* depends a lot on the problem at hand, and is the subject of a lot of research. In this context, \(A^\top\) turns out to function as an excellent preconditioner, as we shall see.
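We can check the claim about the spectrum numerically: for a random non-symmetric matrix the eigenvalues are scattered in the complex plane, while those of \(A^\top A\) are real and non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
eigs_A = np.linalg.eigvals(A)           # generally complex
eigs_AtA = np.linalg.eigvalsh(A.T @ A)  # A^T A is symmetric, so real
print(np.any(np.abs(eigs_A.imag) > 0))  # → True
print(np.all(eigs_AtA >= -1e-10))       # → True (non-negative up to round-off)
```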

To apply this trick to the deconvolution problem, we need to be able to take the transpose of the convolution operation. Fortunately, this is equivalent to convolution with a reflected version \(\overline k\) of the kernel \(k\). That is, we will apply GMRES to the linear least-squares problem

\[\min_x \|\overline k *(k*x) - \overline k * y\|\]Let’s see this in action below.

```
img = image.imread("imgs/vitus256.png")
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
psf_reversed = psf[::-1, ::-1]
img_blur = convolve2d(img, psf, mode="same")
img_shape = img.shape
def do_convolution(x):
    res = jax.scipy.signal.convolve2d(x.reshape(img_shape), psf, mode="same")
    res = jax.scipy.signal.convolve2d(res, psf_reversed, mode="same")
    return res.reshape(-1)
b = jax.scipy.signal.convolve2d(img_blur, psf_reversed, mode="same").reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 100
n_its = 20
# run once to compile
gmres_jit(do_convolution, b, x, n_restart)
time_start = perf_counter_ns()
losses = []
for _ in range(n_its):
    x = gmres_jit(do_convolution, b, x, n_restart)
    error = np.linalg.norm(do_convolution(x) - b) ** 2
    losses.append(error)
time_taken = (perf_counter_ns() - time_start) / 1e9
print(f"Deconvolution in {time_taken:.2f} s")
```

```
Deconvolution in 1.40 s
```

Except for some ringing around the edges, this produces a very good result. Compared to other methods of deconvolution (as discussed in this blog post), it in fact shows far fewer ringing artifacts. It’s pretty fast as well. Even though it takes around 2000 iterations to converge, the difference between the image after 50 steps and after 2000 steps is not that big visually speaking. Let’s see how the solution develops with different numbers of iterations:

```
x0 = jnp.zeros_like(b)
x = x0
results_dict = {}
for n_its in [1, 5, 10, 20, 50, 100]:
    x0 = jnp.zeros_like(b)
    # run once to compile
    gmres_jit(do_convolution, b, x0, n_its)
    time_start = perf_counter_ns()
    for _ in range(10):
        x = gmres_jit(do_convolution, b, x0, n_its)
    time_taken = (perf_counter_ns() - time_start) / 1e7
    results_dict[n_its] = (x, time_taken)
```

After just 100 iterations the result is pretty good, and this takes just 64ms. This makes it a viable method for deconvolution: roughly as fast as Richardson-Lucy deconvolution, but suffering less from boundary artifacts. The regularization methods we discussed in the deconvolution blog posts also work in this setting, and are good to use when there is noise, or when we don’t precisely know the convolution kernel. That is however out of the scope of this blog post.

GMRES is an easy-to-implement, fast and robust method for solving *structured* linear systems, where we only
have access to matrix-vector products \(Ax\). It is often used for solving sparse systems, but as we have
demonstrated, it can also be used for solving the deconvolution problem in a way that is competitive with
existing methods. Sometimes a preconditioner is needed to get good performance out of GMRES, but choosing a
good preconditioner can be difficult. Finally, if we implement GMRES on a GPU it can reach much higher speeds
than on CPU.

Often if we have an \(m\times n\) matrix, we can write it as the product of two
smaller matrices. If such a matrix has *rank* \(r\), then we can write it as the
product of an \(m\times r\) and \(r\times n\) matrix. Equivalently, this is the
*number of linearly independent columns or rows* the matrix has, or if we see
the matrix as a linear map \(\mathbb R^m\to \mathbb R^n\), then it is the
*dimension of the image* of this linear map.
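A quick sanity check of this definition (the sizes below are arbitrary): the product of an \(m\times r\) and an \(r\times n\) matrix generically has rank \(r\).

```
import numpy as np

# The product of a 6x2 and a 2x8 random matrix has rank 2
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 8))
print(np.linalg.matrix_rank(A))  # 2
```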

In practice we can figure out the rank of a matrix by computing its *singular
value decomposition* (SVD). If you studied data science or statistics, then you
have probably seen principal component analysis (PCA); this is very closely
related to the SVD. Using the SVD we can write a matrix \(X\) as a product

\[X = U S V,\]

where \(U\) and \(V\) are orthogonal matrices, and \(S\) is a diagonal matrix. The
values on the diagonal of \(S\) are known as the *singular values* of \(X\). The
matrices \(U\) and \(V\) also have nice interpretations: the columns of \(U\) form an
orthonormal basis of the *column space* of \(X\), and the rows of \(V\) form an
orthonormal basis of the *row space* of \(X\).

In `numpy` we can compute the SVD of a matrix using `np.linalg.svd`. Below we
compute it and verify that indeed \(X = U S V\):

```
import numpy as np
# Generate a random 10x20 matrix of rank 5
m, n, r = (10, 20, 5)
A = np.random.normal(size=(m, r))
B = np.random.normal(size=(r, n))
X = A @ B
# Compute the SVD
U, S, V = np.linalg.svd(X, full_matrices=False)
# Confirm U S V = X
np.allclose(U @ np.diag(S) @ V, X)
```

`True`

Note that we called `np.linalg.svd` with the keyword `full_matrices=False`. If
left to the default value `True`, then in this case `V` would be a \({20\times
20}\) matrix, as opposed to the \(10\times 20\) matrix it is now. Also, `S` is
returned as a 1D array, and we can convert it to a diagonal matrix using
`np.diag`. Finally, the function `np.allclose` checks whether all the entries of
two matrices are almost the same; they will never be exactly the same due to
numerical error.

As mentioned before, we can use the singular values `S` to determine the rank of
the matrix `X`. This is obvious if we plot the singular values:

```
import matplotlib.pyplot as plt
DEFAULT_FIGSIZE = (8, 5)
plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(np.arange(1, len(S) + 1), S, "o")
plt.xticks(np.arange(1, len(S) + 1))
plt.yscale("log")
plt.title("Plot of singular values")
```

We see that the first 5 singular values are roughly the same size, but that the last five singular values are much smaller; on the order of the machine epsilon.
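In code, a common heuristic for this “numerical rank” is to count the singular values above a small tolerance relative to the largest one; this is essentially what `np.linalg.matrix_rank` does. A sketch on a fresh random example:

```
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 10, 20, 5
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
S = np.linalg.svd(X, compute_uv=False)

# Numerical rank: count singular values above a tolerance relative to S[0]
tol = max(X.shape) * np.finfo(X.dtype).eps * S[0]
print(np.sum(S > tol))  # 5
```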

Knowing the matrix is rank 5, can we write it as the product of two rank-5
matrices? Absolutely! And we do this using the SVD, or rather the *truncated
singular value decomposition*. Since the last 5 values of `S` are very close to
zero, we can simply ignore them. This means dropping the last 5 columns of `U`
and the last 5 rows of `V`. Finally, we just need to ‘absorb’ the singular
values into one of the two matrices `U` or `V`. This way we write `X` as the
product of a \(10\times 5\) and a \(5\times 20\) matrix.

```
A = U[:, :r] * S[:r]
B = V[:r, :]
print(A.shape, B.shape)
np.allclose(A @ B, X)
```

`(10, 5) (5, 20)`

`True`

We rarely encounter real-world data that can be *exactly* represented by a low
rank matrix using the truncated SVD. But we can still use the truncated SVD to
get a good *approximation* of the data.
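In fact, by the Eckart-Young theorem the truncated SVD gives the *best* rank-\(r\) approximation, and its spectral-norm error equals the \((r+1)\)-st singular value. A quick numerical check, using a `low_rank_approx` helper like the one shown earlier:

```
import numpy as np

def low_rank_approx(A, r):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 40))
r = 10
B = low_rank_approx(A, r)

# Eckart-Young: the spectral-norm error of the best rank-r approximation
# equals the (r+1)-st singular value of A
S = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - B, ord=2)
print(np.isclose(err, S[r]))  # True
```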

Let us look at the singular values of an image of the St. Vitus church in my hometown. Note that a black-and-white image is really just a matrix.

```
from matplotlib import image
# Load and plot the St. Vitus image
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
img = image.imread("vitus512.png")
img = img / np.max(img) # make entries lie in range [0,1]
plt.imshow(img, cmap="gray")
plt.axis("off")
# Compute and plot the singular values
plt.subplot(1, 2, 2)
plt.title("Singular values")
U, S, V = np.linalg.svd(img)
plt.yscale("log")
plt.plot(S)
```

We see here that the first few singular values are much larger than the rest, followed by a slow decay, and then finally a sharp drop at the very end. Note that there are 512 singular values, because this is a 512x512 image.

Let’s now see what happens if we compress this image as a low-rank matrix using the truncated singular value decomposition. We will look at the image when seen as a rank-10, 20, 50, or 100 matrix.

```
plt.figure(figsize=(12, 12))
# Compute the SVD once and reuse it for every truncation rank
U, S, V = np.linalg.svd(img)
for i, rank in enumerate([10, 20, 50, 100]):
    # Truncate the SVD to the given rank
    img_compressed = U[:, :rank] @ np.diag(S[:rank]) @ V[:rank, :]
    # Plot the image
    plt.subplot(2, 2, i + 1)
    plt.title(f"Rank {rank}")
    plt.imshow(img_compressed, cmap="gray")
    plt.axis("off")
```

We see that even the rank-10 and rank-20 images are quite recognizable, but with heavy artifacts. The rank-50 image looks pretty good, though not as good as the original, while the rank-100 image looks really close to the original.

How big is the compression if we do this? Well, if we write the image as a rank 10 matrix, we need two 512x10 matrices to store the image, which adds up to 10240 parameters, as opposed to the original 262144 parameters; a decrease in storage of more than 25 times! On the other hand, the rank 100 image is only about 2.6 times smaller than the original. Note that this is not a good image compression algorithm; the SVD is relatively expensive to compute, and other compression algorithms can achieve higher compression ratios with less image degradation.
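The arithmetic above is easy to check directly (a minimal sketch, assuming a square \(n\times n\) image):

```
# Parameter counts for a rank-r factorization of an n x n image
n = 512
for r in [10, 100]:
    original = n * n          # dense storage
    compressed = 2 * n * r    # two n x r factors
    print(r, compressed, round(original / compressed, 1))
```

This prints `10 10240 25.6` and `100 102400 2.6`, matching the numbers above.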

The conclusion we can draw from this is that we can use truncated SVD to compress data. However, not all data can be compressed as efficiently by this method. It depends on the distribution of singular values; the faster the singular values decay, the better a low rank decomposition is going to approximate our data. Images are not good examples of data that can be compressed efficiently as a low rank matrix.

One reason why it’s difficult to compress images is because they contain many sharp edges and transitions. Low rank matrices are especially bad at representing diagonal lines. For example, the identity matrix is a diagonal line seen as an image, and it is also impossible to compress using an SVD since all singular values are equal.
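We can verify the claim about the identity matrix directly: all of its singular values equal 1, so no truncation level stands out.

```
import numpy as np

# Singular values of the identity: all exactly 1
S = np.linalg.svd(np.eye(50), compute_uv=False)
print(np.allclose(S, 1.0))  # True
```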

On the other hand, images without any sharp transitions can be approximated quite well using low rank matrices. These kind of images rarely appear as natural images, but rather they can be discrete representations of smooth functions \([0,1]^2 \to\mathbb R\). For example below we show a two-dimensional discretized sum of trigonometric functions and its singular value decomposition.

```
# Make a grid of 100 x 100 values between [0,1]
x = np.linspace(0, 1, 100)
y = np.linspace(0, 1, 100)
x, y = np.meshgrid(x, y)

# A smooth trigonometric function
def f(x, y):
    return np.sin(200 * x + 75 * y) + np.sin(50 * x) + np.cos(100 * y)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
X = f(x, y)
plt.imshow(X)
plt.subplot(1, 2, 2)
U, S, V = np.linalg.svd(X)
plt.plot(S)
plt.yscale("log")
plt.title("Singular values")
print(f"The matrix is approximately of rank: {np.sum(S > 1e-12)}")
```

`The matrix is approximately of rank: 4`

We see that this particular function can be represented by a rank-4 matrix! This is not obvious from looking at the image. In these kinds of situations a low-rank matrix decomposition is much better than many image compression algorithms; in this case we can reconstruct the image using only 8% of the parameters. (Although more advanced image compression algorithms are based on wavelets, and will actually compress this very well.)

Recall that a low-rank matrix approximation can require far fewer parameters than the dense matrix it approximates. One powerful consequence is that we can recover the dense matrix even when we only observe a small part of it, that is, when we have many missing values.

In the case above we can represent the 100x100 matrix \(X\) as the product of a 100x4 matrix \(A\) and a 4x100 matrix \(B\), with 800 parameters in total instead of 10,000. We can actually recover this low-rank decomposition from a small subset of the entries of the big dense matrix. Suppose that we observe the entries \(X_{ij}\) for \((i,j)\) in an index set \(\Omega\). We can recover \(A\) and \(B\) by solving the following least-squares problem:

\[\min_{A,B}\sum_{(i,j)\in \Omega}((AB)_{ij}-X_{ij})^2\]This problem is however non-convex, and not straightforward to solve. Fortunately there is a trick: we can alternately fix \(A\) and optimize over \(B\), and vice versa. This is known as Alternating Least Squares (ALS) optimization, and it works well in this case. If we fix \(A\), observe that the minimization problem decouples into a separate linear least-squares problem for each column of \(B\):

\[\min_{B_{\bullet k}} \sum_{(i,j)\in \Omega,\,j=k} (\langle A_{i\bullet},B_{\bullet k}\rangle-X_{ik})^2\]Below we use this approach to recover the same matrix as before from 2000 data points, and we can see that it does so with very low error:

```
N = 2000
n = 100
r = 4
# Sample N=2000 random indices
Omega = np.random.choice(n * n, size=N, replace=False)
Omega = np.unravel_index(Omega, X.shape)
y = X[Omega]

# Use random initialization for matrices A,B
A = np.random.normal(size=(n, r))
B = np.random.normal(size=(r, n))

def linsolve_regular(A, b, lam=1e-4):
    """Solve linear problem A@x = b with Tikhonov regularization / ridge
    regression"""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

losses = []
for step in range(40):
    loss = np.mean(((A @ B)[Omega] - y) ** 2)
    losses.append(loss)
    # Update B
    for j in range(n):
        B[:, j] = linsolve_regular(A[Omega[0][Omega[1] == j]], y[Omega[1] == j])
    # Update A
    for i in range(n):
        A[i, :] = linsolve_regular(
            B[:, Omega[1][Omega[0] == i]].T, y[Omega[0] == i]
        )

# Plot the input image
plt.figure(figsize=(12, 12))
plt.subplot(2, 2, 1)
plt.title("Input image")
S = np.zeros((n, n))
S[Omega] = y
plt.imshow(S)
# Plot reconstructed image
plt.subplot(2, 2, 2)
plt.title("Reconstructed image")
plt.imshow(A @ B)
# Plot training loss
plt.subplot(2, 1, 2)
plt.title("Mean square error loss during training")
plt.plot(losses)
plt.yscale("log")
plt.xlabel("steps")
plt.ylabel("Mean squared error")
```

Let’s consider a particularly interesting use of matrix completion –
collaborative filtering. Think about how services like Netflix may recommend new
shows or movies to watch. They know which movies you like, and they know which
movies other people like. Netflix then recommends movies that are liked by
people with a similar taste to yours. This is called *collaborative filtering*,
because different people *collaborate* to filter out movies so that we can make
a recommendation.

But can we do this in practice? Well, for every user we can put their personal
ratings of every movie they watched in a big matrix. In this matrix each row
represents a movie, and each column a user. Most users have only seen a small
fraction of all the movies on the platform, so the overwhelming majority of the
entries of this matrix are unknown. Then we apply matrix completion to this
matrix. Each entry of the completed matrix then represents the rating *we think*
the user would give to a movie, even if they have never watched it.

In 2006 Netflix opened a competition with a grand prize of **$1,000,000** (!!)
to solve precisely this problem. The data consists of more than 100 million
ratings by 480,189 users for 17,769 different movies. The size of this dataset
immediately poses a practical problem: if we put it in a matrix with
double-precision floating-point entries, it would require about 68 gigabytes of
RAM. Fortunately we can avoid this problem by using sparse matrices. This makes
the implementation a little harder, but certainly still feasible.
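As a small illustration (with made-up toy ratings, not the actual Netflix data), SciPy’s sparse matrices store only the observed (movie, user, rating) triples:

```
import numpy as np
from scipy.sparse import coo_matrix

# Toy stand-in for the ratings data: (movie, user, rating) triples
movies = np.array([0, 0, 2, 3])
users = np.array([1, 2, 0, 2])
ratings = np.array([5.0, 3.0, 4.0, 1.0])

# Only the 4 observed entries are stored, not the full 4x3 matrix
X = coo_matrix((ratings, (movies, users)), shape=(4, 3)).tocsr()
print(X.nnz, X[2, 0])  # 4 4.0
```

Memory use scales with the number of ratings rather than with the full matrix size.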

We will also need to upgrade our matrix completion algorithm. The algorithm we mentioned before is slow for very large matrices, and suffers from numerical stability problems due to the way it decouples into many smaller linear problems. Recall that we complete a matrix \(X\) by solving the following optimization problem:

\[\min_{A,B}\sum_{(i,j)\in \Omega}((AB)_{ij}-X_{ij})^2.\]We will first rewrite the problem as follows:

\[\min_{A,B}\|P_\Omega(AB) -X\|.\]Here \(P_\Omega\) denotes the operation of setting all entries \((AB)_{ij}\) to zero if \((i,j)\notin \Omega\). In other words, \(P_\Omega\) turns \(AB\) into a sparse matrix with the same sparsity pattern as \(X\). In some sense, the issue with this optimization problem is that only a small part of the entries of \(AB\) affect the objective. We can solve this by adding a new matrix \(Z\) such that \(P_\Omega(Z)=X\), and then using \(A,B\) to approximate \(Z\) instead:

\[\min_{A,B,Z}\|AB-Z\|\quad \text{such that } P_\Omega(Z) = X\]This problem can then be solved using the same alternating least-squares approach we used before. For example, if we fix \(A,B\), then the optimal value of \(Z\) is given by \(Z = AB+X-P_\Omega(AB)\), and at each iteration we can update \(A\) and \(B\) by solving a linear least-squares problem. It is important to note that this way \(Z\) is the sum of a low-rank and a sparse matrix at every step, which allows us to still manipulate it efficiently and store it in memory.
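As a hedged sketch of this iteration on a small *dense* example (the real implementation keeps \(Z\) in sparse-plus-low-rank form and adds regularization; all names here are illustrative):

```
import numpy as np

rng = np.random.default_rng(3)
n, r = 30, 3
# Ground-truth rank-3 matrix, observed on roughly 40% of its entries
X_full = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))
mask = rng.random((n, n)) < 0.4
X = np.where(mask, X_full, 0.0)  # this is P_Omega(X_full)

A = rng.normal(size=(n, r))
B = rng.normal(size=(r, n))
for _ in range(500):
    # Optimal Z given A, B: equal to X on Omega, and to AB elsewhere
    Z = np.where(mask, X, A @ B)
    # Update A and B by ordinary (dense) least squares against Z
    A = Z @ np.linalg.pinv(B)
    B = np.linalg.pinv(A) @ Z

# Relative error on the observed entries
obs_err = np.linalg.norm((A @ B - X)[mask]) / np.linalg.norm(X[mask])
print(obs_err)
```

At Netflix scale one never forms `Z` densely; the sparse-plus-low-rank structure is exactly what makes the products with `Z` affordable.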

Although not very difficult, the implementation of this algorithm is a little too technical for this blog post. Instead we can just look at the results. I used this algorithm to fit matrices \(A\) and \(B\) of rank 5 and of rank 10 to the Netflix prize dataset. I used 3000 iterations of training, taking the better part of a day to train on my computer. I could probably do more, but I’m too impatient. The progress of training is shown below.

```
import os.path

plt.figure(figsize=DEFAULT_FIGSIZE)
DATASET_PATH = "/mnt/games/datasets/netflix/"
for r in [10, 5]:
    model = np.load(os.path.join(DATASET_PATH, f"rank-{r}-model.npz"))
    A = model["X"]
    B = model["Y"]
    train_errors = model["train_errors"]
    test_errors = model["test_errors"]
    plt.plot(np.sqrt(train_errors), label=f"Train rank {r}")
    plt.plot(np.sqrt(test_errors), label=f"Test rank {r}")
plt.legend()
plt.ylim(0.8, 1.5)
plt.xlabel("Training iterations")
plt.ylabel("Root mean squared error (RMSE)");
```

We see that the training error for the rank 5 and rank 10 models are virtually identical, but the test error is lower for the rank 5 model. We can interpret this as the rank 10 model overfitting more, which is often the case for more complex models.

Next, how can we use this model? Well, the rows of the matrix \(A\) correspond to
movies, and the columns of matrix \(B\) correspond to users. So if we want to know
how much user #179 likes movie #2451 (*Lord of the Rings: The Fellowship of the
Ring*), then we compute \(A[2451]\cdot B[:, 179]\):

```
A[2451] @ B[:, 179]
```

`4.411312294862265`

We see that the *expected rating* (out of 5) for this user and movie is about
4.41. So we expect that this user will like this movie, and we may choose to
recommend it.

But we want to find the *best* recommendation for this user. To do this we can
simply compute the product \(A \cdot B[:,179]\), which gives a vector with the
expected rating for every single movie, and then we simply sort. Below we can
see the 5 movies with the highest and lowest expected ratings for this user.

```
import pandas as pd
movies = pd.read_csv(
os.path.join(DATASET_PATH, "movie_titles.csv"),
names=["index", "year", "name"],
usecols=[2],
)
movies["ratings-179"] = A @ B[:, 179]
movies.sort_values("ratings-179", ascending=False)
```

| | name | ratings-179 |
| --- | --- | --- |
| 10755 | Kirby: A Dark & Stormy Knight | 9.645918 |
| 15833 | Paternal Instinct | 7.712654 |
| 15355 | Last Hero In China | 7.689984 |
| 14902 | Warren Miller's: Ride | 7.624472 |
| 2082 | Blood Alley | 7.317524 |
| ... | ... | ... |
| 463 | The Return of Ruben Blades | -6.037189 |
| 12923 | Where the Red Fern Grows 2 | -6.153577 |
| 7067 | Eric Idle's Personal Best | -6.441100 |
| 538 | Rumpole of the Bailey: Series 4 | -6.740144 |
| 4331 | Sugar: Howling of Angel | -7.015818 |

17769 rows × 2 columns

Note that the expected ratings are not between 0 and 5, but can take on any value (in particular non-integer ones). This is not necessarily a problem, because we only care about the relative rating of the movies.

To me, all these movies sound quite obscure. And this makes sense: the model does not take factors such as the popularity of a movie into account. It also ignores a lot of other data that we may know about the user, such as their age, gender, and location. It ignores when a movie was released, and it doesn’t take into account the dates of each user’s ratings. These are all important factors that could significantly improve the quality of the recommendation system.

We could try to modify our matrix completion model to take these factors into account, but it’s not obvious how to do this. There is no need to, however: we can use the matrices \(A\) and \(B\) to augment any data we have about the movie and the user, and then train a new model on top of this data to create something even better.

We can think of the movies as lying in a really high-dimensional space, and the
matrix \(A\) maps this space onto a much smaller space. The same is true for
\(B\) and the ‘space’ of users. We can then use this *embedding* into a
lower-dimensional space as the input of another model.

Unfortunately we don’t have access to more information about the users (due
to obvious privacy concerns), so this is difficult to demonstrate. But the point
is this: the decomposition \(X\approx AB\) is both *interpretable*, and can be
used as a building block for more advanced machine learning models.

In summary we have seen that low-rank matrix decompositions have many useful applications in machine learning. They are powerful because they can be learned using relatively little data, and have the ability to complete missing data. Unlike many other machine learning models, computing low-rank matrix decompositions of data can be done quickly.

Even though they come with some limitations, they can always be used as a building block for more advanced machine learning models. This is because they can give an interpretable, low-dimensional representation of very high-dimensional data. We also didn’t even come close to discussing all their applications, or algorithms on how to find and optimize them.

In the next post I will look at a generalization of low-rank matrix
decompositions: *tensor decompositions*. While more complicated, these
decompositions are even more powerful at reducing the dimensionality of very
high-dimensional data.

Keeping a web, Word, and PDF version all up-to-date and easy to edit seemed like an annoying task. I have plenty of experience with automatically generating PDF documents using LaTeX and Python, so I figured: why should a Word document be any different? Let’s dive into the world of editing Word documents in Python!

Fortunately there is a library for this: `python-docx`. It can be used to create
Word documents from scratch, but stylizing a document is a bit tricky. Instead,
its real power lies in editing pre-made documents. I went ahead and made a
nice-looking CV in Word, and now let’s open this document in `python-docx`. A
Word document is stored as XML under the hood, and a document can have a
complicated tree structure. However, we can create a document and use the
`.paragraphs` attribute to get a complete list of all the paragraphs in the
document. Let’s take a paragraph and print its text content.

```
from docx import Document
document = Document("resume.docx")
paragraph = document.paragraphs[0]
print(paragraph.text)
```

```
Rik Voorhaar
```

Turns out the first paragraph contains my name! Editing this text is very easy;
we just need to assign a new value to the `.text` attribute. Let’s do this and
save the document.

```
paragraph.text = "Willem Hendrik"
document.save("resume_edited.docx")
```

Below is a picture of the resulting change; it unfortunately seems like two additional things happened when editing this paragraph: the font of the edited paragraph changed, and the bar / text box on the right-hand side disappeared completely!

This is no good, but to understand what happened to the text box we need to dig into the XML of the document. We can turn the document into an XML file like so:

```
document = Document("resume.docx")
with open("resume.xml", "w") as f:
    f.write(document._element.xml)
```

It seems the problem was that the text box on the right was nested inside
another object, which is apparently not handled properly. This issue was easy to
fix by modifying the Word document. However, the right bar on the side consists
of 2 text boxes, and the top box with my contact information *does* disappear if
I change the first paragraph. *But* it does not disappear if I change the
second paragraph; it only happens if I change paragraph 1 or 3 (and the latter
is empty). I tried inserting two paragraphs before this particular paragraph,
and changing the style of this particular paragraph, but the issue remains.

Looking at the XML, the issue is clear: the text box element lies nested inside this paragraph! It turned out to be a bit tricky to avoid this, so for now let us try changing the second paragraph instead, swapping the word “Resume” for “Curriculum Vitae”.

```
document = Document("resume.docx")
paragraph = document.paragraphs[1]
print(paragraph.text)
paragraph.text = "Curriculum Vitae"
document.save("CV.docx")
```

```
Resume
```

If we do this there are no problems with text boxes disappearing, but unfortunately the style of this paragraph is still reset. Let’s have a look at how the XML changes when we edit this paragraph. Ignoring irrelevant information, before the change it looks like this:

```
<w:p>
<w:r>
<w:t>R</w:t>
</w:r>
<w:r>
<w:t>esume</w:t>
</w:r>
</w:p>
```

And afterwards it looks like this:

```
<w:p>
<w:r>
<w:t>Curriculum Vitae</w:t>
</w:r>
</w:p>
```

In Word, each paragraph (`<w:p>`) is split up into multiple runs (`<w:r>`). What
we see here is that originally the paragraph consisted of two runs, and after
modifying it, it became a single run. However, it seems that in both cases the
style information is exactly the same, so I don’t understand why the style
changes after modification. Even if I retype the word ‘Resume’ in the original
Word document so that this paragraph becomes a single run, the style *still*
changes after editing, and I still don’t see why this happens when looking at
the XML.

Looking at the source code of `python-docx`, I noticed that when we call
`paragraph.text = ...`, the contents of the paragraph get deleted, and then a
new run is added with the desired text. It is not clear to me where exactly the
style information is stored, but either way there is a simple workaround: we can
simply modify the text of the first *run* in the paragraph, rather than clearing
the entire paragraph and adding a new run. This in fact also works for editing
the first paragraph, where before we had problems with disappearing text boxes:

```
document = Document("resume.docx")
with open('resume.xml', 'w') as f:
f.write(document._element.xml)
# Change 'Rik Voorhaar' for 'Willem Hendrik Voorhaar'
paragraph = document.paragraphs[0]
run = paragraph.runs[1]
run.text = 'Willem Hendrik Voorhaar'
# Change 'Resume' for 'Curriculum Vitae'
paragraph = document.paragraphs[1]
run = paragraph.runs[0]
run.text = 'Curriculum Vitae'
document.save('CV.docx')
```

Doing this changes the text, but leaves all the style information intact. Alright, now we know how to edit text. It’s more tricky than one might expect, but it does work!

Let’s say that next we want to edit the text box on the right-hand side of the document, and add a skill to our list of skills. We’ve been diving deep into the inner workings of Word documents, so it’s fair to say we know how to use Microsoft Word; let’s add the skill “Microsoft Word” to the list.

To do this we first want to figure out in which paragraph this information is stored. We can do this by going through all the paragraphs in the document and looking for the text “Skills”.

```
import re

pattern = re.compile("Skills")
for p in document.paragraphs:
    if pattern.search(p.text):
        print("Found the paragraph!")
        break
else:
    print("Did not find the paragraph :(")
```

```
Did not find the paragraph :(
```

It seems there is unfortunately no matching paragraph! This is because the
paragraph we want is *inside a text box*, and modifying text boxes is not
supported in `python-docx`. This is a known issue, but instead of giving up I
decided to add support for modifying text boxes to `python-docx` myself! It
turned out not to be too difficult to implement, despite my limited knowledge of
both the package and the inner structure of Word documents.

The first step is understanding how text boxes are encoded in the XML. It turns out that the structure is something like this:

```
<mc:AlternateContent>
  <mc:Choice Requires="wps">
    <w:drawing>
      <wp:anchor>
        <a:graphic>
          <a:graphicData>
            <wps:txbx>
              <w:txbxContent>
                ...
              </w:txbxContent>
            </wps:txbx>
          </a:graphicData>
        </a:graphic>
      </wp:anchor>
    </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict>
      <v:textbox>
        <w:txbxContent>
          ...
        </w:txbxContent>
      </v:textbox>
    </w:pict>
  </mc:Fallback>
</mc:AlternateContent>
```

The insides of the two `<w:txbxContent>` elements are exactly identical. The
information is probably stored twice for legacy reasons: a quick Google search
reveals that `wps` is an XML namespace introduced in Office 2010 (WPS is short
for Word Processing Shape), so the text box is stored twice to maintain
backwards compatibility with older Word versions. I’m not sure many people still
use pre-2010 Office versions… Either way, this means that if we want to update
the contents of the text box, we need to do it in two places.

Next we need to figure out how to manipulate these Word objects. My idea is to
create a `TextBox` class that is associated to an `<mc:AlternateContent>`
element, and which ensures that both `<w:txbxContent>` elements are always
updated at the same time. First we make a class encoding a `<w:txbxContent>`
element. For this we can build on the `BlockItemContainer` class already
implemented in `python-docx`. Mixing in this class gives automatic support for
manipulating paragraphs inside of the container.

```
class TextBoxContent(BlockItemContainer):
    pass
```

Given an `<mc:AlternateContent>` object, we can access the two
`<w:txbxContent>` elements using the following XPath specifications:

```
XPATH_CHOICE = "./mc:Choice/w:drawing/wp:anchor/a:graphic/a:graphicData//wps:txbx/w:txbxContent"
XPATH_FALLBACK = "./mc:Fallback/w:pict//v:textbox/w:txbxContent"
```

Then making a rudimentary `TextBox` class is very simple. We base it on the
`ElementProxy` class in `python-docx`. This class is meant for storing and
manipulating the children of an XML element.

```
class TextBox(ElementProxy):
    """Implements textboxes. Requires an `<mc:AlternateContent>` element."""

    def __init__(self, element, parent):
        super(TextBox, self).__init__(element, parent)
        try:
            (tbox1,) = element.xpath(XPATH_CHOICE)
            (tbox2,) = element.xpath(XPATH_FALLBACK)
        except ValueError as err:
            raise ValueError(
                "This element is not a text box; it should contain precisely "
                "two ``<w:txbxContent>`` objects"
            ) from err
        self.tbox1 = TextBoxContent(tbox1, self)
        self.tbox2 = TextBoxContent(tbox2, self)
```

So far this is just good for storing the text box; we still need some code to
actually manipulate it. It would also be great to have a way to find all the
text boxes in a document. This is as simple as finding all the
`<mc:AlternateContent>` elements with precisely two `<w:txbxContent>` elements.
We can use the following function:

```
def find_textboxes(element, parent):
    """
    List all text box objects in the document.

    Looks for all ``<mc:AlternateContent>`` elements, and selects those
    which contain a text box.
    """
    alt_cont_elems = element.xpath(".//mc:AlternateContent")
    text_boxes = []
    for elem in alt_cont_elems:
        tbox1 = elem.xpath(XPATH_CHOICE)
        tbox2 = elem.xpath(XPATH_FALLBACK)
        if len(tbox1) == 1 and len(tbox2) == 1:
            text_boxes.append(TextBox(elem, parent))
    return text_boxes
```

We then update the `Document` class with a new `textboxes` attribute:

```
@property
def textboxes(self):
    """
    List all text box objects in the document.
    """
    return find_textboxes(self._element, self)
```

Now let’s test this out:

```
document = Document("resume.docx")
document.textboxes
```

```
[<docx.oxml.textbox.TextBox at 0x7faf395c3bc0>,
<docx.oxml.textbox.TextBox at 0x7faf395c3100>]
```

Now to manipulate the “Skills” section as we initially wanted, we first find the
right paragraph. Since the two `<w:txbxContent>` objects have the same
paragraphs, we need to find which paragraph *number* contains the text, and in
which text box:

```
import re

def find_paragraph(pattern):
    for textbox in document.textboxes:
        for i, p in enumerate(textbox.paragraphs):
            if pattern.search(p.text):
                return textbox, i

pattern = re.compile("Skills")
textbox, i = find_paragraph(pattern)
print(textbox.paragraphs[i].text)
```

```
Skills
```

Now to insert a new skill, we need to create a new paragraph with the text
“Microsoft Word”. For this we can find the paragraph right after it, and use
that paragraph’s `insert_paragraph_before` method with the appropriate text and
style information. The paragraph in question is the one containing the word
“Research”. I want to copy the style of this paragraph to the new paragraph, but
for some reason the style information is empty for this paragraph. However, I
know that the style of this paragraph should be `'Skillsentries'`, so I can
just use that directly.

```
style = document.styles["Skillsentries"]
pattern = re.compile("Research")
textbox, i = find_paragraph(pattern)
p1 = textbox.tbox1.paragraphs[i]
p2 = textbox.tbox2.paragraphs[i]
for p in (p1, p2):
    p.insert_paragraph_before("Microsoft Word", style)
document.save("CV.docx")
```

When we now open the Word document, we see the item “Microsoft Word” in the list of skills, with the right style and everything. I did cheat a little: I needed to make some additional technical changes to the code for this all to work, but the details are not super important. If you want to use this feature, you can use my fork of python-docx. My solution is still a little hacky, so I don’t think it will be added to the main repository, but it works fine for my purposes.

In summary, we *can* use Python to edit Word documents. However, the
`python-docx` package is not fully mature, and using it to edit highly stylized
Word documents is a bit painful (but possible!). It is, however, quite easy to
extend with new functionality in case you need to. On the other hand, there is
quite extensive functionality in Visual Basic for editing Word documents, and
the whole Word API is built around Visual Basic.

While I now have all the tools available to automatically update my CV using Python, I will actually refrain from doing so. It is a lot of work to set up properly, and it needs active maintenance every time I want to change the styling of my CV. It’s probably a better idea to just edit it manually every time I need to. Automation isn’t always worth it. But I wouldn’t be surprised if this newfound skill turns out to be useful at some point in the future.

What constitutes ‘realistic’ blur obviously depends on context, but in the case of taking pictures with a hand-held camera or smartphone, it includes both motion blur and a form of lens blur. Generating lens blur is easy; we can just use a Gaussian blur. For motion blur we previously looked only at straight lines, but this isn’t very realistic. Natural motion is rarely in a straight line; it is more erratic.

To model this we can take inspiration from physical processes such as Brownian motion: we model motion blur as the path taken by a particle with an initial velocity that is constantly perturbed during the motion. We want to add Gaussian blur on top of that, which can be done by taking the image of such a path and convolving it with a Gaussian point spread function. However, we should also take the speed of the particle into account: if we move a camera very fast, then the camera spends less exposure time at any particular point. Therefore we make the intensity of the blur inversely proportional to the speed at each point. The end result looks something like this:

In practice we will use this kind of blur at a much smaller resolution, for example 15x15. Below we show how such a kernel affects the St. Vitus image, for example.
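A minimal sketch of such a kernel generator (all parameter names and scales here are illustrative choices of mine, not the exact code behind the figures):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_motion_kernel(size=15, steps=300, perturb=0.25, lens_sigma=0.7, seed=0):
    # A particle with an initial velocity gets its velocity perturbed at
    # every step; the exposure deposited at each point is inversely
    # proportional to the particle's speed there.
    rng = np.random.default_rng(seed)
    pos = np.array([size / 2, size / 2])
    vel = rng.normal(scale=0.5, size=2)
    kernel = np.zeros((size, size))
    for _ in range(steps):
        vel += rng.normal(scale=perturb, size=2)
        speed = np.linalg.norm(vel) + 1e-6
        pos = np.clip(pos + 0.1 * vel, 0, size - 1 - 1e-9)
        kernel[int(pos[0]), int(pos[1])] += 1.0 / speed
    # lens blur on top: convolve the path image with a Gaussian PSF
    kernel = gaussian_filter(kernel, lens_sigma)
    return kernel / kernel.sum()  # normalize to unit mass
```

Changing the perturbation scale trades off between nearly straight motion blur and a very erratic path.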

Recall that in the Richardson-Lucy algorithm we try to solve the deconvolution problem \(y=x*k\) by using an iteration of form

\[x_{i+1} = x_i\odot \left(\frac{y}{x_i*k}*k^*\right)\]This method is completely symmetric in \(k\) and \(x\), so given an estimate \(x_i\) of \(x\) we can recover the kernel \(k\) by the same method:

\[k_{j+1} = k_j\odot \left(\frac{y}{x*k_j}*x^*\right)\]A simple idea for blind deconvolution is therefore to alternatingly estimate \(k\) from \(x\) and vice-versa. We can see the result of this procedure below:
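The alternating scheme can be sketched with FFT-based convolutions. This is a rough sketch under my own choices (`mode="same"` alignment, and cropping the kernel update to the kernel’s central support); the actual implementation behind the figures may differ:

```python
import numpy as np
from scipy.signal import fftconvolve

def rl_image_step(x, k, y, eps=1e-12):
    # x <- x ⊙ ((y / (x*k)) * k^*), with k^* the flipped kernel
    denom = fftconvolve(x, k, mode="same") + eps
    return x * fftconvolve(y / denom, k[::-1, ::-1], mode="same")

def rl_kernel_step(k, x, y, eps=1e-12):
    # The same update with the roles of x and k exchanged; the image-sized
    # correlation is cropped to the kernel's support around the center.
    denom = fftconvolve(x, k, mode="same") + eps
    corr = fftconvolve(y / denom, x[::-1, ::-1], mode="same")
    ci, cj = corr.shape[0] // 2, corr.shape[1] // 2
    hi, hj = k.shape[0] // 2, k.shape[1] // 2
    k = k * corr[ci - hi:ci - hi + k.shape[0], cj - hj:cj - hj + k.shape[1]]
    return k / k.sum()  # keep the kernel normalized to sum 1
```

Alternating these two steps, starting from a flat kernel and `x = y`, gives the procedure described above.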

The problem with this Richardson-Lucy-based algorithm is that the point spread function tends to converge to a (shifted) delta function. This is an inherent problem with many blind deconvolution algorithms, especially those based on finding a maximum a posteriori (MAP) estimate of the kernel and image combined. For this particular algorithm it isn’t immediately obvious why this happens, since the analysis of the algorithm is relatively complicated; somehow the kernel update step tends to promote sparsity. This happens irrespective of how we initialize the point spread function, or of the relative number of steps spent estimating the PSF versus the image.

There are heuristic ways to get around this, but overall it is difficult to make a technique like this work well. It also doesn’t use the wonderful things we learned about image priors in part 2. We need a method that can actively avoid converging to extreme points such as this delta function.

In part 2 we discussed different image priors, of which the most promising is based on non-local self-similarity. This assigns to an image \(x\) a score \(L(x)\) signifying how ‘natural’ the image is. We saw that it indeed gives higher scores to images that are appropriately sharpened. A simple idea is then to try different point spread functions, and use the one with the highest score. If we denote by \(x(k)\) the result of applying deconvolution with kernel \(k\), then we want to solve the maximization problem \(\max_{k}L(x(k))\)

If we naively try to maximize this function, we run into the problem that the space of all kernels is quite large; a \(15\times 15\) kernel needs \(15^2=225\) parameters. Since computing the image prior is relatively expensive (as is the deconvolution), exploring this large space is not feasible. Moreover, the function is relatively noisy, and it can assign large scores to oversharpened images.

We therefore need a way to describe the point spread functions using only a few parameters. Moreover, this description should actively avoid points that are not interesting, such as a delta function or a point spread function that would result in heavy oversharpening of the image.

There are many ways to describe a point spread function using only a few parameters. The one I propose is to write it as a sum of a small number of Gaussian point spread functions. However, instead of the centered, symmetric Gaussians we have considered so far, we allow an arbitrary mean and covariance matrix, which change the center and the shape of the point spread function respectively. That is, the kernel depends on the parameters \(\mu=(\mu_1,\mu_2)\) and a \(2\times 2\) (symmetric, positive definite) matrix \(\Sigma\). The point spread function is then given by

\[k[i,j]\propto \exp\left(-\tfrac12(i-\mu_1,j-\mu_2)\Sigma^{-1}(i-\mu_1,j-\mu_2)^\top\right),\qquad\sum_{i,j}k[i,j]=1\]To be precise, we can describe the covariance matrix \(\Sigma\) using three parameters \(\lambda_1,\lambda_2>0\) and \(\theta\in[0,\pi)\) using the decomposition

\[\Sigma = \begin{pmatrix}\cos\theta &\sin\theta\\-\sin\theta&\cos\theta\end{pmatrix} \begin{pmatrix}\lambda_1&0\\0&\lambda_2\end{pmatrix} \begin{pmatrix}\cos\theta &-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}\]We then use additional mixture weights \(t_1,\dots,t_n\) to combine \(n\) kernels of this type into the sum \(t_1k_1+t_2k_2+\dots+t_nk_n\).

This gives a total of 6 parameters per mixture component, but for the first component we can fix the
mean \(\mu\) to \(0\) and the magnitude \(t_1\) to 1, reducing it to 3 parameters. For now we will
just use two mixture components (\(n=2\)), and focus our attention on *how* to optimize this.
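This parameterization can be sketched as follows (function names are mine; the construction matches the formulas above):

```python
import numpy as np

def gaussian_psf(size, mu, lams, theta):
    # Covariance Σ = R(θ) diag(λ1, λ2) R(θ)^T, mean offset mu from the center.
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, s], [-s, c]])
    Sigma_inv = np.linalg.inv(R @ np.diag(lams) @ R.T)
    r = np.arange(size) - size // 2
    I, J = np.meshgrid(r, r, indexing="ij")
    d = np.stack([I - mu[0], J - mu[1]], axis=-1)
    q = np.einsum("...i,ij,...j->...", d, Sigma_inv, d)  # Mahalanobis form
    k = np.exp(-0.5 * q)
    return k / k.sum()

def mixture_psf(size, components):
    # components: iterable of (t, mu, (λ1, λ2), θ); result renormalized to sum 1
    k = sum(t * gaussian_psf(size, mu, lams, theta)
            for t, mu, lams, theta in components)
    return k / k.sum()
```

With two components and the first one pinned down as described, the optimizer only ever sees the remaining free parameters.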

We now know how to parameterize the point spread functions, and what function we want to optimize
(the image prior). Next is deciding *how* to optimize this. In this case we have a complicated,
noisy function that is expensive to compute, and with no easy way to compute its derivatives. In
situations like these Bayesian optimization or other methods of ‘black-box’ optimization make the
most sense.

How this works is that we sample our function \(L\colon \Omega\to \mathbb R\) at several points
\((z_1,\dots,z_n)\in\Omega\), where \(\Omega\) is our parameter search space. Based on these samples,
we build a *surrogate model* \(\widetilde L\colon\Omega\to \mathbb R\) for the function \(L\). We
can then optimize the surrogate model \(\widetilde L\) to obtain a new point \(z_{n+1}\). We then
compute \(L(z_{n+1})\) and update the surrogate model with this new information. This is repeated a
number of times, or until convergence. As long as the surrogate model is good, this can find an
optimal point of the function \(L\) of interest much faster than many other optimization methods.

The key property of this surrogate model is that it should be easy to compute, yet still model the true function reasonably well. In addition, we want to incorporate uncertainty into the surrogate model. Uncertainty enters in two ways: the function \(L\) may be noisy, and the surrogate model will be more accurate close to previously evaluated points. This leads to Bayesian optimization. The surrogate model is probabilistic in nature, and during optimization we can sample points both to reduce the variance (explore regions where the model is unsure) and to reduce the expectation (explore regions of the search space where the model thinks the optimal point should lie).

One type of surrogate model that is popular for this purpose is the Gaussian process (GP) (also known as ‘kriging’ in this context). We will give a brief description of Gaussian processes. We model the function values of the surrogate model \(\widetilde L\) as random variables. More specifically we model the function value at a point \(z\) to depend on the samples:

\[\widetilde L(z) \mid z_1,\dots,z_n \sim N(\mu,\sigma^2),\]where the mean \(\mu\) is a weighted average of the values at the sampled points \((z_1,\dots, z_n)\), weighted according to the distances \(\|z-z_i\|\). The variance \(\sigma^2\) is determined by a function \(K(z,z') = K(\|z-z'\|)\) describing the covariance structure between two points; note that \(K\) depends only on the distance between the points. At the sampled points \((z_1,\dots,z_n)\) we know the function \(\widetilde L(z)\) to high accuracy, so the predictive variance there is small, but as we move further away from all of the sampled points the variance increases.
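To make this concrete, here is a bare-bones zero-mean GP surrogate with an RBF covariance (a sketch with unit prior variance and no hyperparameter fitting; scikit-optimize handles all of this, plus the acquisition function, internally):

```python
import numpy as np

def rbf(a, b, length=1.0):
    # squared-exponential covariance between rows of a (n,d) and b (m,d)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(z_train, l_train, z_test, noise=1e-8, length=1.0):
    # Posterior mean and pointwise variance of a zero-mean GP surrogate.
    K = rbf(z_train, z_train, length) + noise * np.eye(len(z_train))
    Ks = rbf(z_test, z_train, length)
    mean = Ks @ np.linalg.solve(K, l_train)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```

Near the sampled points the posterior variance collapses, while far away it reverts to the prior variance, which is exactly the explore/exploit information Bayesian optimization needs.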

Because of the specific structure of the Gaussian process model, it is easy to fit to data and make
predictions at new points. As a result, an optimal value for this surrogate model is easy to compute.
We will use an implementation of GP-based Bayesian optimization from `scikit-optimize`. All in all
this gives us the results shown below.

As we can see, the estimated point spread function is still far from perfect, but nevertheless the deblurred image looks better than the blurred image. If we blur the image with larger kernels, or with stronger blur overall, recovery becomes even harder with this method. If we apply it to a different image, the results are comparable. One apparent problem is that the estimated point spread function tends to shift the image. Fortunately this can be corrected, either by adjusting the point spread function or by shifting the image after deconvolution.

There are probably several reasons why this model doesn’t give perfect results. First, the image prior isn’t perfect: most image priors tend to give quite noisy outputs, or give high scores to artifacts created by the deconvolution algorithm. Second, the parameter space of this model is still quite big, especially since the prior function can depend on these parameters in complicated ways. That said, many methods in the literature use even larger search spaces for the kernels; some algorithms use no compression of the search space at all and still claim good results.

While I knew from the get-go that blind deconvolution is hard, it turned out to be even harder to do right than I expected. I read a lot of literature on the subject, and I learned a lot. Many papers give interesting algorithms and ideas for blind deconvolution. What I found, however, is that most papers were quite vague in their descriptions and almost never included code. This makes doing research in this field quite difficult, since it can be very hard to estimate whether a method is actually useful. Moreover, if a method does look promising, implementing it without adequate details can be very difficult.

]]>We will explore two methods to improve the deconvolution method. The first is a simple modification of our current method, and the second is a more expensive iterative method for deconvolution that works better for sparse kernels.

Recall that deconvolution comes down to solving the equation
\(y = k*x,\)
where \(y\) is the observed (blurred) image, \(k\) is the point-spread function, and \(x\) is the
unobserved (sharp) image. If we take a discrete Fourier transform (DFT), then this equation becomes
\(Y = K\odot X,\)
where capital letters denote the Fourier-transformed variables, and \(\odot\) is *pointwise*
multiplication. To solve the deconvolution problem we can then divide pointwise by \(K\) and apply
the inverse Fourier transform. Because \(K\) may have zero or near-zero entries, we can run into
numerical instability. A quick fix is to instead multiply by \(K^* / (|K|^2+\epsilon)\), giving the
solution

\[x^* = \mathcal F^{-1}\left(Y \odot \frac{K^*}{|K|^2+\epsilon}\right)\]

This is fast to compute, and gives decent results. This simple method of deconvolution is known as
the Wiener filter. In the situation where there is some noise \(n\) such that \(y=k*x+n\), this
corresponds (for a certain value of \(\epsilon\)) to the \(x^*\) minimizing the expected square error
\(E(\|x-x^*\|^2)\). Instead of minimizing the error, we can accept that \(\|k*x-y\|\approx \|n\|^2\),
and then find the *smoothest* \(x^*\) with that error, to avoid ringing artifacts. Smoothness can be
modeled by the Laplacian \(\Delta x^*\). This leads to the problem

\[\min_{x^*}\; \|k*x^* - y\|^2 + \gamma\,\|\Delta x^*\|^2\]

If \(L\) is the Fourier transform of the Laplacian kernel, then the solution to this problem has form

\[x = \mathcal F^{-1}\left(Y \odot \frac{K^*}{|K|^2+\gamma |L|^2}\right)\]where the parameter \(\gamma>0\) is determined by the noise level. In the end this is a simple modification to the Wiener filter, that should give less ringing effects. Let’s see what this does in practice.
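Both variants fit in a few lines of NumPy. A sketch assuming periodic boundaries, with `eps` and `gamma` playing the roles of \(\epsilon\) and \(\gamma\) in the formulas above:

```python
import numpy as np

def wiener(y, k, eps=1e-3, gamma=None):
    # Wiener filter via FFT; pass gamma to use Laplacian regularization.
    K = np.fft.fft2(k, s=y.shape)  # kernel zero-padded to the image size
    if gamma is None:
        denom = np.abs(K) ** 2 + eps
    else:
        # periodic 5-point Laplacian stencil, anchored at the origin
        lap = np.zeros(y.shape)
        lap[0, 0] = -4
        lap[0, 1] = lap[1, 0] = lap[0, -1] = lap[-1, 0] = 1
        denom = np.abs(K) ** 2 + gamma * np.abs(np.fft.fft2(lap)) ** 2
    return np.real(np.fft.ifft2(np.fft.fft2(y) * np.conj(K) / denom))
```

For a circularly blurred image and a well-conditioned kernel, a tiny `eps` recovers the original almost exactly; larger values trade sharpness for noise suppression.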

In the picture above we tried to remove a motion blur consisting of a diagonal strip of 10 pixels. The deblurring is done with a kernel of 9.6 pixels (the last pixel on either end is dimmed). We do this both with and without the Laplacian, with the amount of regularization chosen so that the two methods show a similar amount of ringing artifacts. The two results look very similar, and if anything the method without the Laplacian may look a little sharper. The reason the methods behave so similarly is probably that the Fourier transform of the Laplacian (shown below) has a fairly spread-out distribution, and is therefore not too different from the uniform weighting used in the Wiener filter.

There are many iterative deconvolution methods; one often-used method in particular is Richardson-Lucy deconvolution. The iteration step is given by

\[x_{k+1} = x_k\odot \left(\frac{y}{x_k*k}*k^*\right)\]Here \(k^*\) is the flipped point spread function; its Fourier transform is the complex conjugate of the Fourier transform of \(k\). As the initial iterate we typically pick \(x_0=y\). Note that if \(\sum_{i,j}k_{ij} = 1\), then \(\mathbf 1*k = \mathbf 1\), with \(\mathbf 1\) a constant 1 signal. Therefore if we plug in \(x_k = \lambda x\) we obtain

\[x_{k+1} = x_k \odot \left(\frac{y}{\lambda y}*k^*\right) = x_k\odot \frac1\lambda \mathbf 1*k^* = \frac{x_k}\lambda = x\]This shows that \(x\) is a fixed point of the Richardson-Lucy algorithm, and that any scalar multiple \(\lambda x\) is even mapped straight back to \(x\) in a single step. For more general starting points, however, there is no obvious convergence guarantee. In practice on natural images, if initialized with \(x_0=y\), it does seem to converge. Below we try this algorithm for different numbers of iterations, using the same image and point spread function as before.
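A compact FFT-based implementation of this iteration (a sketch assuming periodic boundaries; convolving with \(k^*\) is multiplication by \(\overline K\) in the Fourier domain):

```python
import numpy as np

def richardson_lucy(y, k, iters=50, eps=1e-12):
    # Periodic-boundary Richardson-Lucy with x0 = y.
    K = np.fft.fft2(k, s=y.shape)
    conv = lambda im, F: np.real(np.fft.ifft2(np.fft.fft2(im) * F))
    x = y.copy()
    for _ in range(iters):
        # x <- x ⊙ ((y / (x*k)) * k^*); eps guards against division by zero
        x = x * conv(y / (conv(x, K) + eps), np.conj(K))
    return x
```

The multiplicative update preserves positivity of the image, which is one of the appealing properties of this method.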

We see very similar ringing artifacts as with the Wiener filter. The number of iterations of the algorithm is related to the size of the regularization constant. The more iterations, the sharper the image is, but also the more pronounced the ringing artifacts are.

Like with the Wiener filter, we need to add a small positive constant when dividing, to avoid division-by-zero errors. Unlike the Wiener filter however, Richardson-Lucy deconvolution is very insensitive to the amount of regularization used.

Richardson-Lucy deconvolution is much slower than the Wiener filter, requiring perhaps 100 iterations to
reach a good result, with each iteration taking roughly as long as applying the Wiener filter. Fortunately
the algorithm is easy to implement on a GPU, and each iteration on the (426, 640) image above takes
only about 1 ms on my computer with a simple GPU implementation using `cupy`.

One issue that I have so far swept under the rug is the problem of boundary effects. If we convolve an \((n,m)\) image with an \((\ell,\ell)\) kernel, then the result is an image of size \((n+\ell-1, m+\ell-1)\), and not \((n,m)\). There is typically a ‘fuzzy border’ around the image, which we crop away when displaying, but not when deconvolving. In real life we don’t have the luxury of this fuzzy border around the image, and that can lead to heavy artifacts when deconvolving. Below is the St. Vitus church image blurred with a \(\sigma=3\) Gaussian blur, and subsequently deblurred using a Wiener filter with and without the border around the image.

The ringing at the boundary is known as Gibbs oscillation. The reason it occurs is because the deconvolution method implicitly assumes the image is periodic. This is because the convolution theorem (stating that convolution becomes multiplication after a (discrete) Fourier transform) needs the assumption that the signal is periodic. If we would periodically stack a natural image we would find a sudden sharp transition at the boundary, and this contributes to high-frequency components in the Fourier transform, giving the sharp oscillations at the boundary.

The more we regularize the deconvolution, the smaller the boundary effects. This is because regularization essentially acts as a low-pass filter, removing high-frequency effects. However, this also blurs the image considerably. Richardson-Lucy deconvolution has essentially the same problem.

The straightforward way to deal with this problem is to extend the image to mimic the ‘fuzzy’ border introduced by convolution. Better yet, we should pad the image in such a way that the image is as regular as possible when stacked periodically. This is the strategy employed by Liu and Jia: they extend the image to be periodic using three different ‘tiles’ stacked in the pattern shown below. The image is then cropped to the dotted line, and this gives a periodic image. The tiles are optimized such that the image is continuous along each boundary, and such that the total Laplacian is minimized.

There are many similar methods in the literature. Unfortunately, all of these methods are complicated, and very few include a reference implementation; if there is one, it is almost always in Matlab. This seems to be a general problem in the (de)convolution and image-processing literature: for some reason it is not standard practice in this community to include code with papers, and descriptions of algorithms are often vague and require significant work to translate into working code. I found a Python implementation of Liu and Jia’s algorithm at this GitHub repository.

Below we see the Laplacian of the image extended using Liu and Jia’s method, using zero padding, and using reflection. Both in the reflected image and in the one using Liu and Jia’s method there are no large values of the Laplacian around the border, thanks to the soft transition at the border.

Next we can check whether these periodic extensions of the image actually reduce boundary artifacts when deconvolving. Below we see the three methods for both Wiener and Richardson-Lucy (RL) deconvolution in action on an image distorted with \(\sigma=3\) Gaussian blur.

We can see that Liu and Jia’s method gives a significant improvement, especially for the Wiener filter. More strikingly, the reflective padding works even better. This is because the convolution that distorted the image implicitly used reflective padding as well; if you change the settings of the convolution blurring the image, the results will not be as good. Liu and Jia’s method probably works best out of the box on images blurred by natural means.
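Reflective padding is the easiest of these strategies to implement: pad, deconvolve, crop. A sketch, where `deconvolve` stands for any routine such as the Wiener filter:

```python
import numpy as np

def deconvolve_padded(y, k, deconvolve, pad=32):
    # Reflect-pad the blurred image, deconvolve, then crop the pad away.
    yp = np.pad(y, pad, mode="reflect")
    xp = deconvolve(yp, k)
    return xp[pad:-pad, pad:-pad]
```

The pad width should be at least a few times the kernel radius, so that the artificial boundary is far from the pixels we keep.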

It is interesting to note that Richardson-Lucy deconvolution suffers heavily in quality regardless of the padding method. If we look at motion blur instead of Gaussian blur, however, the roles are somewhat reversed: for the Wiener filter we have to use fairly aggressive regularization to avoid excessive artifacts, whereas RL deconvolution works without problems.

We have reiterated the fact that even non-blind deconvolution can be a difficult problem. The relatively simple Wiener filter in general does a good job, and changing it to use a Laplacian for regularization doesn’t seem to help much. The Richardson-Lucy algorithm often performs comparably to the Wiener filter, although it seems to perform relatively better for sparse kernels like the motion blur kernel we used.

Until now we had completely ignored boundary problems, which is not something we can do with real images. Fortunately, we can deal with these issues by appropriately padding the image. Simply padding with reflections of the image works quite well, though how well depends on how the image was blurred in the first place. Extending the image to be periodic while minimizing the Laplacian is more complicated, but also works well, and probably performs better on natural images.

In the next (and hopefully final) part we will dive into some simple approaches to blind deconvolution, starting with a modification of the Richardson-Lucy algorithm, and then trying to use what we learned about image priors in part 2.

]]>The next step is then to try to do deconvolution when we have partial information about how the image was distorted. For example, we know that a lens is out of focus, but we don’t know exactly by how much. In that case we have only one variable to control: a scalar amount of blur (or perhaps two, if the amount of blur differs between directions). In this case we can simply try deconvolution for a few values, and see which image seems *most natural*.

Below we have the image of the St. Vitus church in my hometown distorted with Gaussian blur with \(\sigma=2\), and then deblurred with several different values of \(\sigma\). Looking at these images we can see that \(\sigma=2.05\) and \(\sigma=2.29\) look best, and \(\sigma=2.53\) is over-sharpened. The real challenge lies in finding a concrete metric to automatically decide which of these looks most natural. This is especially hard since it is not clear even to the human eye. The fact that \(\sigma=2.29\) looks very good probably means that the original image wasn’t completely sharp itself; we don’t have a good ground truth for what it means for an image to be perfectly sharp.

Measures of naturality of an image are often called *image priors*. They can be used to define a prior distribution on the space of all images, giving higher probability to images that are natural over those that are unnatural. Often image priors are based on heuristics, and different applications need different priors.

Many simple but effective image priors rely on the observation that most images have a *sparse gradient distribution*. An *edge* in an image is a sharp transition. The *gradient* of an image measures how fast the image is changing at every point, so an edge is a region in the image where the gradient is large. The gradient of an image can be computed by convolution with different kernels. One such kernel is the Sobel kernel:

\[S_x = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix}\]

Here convolution with \(S_x\) gives the gradient in the horizontal direction, and it is large when encountering a vertical edge, since the image is then making a fast transition in the horizontal direction. Similarly, convolution with the transposed kernel \(S_y = S_x^\top\) gives the gradient in the vertical direction. If \(X\) is our image of interest, we can then define the *gradient transformation* of \(X\) by

\[|\nabla X| = \sqrt{(S_x * X)^2 + (S_y * X)^2}\]
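The gradient transformation takes only a few lines with SciPy (a sketch; the reflective boundary handling of `scipy.ndimage.convolve` is my choice here):

```python
import numpy as np
from scipy.ndimage import convolve

def gradient_magnitude(X):
    # |∇X| from horizontal and vertical Sobel responses
    Sx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    gx = convolve(X, Sx)    # strong response at vertical edges
    gy = convolve(X, Sx.T)  # strong response at horizontal edges
    return np.hypot(gx, gy)
```

On a flat region both responses vanish (the Sobel kernel sums to zero), so the output is concentrated on the edges, as intended.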

Below we can see this gradient transformation in action on the six images shown above:

Here we can see that the gradients become larger in magnitude as \(\sigma\) increases. For \(\sigma = 2.47\) a large part of the image is detected as gradient – the edges have stopped being sparse at this point. For the first four images the edges are sparse, with most of the image consisting of slow transitions.

Below we look at the distribution of the gradients after deconvolution with different values of \(\sigma\). We see that the distribution stays mostly constant, slowly increasing in overall magnitude. But near \(\sigma=2\), the overall magnitude of the gradients suddenly increases sharply.

This suggests that to find the optimal value of \(\sigma\) we can look at these curves and pick the value of \(\sigma\) where the gradient magnitude starts to increase quickly. This is however not very precise, and ideally we have some function which has a minimum near the optimal value of \(\sigma\). Furthermore this curve will look slightly different for different images. This is a good starting point for an image prior, but is not useful yet.

Instead of using the gradient to obtain the edges in the image, we can use the Laplacian. The gradient \(|\nabla X|\) is the first derivative of the image, whereas the Laplacian \(\Delta X\) is given by the sum of second partial derivatives of the image. Near an edge we don’t just expect the gradient to be big, but we also expect the gradient to change fast. This is because edges are usually transient, and not extended throughout space.

We can compute the Laplacian by convolving with the following kernel:

\[\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}\]Note that the Laplacian can take on both negative and positive values, unlike the absolute gradient transform we used before. Below we show the absolute value of the Laplacian transformed images. This looks similar to the absolute gradient, except that the increase in intensity with increasing \(\sigma\) is more pronounced.

Above we can see that there is an overall increase in the magnitude of the gradients and Laplacian as \(\sigma\) increases. We want to measure how sparse these gradient distributions are, and this has more to do with the shape of the distribution rather than the overall magnitude. To better see how the shape changes it therefore makes sense to normalize so that the total magnitude stays the same. We therefore don’t consider the distribution of the gradient \(|\nabla X|\), but rather of the normalized gradient \(|\nabla X| / \|\nabla X\|_2\). Since the mean absolute value is essentially the \(\ell_1\)-norm, this is also referred to as the \(\ell_1/\ell_2\)-norm of the gradients \(\nabla X\).
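Reducing the normalized gradient distribution to a single sparsity score gives the \(\ell_1/\ell_2\) measure just described; a minimal sketch:

```python
import numpy as np

def l1_over_l2(grad):
    # Sparsity score of a gradient map: smaller means sparser.
    g = np.abs(grad).ravel()
    return g.sum() / (np.linalg.norm(g) + 1e-12)
```

For a vector of length \(N\) the score ranges from 1 (a single nonzero entry) to \(\sqrt N\) (perfectly uniform), so it measures the shape of the distribution independently of its overall magnitude.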

The normalized gradient distribution is plotted below as function of \(\sigma\), the distributions of the Laplacian look similar. This distribution already looks a lot more promising since the median has a minimum near the optimal value for \(\sigma\). This minimum is a passable estimate of the optimal value of \(\sigma\) for this particular image. For other images it is however not as good. Moreover the function only changes slowly around the minimum value, so it is hard to find in an optimization routine. We therefore need to come up with something better.

The \(\ell^1/\ell^2\) prior is a good starting point, but we can do better with a more complex prior based on *non-local self-similarity*. The idea is to divide the image into many small patches of \(n\times n\) pixels, with for example \(n=5\). For each patch we can then check how many other patches in the image look similar to it. This concept is called non-local self-similarity, since it’s non-local (we compare a patch with patches throughout the entire image, not just in a neighborhood) and uses self-similarity (we look at how similar some parts of the image are to other parts of the same image; we never use an external database of images, for example).

The full idea is a bit more complicated. Let’s denote each \(n\times n\) patch by

\[P(i,j) = X[ni:n(i+1),\, nj:n(j+1)].\]We consider this patch as a length-\(n^2\) vector. Moreover since we’re mostly interested in the patterns represented by the patch, and not by the overall brightness, we normalize all the patch vectors to have norm 1. We then find the closest matching \(k\) patches, minimizing the Euclidean distance:

\[\operatorname{argmin}_{i',j'} \|P(i,j) - P(i',j')\|\]Below we show an 8x8 patch in the St. Vitus image (top left) together with its 11 closest neighbors.

Note that we look at patches closest in *Euclidean distance*; this does not necessarily mean the patches are visually similar. Visually very similar patches can have a large Euclidean distance; for example, the two patches below are orthogonal (and hence have maximal Euclidean distance), despite being visually similar. One could come up with better measures of visual similarity than Euclidean distance, probably something invariant under small shifts, rotations and mirroring, but this would come at an obvious cost of increased (computational) complexity.

The \(k\) closest patches, together with the original patch \(P(i,j)\), are put into an \(n^2\times (k+1)\) matrix, called the *non-local self-similar (NLSS) matrix* \(N(i,j)\). We are interested in some linear-algebraic properties of this matrix. One observation is that the NLSS matrices tend to be of low rank for most patches. This essentially means that most patches have other patches that look very similar to them. If all patches in \(N(i,j)\) are the same then its rank is 1, whereas if all the patches are different then \(N(i,j)\) has maximal rank.

However, taking the rank itself is not necessarily a good measure, since it is not numerically stable. Any slight perturbation will always make the matrix of full rank. We rather work with a differentiable approximation of the rank. This approximation is based on the spectrum (singular values) of the matrix. In this case, we can consider the *nuclear norm* \(\|N(i,j)\|_*\) of \(N(i,j)\). It is defined as the sum of the singular values:

\[\|A\|_* = \sum_i \sigma_i(A),\]where \(\sigma_i(A)\) is the \(i\)th singular value. Below we show how the average singular values of the NLSS matrices change with the scale \(\sigma\) of the deconvolution kernel, for 8x8 patches with 63 neighbors (so that the NLSS matrix is square). We see that in all cases most of the energy is in the first singular value, followed by a fairly slow decay. As \(\sigma\) increases, the decay of the singular values slows down. This means that the blurrier the image, the lower the *effective* rank of the NLSS matrices. As such, the nuclear norm of the NLSS matrix gives a measure of the amount of information in the picture.
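A brute-force sketch of building one NLSS matrix and taking its nuclear norm (the neighbor search is done with Faiss in practice, as discussed further below; the names here are my own):

```python
import numpy as np

def nlss_nuclear_norm(patches, idx, k):
    # patches: (num_patches, n*n) array of flattened patches.
    # Build the NLSS matrix of patch `idx` from its k nearest (normalized)
    # patches, by brute force, and return its nuclear norm.
    P = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-12)
    dist = np.linalg.norm(P - P[idx], axis=1)
    nearest = np.argsort(dist)[:k + 1]  # includes the patch itself
    N = P[nearest].T                    # n^2 × (k+1) NLSS matrix
    return np.linalg.svd(N, compute_uv=False).sum()
```

If all selected patches are identical, the NLSS matrix has rank 1 and its nuclear norm collapses to the single nonzero singular value \(\sqrt{k+1}\), matching the intuition above.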

We see that the spectrum of the NLSS matrices seems to give a measure of ‘amount of information’ or sparsity. Since we know that sparsity of the edges in an image gives a useful image prior, let’s compute the nuclear norm \(\|N(i,j)\|_*\) of each NLSS matrix of the gradients of the image. We can actually plot these nuclear norms as an image, as shown below. We can see that the mean nuclear norm is largest around the ground-truth value of \(\sigma\).

It is not immediately clear how to interpret the darker and lighter regions of these plots. Long straight edges seem to have smaller norms since there are many patches that look similar. Since the patches are normalized before being compared, the background tends to look a lot like random noise and hence has relatively high nuclear norm. However, we can’t skip this normalization step either, since then we mostly observe a strict increase in nuclear norms with \(\sigma\).

Repeating the same for the Laplacian gives a similar result:

Now finally to turn this into a useful image prior, we can plot how the mean nuclear norm changes with varying \(\sigma\). Both for the gradients and Laplacian of the image we see a clear maximum near \(\sigma=2\), so this looks like a useful image prior.

There are a few hyperparameters to tinker with for this image prior. There is the size of the patches; in practice something like 4x4 to 8x8 seems to work well for the size of images we’re dealing with. We can also lower or raise the number of neighbors computed. Finally, we don’t need to divide the image into disjoint patches exactly: we can *oversample*, and put a spacing of less than \(n\) pixels between consecutive \(n\times n\) patches. This results in a less noisy curve of NLSS nuclear norms, at extra computational cost. On the other hand, we can also *undersample* and use only a quarter of the patches, which can greatly improve speed.

The image above was made for \(6\times 6\) patches with 36 neighbors. Below we make the same plot with \(6\times 6\) patches, but only taking 1/16th of the patches and only 5 neighbors. This results in a much more noisy image, but it runs over 10x faster and still gives a useful approximation.

One final thing of note is how the NLSS matrices \(N(i,j)\) are computed. Finding the closest \(k\) patches by brute force, computing the distance between each pair of patches, is extremely inefficient. Fortunately there are more efficient ways of solving this *similarity search* problem. These methods usually first build an index or tree structure storing some information about all the data points. This can be used to quickly find a set of points that are close to the point of interest, and searching only within this set significantly reduces the amount of work. This is especially true if we only care about finding the \(k\) closest points approximately, since that means we can reduce the search space even further.

We used Faiss to solve the similarity search problem, since it is fast and runs on GPU. There are many packages that do the same, some faster than others depending on the problem. There is also an implementation in `sklearn`, but for this particular situation it is over two orders of magnitude slower than Faiss running on GPU.
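As a rough sketch of the index-based approach, here is the same idea with a k-d tree from `scipy` standing in for Faiss, assuming the patches have already been flattened into vectors (this is an illustration, not the code we actually used):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_patches(patches, k):
    """Find the k nearest neighbors of every patch (patches: (num, patch_dim))."""
    tree = cKDTree(patches)           # build the index once
    # query returns distances and indices; the first neighbor is the patch itself
    _, idx = tree.query(patches, k=k)
    return idx

# toy example: 100 random 6x6 patches flattened to vectors of length 36
rng = np.random.default_rng(0)
patches = rng.standard_normal((100, 36))
idx = knn_patches(patches, k=5)
assert idx.shape == (100, 5)
assert (idx[:, 0] == np.arange(100)).all()  # each patch is its own nearest neighbor
```

Tree-based indices like this work well in moderate dimensions; for very high-dimensional patches, approximate methods like those in Faiss scale much better.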

At the end of the day, the bottleneck in the computation speed is the computation of the nuclear norm. This in turn requires computing the singular values of tens of thousands of small matrices. CUDA only supports batched SVD computation of matrices of at most \(32\times 32\) in size, but if we use \(5\times 5\) patches or smaller, we can do this computation on GPU, which makes it up to 4x faster on my machine.
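On the CPU side, a batched nuclear norm computation can be written compactly in numpy, which handles stacks of matrices natively (a sketch, not necessarily the exact code used):

```python
import numpy as np

def batched_nuclear_norm(mats):
    """Nuclear norm (sum of singular values) of a stack of matrices of shape (batch, m, n)."""
    s = np.linalg.svd(mats, compute_uv=False)  # shape (batch, min(m, n))
    return s.sum(axis=-1)

rng = np.random.default_rng(1)
mats = rng.standard_normal((1000, 25, 36))
norms = batched_nuclear_norm(mats)
assert norms.shape == (1000,)
# sanity check against a direct computation for one matrix
assert np.isclose(norms[0], np.linalg.svd(mats[0], compute_uv=False).sum())
```

Since we only need singular values and not singular vectors, `compute_uv=False` saves a significant amount of work.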

The nuclear norms of NLSS matrices seem to give a useful image prior, but to know for sure we need to test it for different images, and also for different types of kernels.

To estimate the best deconvolved image, we will take the average of the optimal values for the NLSS nuclear norms of the gradient and the Laplacian. This is because the Laplacian usually underestimates the ground-truth value, whereas the gradient usually overestimates it. Furthermore, instead of taking the global maximum as the optimal value, we take the *first maximum*: when we oversharpen the image a lot, the strange artifacts we get can actually result in a large NLSS nuclear norm. It can be a bit tricky to detect a local maximum, and if the initial blur is too strong then the prior seems not to work very well.
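Picking the first maximum of a noisy curve can be sketched with `scipy.signal.find_peaks`; the `prominence` threshold here is a hypothetical knob for ignoring small wiggles, not necessarily the heuristic we used:

```python
import numpy as np
from scipy.signal import find_peaks

def first_maximum(values, prominence=0.0):
    """Index of the first local maximum of a (possibly noisy) 1D curve.

    `prominence` filters out tiny wiggles; tune it to the noise level.
    Falls back to the global argmax when no peak is found.
    """
    peaks, _ = find_peaks(values, prominence=prominence)
    return peaks[0] if len(peaks) else int(np.argmax(values))

# a curve with a first maximum at index 20 and a larger one at index 80
x = np.arange(100, dtype=float)
curve = np.exp(-(x - 20) ** 2 / 50) + 2 * np.exp(-(x - 80) ** 2 / 50)
assert first_maximum(curve) == 20
assert int(np.argmax(curve)) == 80  # the global maximum comes later
```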

First let’s try to do semi-blind deconvolution for Gaussian kernels. That is, we know that the image was blurred with a Gaussian kernel, but we don’t know with what parameters. We do this for a smaller and a larger value for the standard deviation \(\sigma\), and notice that for smaller \(\sigma\) the recovery is excellent, but once \(\sigma\) becomes too large the recovery fails.

All the images we use are from the COCO 2017 dataset.

First up is an image of a bear, blurred with a \(\sigma=2\) Gaussian kernel. Deblurring this is easy, and not very sensitive to the hyperparameters used.

Here is the same image of the bear, but now blurred with \(\sigma=4\), and it becomes much harder to recover the image. I found that the only way to do it is to reduce the patch size all the way to \(2\times 2\); for larger patch sizes the image can’t be accurately recovered, and the method always overestimates the value of \(\sigma\).

Below is a picture of some food. For \(\sigma=3\) recovery is excellent, and again not strongly dependent on hyperparameters. For \(\sigma=4\) the problem becomes significantly harder, and it again takes a small patch size for reasonable results.

Now let’s change the blur kernel to an idealized motion blur kernel. Here the point spread function is a line segment of some specified length and thickness, as shown below:

The way I construct these point spread functions is by rasterizing an image of a line segment. I’m sure there’s a better way to do this, but it seems to work fine. The parameters of the kernel are the angle, the length of the line segment and the size of the kernel.
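A minimal version of this rasterization can be done in plain numpy by sampling points densely along the segment and accumulating them into pixels (this sketch ignores the thickness parameter, and is an illustration rather than the exact code I used):

```python
import numpy as np

def motion_blur_psf(size, length, angle_deg, samples_per_pixel=16):
    """Rasterize a line segment of given length (in pixels) and angle through
    the center of a size x size kernel, then normalize to sum 1."""
    psf = np.zeros((size, size))
    theta = np.deg2rad(angle_deg)
    c = (size - 1) / 2.0
    # sample points densely along the segment and accumulate them into pixels
    t = np.linspace(-length / 2, length / 2, int(length * samples_per_pixel))
    rows = np.clip(np.round(c + t * np.sin(theta)).astype(int), 0, size - 1)
    cols = np.clip(np.round(c + t * np.cos(theta)).astype(int), 0, size - 1)
    np.add.at(psf, (rows, cols), 1.0)
    return psf / psf.sum()

k = motion_blur_psf(size=7, length=5, angle_deg=45)
assert np.isclose(k.sum(), 1.0)      # preserves image brightness
assert k[3, 3] > 0                   # the segment passes through the center
```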

Let’s try to apply the method on a picture of some cows below:

Unfortunately our current method doesn’t work well with this kind of point spread function. The nuclear norm of the NLSS matrices is very noisy. I first thought this could be because the PSF doesn’t change continuously with the length of the line segment. But I ruled this out by hard-coding a diagonal line segment in such a way that it changes continuously, and it looks just as bad.

Instead it seems that the (non-blind) deconvolution method itself doesn’t work well for this kernel. Below we see the image blurred with a length-5 diagonal motion blur, and then deconvolved with different values of the length parameter. With the Gaussian blur we only saw significant deconvolution artifacts if we tried to oversharpen an image. Here we see very significant artifacts even if the length parameter is less than 5. I think this is because the point spread function is very discontinuous, and hence its Fourier transform is very irregular.

Additionally, the effect of motion blur on edges is different than that of Gaussian blur. If the edge is parallel to the motion blur, it is not affected or even enhanced. On the other hand, if an edge is orthogonal to the direction of motion blur, the edge is destroyed quickly. This may mean that the sparse gradient prior is not as effective as for Gaussian blur. We have no good way to check this however before improving the deconvolution method.

Having a good image prior is vital for blind deconvolution. Making a good image prior is however quite difficult. Most image priors are based on the idea that natural images have sparsely distributed gradients. We observed that the simple and easy-to-compute \(\ell_1/\ell_2\) prior does a decent job, but isn’t quite good enough. The more complex NLSS nuclear norm prior does a much better job. Using this prior we can do partially blind deconvolution, sharpening an image blurred with Gaussian blur.

However, another vital ingredient for blind deconvolution is good non-blind deconvolution. The current non-blind deconvolution method we introduced in the last part doesn’t work well for non-continuous or sparse point spread functions. There are also problems with artifacts at the boundaries of the image (which I have hidden for now by essentially cheating). This means that if we want to do good blind deconvolution, we first need to revisit non-blind deconvolution and improve our methods.

- Part I: Introduction to convolution and deconvolution
- Part II: Comparing different image priors on a toy problem
- Part III: A deep look at blind deconvolution, and implementing it ourselves

Without further ado, let’s figure out what (blind) deconvolution is in the first place!

There are many types of blur that can be applied to images, but there are arguably two main types.
The first is lens blur, coming from the lens not being perfectly in focus or from imperfections in
the optics. And the second is motion blur, which is caused by the camera or the photographed object
moving. Both of these types of blur can be described by convolution of the image \(x\) with a
*kernel* or *point spread function* (PSF) \(k\):
\[y[i,j] = (x*k)[i,j] = \sum_{m,n} x[i-m,\,j-n]\,k[m,n].\]

One particular PSF is the delta function, whose only nonzero entry is \(\delta[0,0]=1\). It is the identity operation for convolution:

\[x*\delta = x.\]Often point spread functions have finite support; they are only non-zero for a finite number of
entries. In this case we can write the PSF as a matrix, where the *middle* entry corresponds to
\(k[0,0]\). In this case the delta function is a \(1\times 1\) matrix with \(1\) as its only entry.
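We can check this identity numerically, for instance with `scipy.signal.convolve2d`:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.random((32, 32))        # a random "image"
delta = np.array([[1.0]])       # the 1x1 delta kernel

# convolving with the delta function returns the image unchanged
assert np.allclose(convolve2d(x, delta, mode="same"), x)
```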

Another very simple (but not necessarily natural) PSF is given by a constant matrix. For example,

\[k = \frac19\begin{pmatrix}1&1&1\\1&1&1\\1&1&1\end{pmatrix}.\]Here we divide by 9 so that the total sum of entries of \(k\) is 1. This is useful so that convolution with \(k\) preserves the magnitude of \(x\). If not, then the image would become brighter or dimmer after convolution, which we don’t want.

Convolution with a matrix like this has a name; it’s called box blur. It’s a very simple type of blur which replaces each pixel by an average of its neighboring pixels. Its main appeal is that it’s very fast and easy to implement, and to the human eye it looks quite a lot like other types of blur.

Lens blur can be approximated by a Gaussian PSF, i.e. a kernel \(k\) such that

\[k[i,j] \propto \exp\left(-\frac{i^2+j^2}{2\sigma^2}\right),\]for some \(\sigma\). With \(\sigma=1\), each pixel corresponds to one standard deviation of the Gaussian. Visually this looks quite similar to box blur, especially for small amounts of blur, but Gaussian blur is smoother and more accurately emulates lens blur.
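Constructing such a kernel in numpy is straightforward (a sketch; the kernel size and the normalization to sum 1 are practical choices, not part of the formula itself):

```python
import numpy as np

def gaussian_psf(size, sigma):
    """A normalized size x size Gaussian kernel with k[0,0] in the middle."""
    r = np.arange(size) - (size - 1) / 2
    i, j = np.meshgrid(r, r, indexing="ij")
    k = np.exp(-(i ** 2 + j ** 2) / (2 * sigma ** 2))
    return k / k.sum()

k = gaussian_psf(size=9, sigma=1.0)
assert np.isclose(k.sum(), 1.0)
assert k[4, 4] == k.max()  # the maximum sits in the middle of the kernel
```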

Motion blur can be described by a PSF which, when seen as an image, is a line segment. For example, a horizontal line segment through the middle of the PSF is equivalent to camera motion along the horizontal axis. For real-life motion blur this is really only true if the entire scene is equally far away, for example if we consider a spacecraft in orbit taking photos of the Earth’s surface. Otherwise the amount and direction of motion blur is not uniform throughout the image.

We will apply all types of blur to a cropped and scaled image of the St. Vitus church in my hometown taken in 1946 (image credit: Koos Raucamp).

The top-left image shows a delta PSF. The top-right shows box blur with a \(3\times 3\) box. The bottom-left image shows Gaussian blur, with \(\sigma=1\). Finally the bottom-right image shows motion blur with a top-left to bottom-right diagonal line segment of 5 pixels in length.

There is a remarkable relationship between the Fourier transform and convolution, both in the discrete and continuous case. Recall that the discrete Fourier transform (in one dimension) of a signal \(f\) of length \(N\) is defined by

\[\mathcal F(f)[k] = \sum_n f[n]\exp\left(\frac{-i2\pi kn}{N}\right).\]The Fourier transform turns convolution into (pointwise) multiplication:

\[\mathcal F(f*g)[k] = \mathcal F(f)[k]\cdot\mathcal F(g)[k].\]*(This does ignore some issues related to the fact that the signals we consider are not periodic,
and we may need to pad the result with zeros and use appropriate normalization. This result is
actually very easy to prove, although the details are not important right now.)*

This is a very useful property. For one, discrete Fourier transforms can be computed much faster than naively expected using the fast Fourier transform (FFT) algorithm. Naively applying the definition of the discrete Fourier transform to a length \(N\) signal requires \(O(N^2)\) operations, but the FFT runs in \(O(N\log N)\). It does this by recursively splitting the signal in two, an ‘odd’ and an ‘even’ part: it computes the FFT for both halves and then combines the results to get the FFT of the entire signal. We can use the speed of the FFT to compute the convolution of two length \(N\) signals in \(O(N\log N)\) as well, simply by doing

\[f*g = \mathcal F^{-1}(\mathcal F(f)\cdot \mathcal F(g)).\]Another thing is that it makes arithmetic with convolution much easier. For example we can use it to
*deconvolve* a signal. That is, we can solve the following problem for \(x\):
\[y = x * k.\]

We take the discrete Fourier transform on both sides:

\[\mathcal F(y) = \mathcal F(x)\cdot \mathcal F(k).\]Then we divide and take the inverse discrete Fourier transform to obtain:

\[x = \mathcal F^{-1}\left(\frac{\mathcal F(y)}{\mathcal F(k)}\right).\]And indeed this works! However, it requires knowing the kernel \(k\) *exactly*. If it is even
slightly off, we can get strange results. Below we see an original image in the top left, and on
the top right a version blurred with a Gaussian with \(\sigma=2\). On the bottom we respectively
deconvolve with a Gaussian PSF with \(\sigma=2\) and \(\sigma=2.01\). The first looks identical to
the original image, but the second doesn’t look similar at all!
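Here is a sketch of this naive deconvolution in one dimension, assuming circular (periodic) convolution so that no padding issues arise:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256
x = rng.random(N)                                 # the original signal

# a periodic Gaussian kernel centered at index 0, normalized to sum 1
n = np.minimum(np.arange(N), N - np.arange(N))    # circular distance to index 0
k = np.exp(-n.astype(float) ** 2 / (2 * 1.5 ** 2))
k /= k.sum()

# blur via circular convolution, computed with the FFT
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

# naive deconvolution: divide in the Fourier domain
x_rec = np.fft.ifft(np.fft.fft(y) / np.fft.fft(k)).real
assert np.allclose(x_rec, x, atol=1e-6)
```

With the exact kernel and no noise this recovers the signal essentially perfectly; with a wider kernel or a slightly perturbed one, the division becomes unstable, for the reasons discussed next.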

What is going on here? A quick look at the discrete Fourier transform of the PSF gives us the answer. Recall that the Fourier transform of a real signal is actually complex, so below we plot the absolute value of the Fourier transform on a logarithmic scale. For reference we also plot the Fourier transforms of the original and blurred signals.

We see that the Fourier transform of the kernel has many values close to \(0\). This means that dividing by such a signal is not numerically stable. Indeed if we slightly perturb either the kernel \(k\) or the blurred signal \(y\), we can end up with strange results, as seen above.

If we want to do deconvolution, we clearly need something more numerically stable than the naive algorithm of dividing the Fourier-transformed signals. This means putting some kind of regularization that makes the solution look more natural. Above, our main problem is that the Fourier transform of \(k\) has values close to zero, so one thing we can try is to add a small number to \(\mathcal F(k)\) before division. One problem here is that \(\mathcal F(k)\) is complex, so it’s not immediately clear how to add a number to make it nonzero. However, note that we can write

\[\frac{1}{\mathcal F(k)} = \frac{\mathcal F(k)^*}{\mathcal F(k)\mathcal F(k)^*} = \frac{\mathcal F(k)^*}{|\mathcal F(k)|^2}\]In this formula any numerical instability is coming from the division by \(|\mathcal F(k)|^2\). This is always a positive real number, so we can move it away from zero by adding a constant. This gives us the following formula for deconvolution:

\[x = \mathcal F^{-1}\left(\mathcal F(y) \cdot \frac{\mathcal F(k)^*}{|\mathcal F(k)|^2+S}\right),\]where \(S>0\) is a regularization constant. Let’s see how well this works for different values of \(S\):

If you look closely, the image looks best for \(S=10^{-8}\). For lower and higher values we see a ringing effect, particularly noticeable in the portion of the image occupied by the sky. Visually the best deconvolved image looks indistinguishable from the original. However, if we look at the discrete Fourier transforms of the same images, they actually look quite a bit different (the difference is, however, exaggerated by the logarithmic scale). There are significant artifacts remaining from the near-zero values of the Fourier transform of the PSF.
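The regularized division above can be sketched in a few lines of numpy. Here I use a deliberately well-conditioned toy kernel so the recovery can be checked numerically; real Gaussian kernels behave much worse, as discussed above:

```python
import numpy as np

def deconvolve_reg(y, k_fft, S):
    """Regularized Fourier deconvolution with constant S > 0."""
    y_fft = np.fft.fft2(y)
    x_fft = y_fft * np.conj(k_fft) / (np.abs(k_fft) ** 2 + S)
    return np.fft.ifft2(x_fft).real

rng = np.random.default_rng(0)
x = rng.random((64, 64))

# a mild, well-conditioned circular blur kernel (mostly a delta, plus a
# little averaging), chosen so that F(k) stays far away from zero
k = np.zeros((64, 64))
k[0, 0] = 0.6
k[0, 1] = k[1, 0] = k[0, -1] = k[-1, 0] = 0.1
k_fft = np.fft.fft2(k)

y = np.fft.ifft2(np.fft.fft2(x) * k_fft).real     # circularly blurred image
x_rec = deconvolve_reg(y, k_fft, S=1e-8)
assert np.allclose(x_rec, x, atol=1e-4)
```

Increasing \(S\) trades reconstruction accuracy for robustness against noise and kernel misestimation.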

Given that I do research in numerical linear algebra, it might be interesting to cast the deconvolution problem into linear algebra. Note that we’re essentially solving the minimization problem

\[\min_x \|k*x-y\|^2\]Since \(k*x\) is linear in all the entries of \(x\), we can actually write this as matrix
multiplication \(k*x = Kx\), where \(K\) is the *convolution matrix*. For one-dimensional
convolution with a kernel \(k\) this matrix is \(K_{ij} = k[i-j]\). Using the convolution matrix we
can turn deconvolution into a linear least-squares problem, and deconvolution using Fourier
transforms gives the exact minimizer of this problem. The reason this exact solution becomes garbage
as soon as we slightly perturb \(y\) or \(k\) is because the matrix \(K\) is very ill-conditioned.
The *condition number* of a matrix \(K\) tells us how much any numerical errors in a vector \(b\)
can get amplified if we’re trying to solve the linear system \(Kx = b\).

Fortunately there are ways to deal with ill-conditioned systems through regularization. There are a number of regularization techniques, but in our case this isn’t immediately helpful because of the size of the matrix \(K\). If we consider an \(n\times m\) image, then the matrix \(K\) is of size \(nm\times nm\). For example if we have a \(1024\times 1024\) image then the image requires on the order of 1MB of memory, but the matrix \(K\) would take up on the order of 1TB of memory! Obviously that will not fit in the memory of a typical home computer, so working directly with the matrix \(K\) is completely infeasible. Moreover, while the matrix \(K\) has a lot of structure, it is not sparse, so we cannot store it as a sparse matrix either.

Nevertheless computing a matrix product \(Kx\) is cheap, since it’s just convolution. There are good linear solvers that only need matrix-vector products, without ever forming the matrix explicitly. These are usually iterative Krylov subspace methods. Fortunately scipy has several such solvers, and out of those implemented there it seems that the LGMRES (Loose Generalized Minimal Residual Method) solver works best for this particular problem. Even without regularization this produces decent results. Nevertheless, it’s a bit finicky to get working well, and on my machine the deconvolution takes a full minute, as opposed to a few milliseconds for FFT-based deconvolution.
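A sketch of this matrix-free approach with `scipy.sparse.linalg.lgmres`, again using circular convolution and a well-conditioned toy kernel so that the solver converges quickly (an illustration of the idea, not the exact setup used for the church image):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lgmres

n = 32                                    # a small n x n test "image"
rng = np.random.default_rng(0)
x_true = rng.random((n, n))

# well-conditioned circular blur kernel (mostly delta, plus a little averaging)
k = np.zeros((n, n))
k[0, 0] = 0.6
k[0, 1] = k[1, 0] = k[0, -1] = k[-1, 0] = 0.1
k_fft = np.fft.fft2(k)

def matvec(v):
    """K @ v, implemented as circular convolution via the FFT."""
    return np.fft.ifft2(np.fft.fft2(v.reshape(n, n)) * k_fft).real.ravel()

K = LinearOperator((n * n, n * n), matvec=matvec)
y = matvec(x_true.ravel())                # the blurred image, flattened

x_rec, info = lgmres(K, y, atol=1e-10)
assert info == 0                          # the solver converged
assert np.allclose(x_rec, x_true.ravel(), atol=1e-3)
```

The key point is that `LinearOperator` only needs the matrix-vector product, so the \(nm\times nm\) matrix \(K\) is never formed.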

We can undo blurring caused by convolution if we know the point spread function. Naively performing deconvolution using discrete Fourier transforms is not numerically stable, but we can improve the numerical stability. Nevertheless, unless we know the point-spread function with very high precision, the result is not perfect, as is evident from the Fourier transforms.

In the next part we will start with blind deconvolution. In that case we don’t know the point spread function, so we need to deconvolve with a number of different kernels and iterate towards an approximation of the true PSF. The biggest problem at hand is to have an objective that tells us which deconvolved image ‘looks more natural’. It is not clear a priori what the best way to measure this is, and we will look at several approaches to this problem. Then in the final part we will try one or two algorithms of blind deconvolution.

Fortunately obtaining a time series of your email traffic is very easy. You can download a .mbox
file with all your emails. Such a file can easily be processed using the `mailbox` package in the
Python standard library. I made a short script that loads a .mbox email archive and extracts some
metadata for all the emails, including the time at which it was sent. Maybe I’ll use the other
metadata for some other project sometime, but for now let’s focus on the timestamps of when each
email was sent.
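The extraction boils down to something like this (a sketch; the example message and address are made up):

```python
import mailbox
from email.utils import parsedate_to_datetime

def email_timestamps(path):
    """Return the sent time of every message in an .mbox archive."""
    times = []
    for msg in mailbox.mbox(path):
        date = msg["Date"]           # the Date: header, as a string
        if date is not None:
            times.append(parsedate_to_datetime(date))
    return times

# tiny demonstration: write a one-message mbox file and parse it back
with open("demo.mbox", "w") as f:
    f.write("From alice@example.com Thu Aug  1 10:00:00 2019\n"
            "From: alice@example.com\n"
            "Date: Thu, 01 Aug 2019 10:00:00 +0200\n"
            "Subject: hi\n"
            "\n"
            "hello\n")

times = email_timestamps("demo.mbox")
assert len(times) == 1 and times[0].year == 2019
```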

By looking at specific components of the time series we can discover some basic trends. In principle we can model trends as a sum of trends on different timescales. For example, the entire time series has components on the following timescales:

- Time of day
- Day of week
- Time of year
- Global (non-periodic) trends

We can look at these separately, but a more accurate model would model them all at the same time. Gelman et al. describe how to do this using Bayesian statistics, and it would be good to try adapting their methods, but for now we’ll just use a package instead.

We can get a useful time series by counting the total number of emails received each day. Plotting this time series directly is however not very useful, because it is extremely noisy. To look at patterns in the data we need to smooth it. This is done by applying some kind of low-pass filter, and there are many choices of filter. A very popular choice is a rolling mean, but I personally prefer a Gaussian filter since the final result looks smoother. In the signal processing literature people would prefer filters such as a Butterworth filter. At the end of the day we’re mainly using the filters for plotting, so the exact choice isn’t too important.
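The smoothing itself is a one-liner with `scipy.ndimage.gaussian_filter1d`; here is a sketch on synthetic Poisson-distributed daily counts (the rates are made up for illustration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)
days = np.arange(2000)
# synthetic daily email counts: a yearly cycle plus Poisson noise
rate = 6 + 2 * np.sin(2 * np.pi * days / 365)
counts = rng.poisson(rate).astype(float)

smooth_60 = gaussian_filter1d(counts, sigma=60)   # long-term trend
smooth_15 = gaussian_filter1d(counts, sigma=15)   # shorter fluctuations
assert smooth_60.shape == counts.shape
assert smooth_60.std() < smooth_15.std() < counts.std()  # more smoothing, less variance
```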

Below is a plot of the email time series with a Gaussian filter with a standard deviation of 60 days (blue) and 15 days (gray). We can see that I receive about 4-8 emails per day on average. This does not include any spam, since those emails eventually get deleted and are therefore not in the email archive. We can see a significant spike in activity in 2010, and an increasing trend over the past couple of years. We can also see a lot of local fluctuations, and as we shall see, these can largely be attributed to a fairly regular seasonal variation.

Unsurprisingly, I receive fewer emails on the weekend. Interestingly, emails are nearly equally common on Tuesday through Friday, but less common on Mondays.

Below is a plot of seasonal trends; the blue line is smoothed with a standard deviation of 7 days, the gray dots with 1 day. We can see two dips, one around New Year and one in summer, both times of vacation. There are also monthly oscillations, and a peak before and after summer, and before the winter holidays. I don’t have a satisfying explanation for this.

The daily trend shows some very clear patterns as well. Here the blue line is smoothed with a standard deviation of 15 minutes, and the gray line with 3 minutes; all times are in UTC.

We can clearly see that most activity is concentrated between 9:00 and 15:00. We then see two decreases, at around 15:00 and 17:00. The first probably corresponds to the end of the working day (during summer time in the Netherlands / Switzerland); the second drop may also correspond to the end of the working day, but for emails whose timestamps lack timezone information. We then see reduced activity, which tapers off even further from about 21:00 onward. This may correspond in part to emails sent during the American working day, and in part to emails sent in the European evening. Then finally there is very low activity during the night, between roughly 23:00 and 5:00.

On the gray curve we can also see a peak at each hour mark; these are probably caused by emails scheduled to go out at a particular time.

Rather than looking at each timescale separately as we have done so far, we can model the different time scales at the same time in an additive model. In a simple model we will model our signal \(f(t)\) as

\[f(t) = f_{\mathrm{week}}(t)+f_{\mathrm{year}}(t)+f_{\mathrm{trend}}(t)+\epsilon(t)\]where the first term has a 7-day period, the second term a 365 (or 366) day period, and the third term is only allowed to change slowly (e.g. once every few months). Finally we assume a Gaussian noise term \(\epsilon(t)\) for the residuals of our model; its magnitude is not assumed to be constant, but it is always centered at 0. All of the components in our model can be taken to be Gaussian processes (even the magnitude of the noise). The details of Gaussian processes and how to fit them are perhaps nice for another blog post, but for the time being we will use a package to do all the work for us. We will be using Prophet, which is developed by Facebook. Its main use is predicting the future of time series, but it also works fine just for modeling time series.
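Prophet does the actual fitting, but the additive idea can be sketched crudely with plain group-by averages (this is not the Gaussian-process model, just an illustration of splitting a signal into trend, weekly, and yearly parts; the synthetic numbers are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", periods=1500, freq="D")
weekly = np.where(dates.dayofweek < 5, 1.0, -2.5)          # fewer emails on weekends
yearly = 1.5 * np.sin(2 * np.pi * dates.dayofyear / 365)   # seasonal variation
y = 6 + weekly + yearly + rng.normal(0, 1, len(dates))
df = pd.DataFrame({"y": y}, index=dates)

# slowly varying global trend
trend = df["y"].rolling(180, center=True, min_periods=90).mean()
detrended = df["y"] - trend

# weekly component: mean of the detrended series per day of the week
f_week = detrended.groupby(df.index.dayofweek).mean()
# yearly component: mean of the remainder per month (very coarse)
remainder = detrended - f_week.reindex(df.index.dayofweek).to_numpy()
f_year = remainder.groupby(df.index.month).mean()

# the weekday/weekend gap of 3.5 emails should be roughly recovered
gap = f_week.iloc[:5].mean() - f_week.iloc[5:].mean()
assert 2.5 < gap < 4.5
```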

The resulting model seems quite similar to what we have already discovered previously. The global trend in particular is a bit less oscillatory, but the weekly and seasonal trends are nearly identical.

Next we can wonder how accurate this model is. The model assumes that the noise, and hence the residuals, are normally distributed. Let’s try to see how well this assumption holds up by analyzing the distribution of the residuals. In a normal distribution, a distance of 1 standard deviation to the mean corresponds to the quantiles of 0.159 and 0.841 respectively. And similarly a distance of 2 standard deviations from the mean corresponds to quantiles of 0.023 and 0.977 respectively. Finally, the median and mean should coincide. We can therefore compute these quantiles in a rolling fashion, and normalize by dividing by the standard deviation. If the residuals are normally distributed, these rolling normalized quantiles should stay close to horizontal integer lines.

Below we plotted just that, with a rolling window of 200 days. We can see that the rolling median and the rolling quantiles corresponding to one standard deviation both correspond well to a normal distribution. We do see a bit of deviation between 2009 and 2011, which is likely caused by the sudden spike around the start of 2010, which seems to be a bit of an outlier.
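This rolling-quantile check can be sketched on synthetic normally distributed residuals, where the normalized quantiles should indeed hover around \(\pm 1\):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
resid = pd.Series(rng.normal(0, 3, 5000))   # stand-in for the model residuals

window = 200
rolling_std = resid.rolling(window).std()
q_hi = resid.rolling(window).quantile(0.841) / rolling_std
q_lo = resid.rolling(window).quantile(0.159) / rolling_std

# for normally distributed residuals these hover around +1 and -1
assert abs(q_hi.mean() - 1.0) < 0.15
assert abs(q_lo.mean() + 1.0) < 0.15
```

Systematic deviation of these curves from the integer lines is exactly the kind of non-normality the text describes.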

The 2-standard-deviation rolling quantiles seem skewed towards bigger values, however. This is because there are many days with very large spikes in email traffic, and the global distribution of email traffic is not symmetric either. Furthermore, we are dealing with strictly positive data (I can’t receive a negative number of emails), which in itself means that the residuals of any model are not going to be normally distributed. The model’s assumptions are therefore invalid, and a more accurate model would make more accurate assumptions about the distribution of the residuals.

However, assuming normality of the residuals tends to make computations much easier, and a model with a more accurate noise model might be difficult to fit, especially on large amounts of data. I might try to do this in a future blog post. We are dealing here with count data (namely the number of emails received each day). Such data is often modeled by a Poisson distribution rather than a normal distribution. The main assumption of a Poisson distribution is that the arrival events of emails are all independent. This is probably not the case, but we can check how well this assumption holds up either way.

Finally, let’s try to get a deeper understanding of the time series by considering the distribution of time between consecutive emails. Having a good understanding of this can help to model the time series better. If we model the arrival times of all emails as independent, except for a global variation in rate, we are naturally led to model the time \(T\) between consecutive emails by an exponential distribution:

\[T_t\sim \mathrm{Exp}(\lambda(t)),\]where \(\lambda(t)\) is a rate parameter that depends on time, since we already established that the rate at which we receive emails is not constant over time.

If we divide an exponential distribution by its mean, it will always be an exponential distribution with unit rate. We can use this to obtain a plot similar to the plot of the residuals. We will divide the time series of time between consecutive emails by a rolling mean, and then we will plot the rolling quantiles of the resulting data. These can then be compared to the quantiles of a standard exponential distribution.

This is done in the plot below, and we can clearly see that the distribution of time between consecutive emails is not exponential. The distribution is much more concentrated at low values than expected from an exponential distribution. It also seems to have a somewhat longer tail than predicted by an exponential distribution (although this is harder to see in this plot). This is because emails are not independent. For instance, if you’re having an active conversation with someone you might get a lot of emails in a short amount of time, but most of the time emails come in at a slower rate. Furthermore, there are quite a number of times that emails arrive at the exact same second, which should have very low probability under an exponential model.
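To see what a *matching* distribution would look like, we can run the same normalization on synthetic exponential data and compare against the exact quantiles \(-\log(1-q)\) of a unit-rate exponential:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
gaps = pd.Series(rng.exponential(scale=2.0, size=5000))  # synthetic inter-arrival times

# normalize by a rolling mean, as we would for the real (non-stationary) data
normalized = gaps / gaps.rolling(200).mean()

# quantile q of a unit-rate exponential distribution is -log(1 - q)
for q in (0.25, 0.5, 0.75):
    rolling_q = normalized.rolling(200).quantile(q).mean()
    assert abs(rolling_q - (-np.log(1 - q))) < 0.15
```

On the real email data these rolling quantiles deviate markedly from the exponential ones, which is the mismatch described above.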

One can try fitting different distributions to this data. For example a gamma distribution has a better fit, but still does not properly model the probabilities of very small time intervals. Perhaps a mixture of several gamma distributions would fit the distribution of the data well, but this kind of distribution is hard to interpret. A good statistical model should have a good theoretical justification as well.

We conclude the analysis of this email time series for now. I can’t say that I have learned anything useful about my own email traffic, but the analysis itself was very interesting to me. It can be interesting to dive into data like this and really try to understand what’s going on. To not only model the data (which could be useful for predictions), but to also dive deeper into the shortcomings of the model. I will hopefully get back to this time series and come up with a more accurate model that makes more realistic assumptions about the data. The only way to come up with such models is to first understand the data itself better.
