Rust.
That’s right; grab your sunglasses. We’re about to join the cool kids by re-implementing our code in Rust.
At my workplace, our primary product uses multiple cameras to track people in 3D in real-time. Each camera is designed to detect keypoints—think head, feet, knees, and so on—for each person in view, multiple times per second. These detections are then sent to a centralized application. Since we know the spatial positioning of the cameras, we can triangulate the position of these keypoints by integrating data from multiple cameras.
To grasp how we consolidate information from various cameras, let’s first look at what we can glean from just one camera. If we detect, for instance, a person’s nose in a camera frame (a 2D image), we know the direction in which the nose points from the camera’s perspective—but we can’t determine how far away it is. In geometric terms, we have a line of possible locations for the person’s nose in the 3D world.
We could make an educated guess about the distance based on the average human size, but that’s difficult, error-prone, and imprecise. A more accurate approach is to use multiple cameras. Each camera provides a line of possible locations for a keypoint; thus, by using just two cameras, we can find the intersection of those lines to pinpoint the keypoint’s 3D location.
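To make the least-squares idea concrete, here is a sketch of how the closest point to several 3D lines could be computed. The `triangulate` helper and the example lines are made up for illustration; they are not from our actual product.

```python
import numpy as np

def triangulate(origins: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Least-squares point closest to a set of 3D lines.

    Line i passes through origins[i] with unit direction directions[i];
    we minimize the sum of squared perpendicular distances to all lines.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        # Projector onto the plane perpendicular to the line direction
        M = np.eye(3) - np.outer(d, d)
        A += M
        b += M @ o
    return np.linalg.solve(A, b)

# Two lines that intersect exactly at (1, 2, 3):
origins = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 3.0]])
directions = np.array([[1.0, 2.0, 3.0], [-4.0, 2.0, 0.0]])
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
print(triangulate(origins, directions))  # ≈ [1. 2. 3.]
```

With noisy real detections the lines never intersect exactly, but the same solve returns the point minimizing the (unweighted) squared distances; per-line weights would turn this into the weighted least squares mentioned below.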
However, this approach is not foolproof; several sources of uncertainty can compromise the accuracy:
Adding more cameras could alleviate some of these issues. More lines mean a better intersection point, which can be found by minimizing a (weighted) least squares formula. But if you are like me, then you’re probably just about ready to scream the true solution:
Let’s go Bayesian
In our situation, a Kalman filter can essentially be understood in these terms:
We start with an estimate of both the position and velocity of a keypoint, along with a certain degree of uncertainty.
As time progresses, this estimate changes—especially if we have an idea of the direction in which the keypoint is moving. However, the uncertainty associated with both the position and velocity will invariably increase. To illustrate, if I see someone moving and then close my eyes, I might guess their location 100 milliseconds later. But if you ask about their location or speed next Tuesday, I’d be clueless.
Each new observation serves to update our existing knowledge about the keypoint’s position and velocity. Due to the inherent imprecision of all observations, the updated estimates are a blend of our previous estimate and the new observation.
Formalize these steps and toss in some Gaussians, and you’ve got yourself a Kalman filter! We’re going to dive into the specifics right now, but if you’d rather not get into the math, feel free to skip ahead to the </math>
tag.
<math>
The Kalman filter can be thought of as a two-step process: the prediction step and the update step.
To warm up, let’s focus on how to model a single, static keypoint. Imagine we have a position \(x\in\mathbb R^3\) along with a \(3\times 3\) covariance matrix \(P\). These represent the mean and variance of a random variable \(X(0)\) at the initial time \(t=0\).
As time passes, our estimated position \(x\) remains unchanged, but the covariance matrix \(P\) should grow. To model this, we can turn to Brownian motion and write \(X(t) = X(0) + \eta W(t)\), where \(W(t)\) represents a Wiener process (or Brownian motion), and \(\eta>0\) is the noise level. The expectation is \(\mathbb E(X(t))=\mathbb E(X(0))=x\), while the variance evolves according to:
\[\mathrm{Var}(X(t))= \mathrm{Var}(X(0))+\mathrm{Var}(W(t))=P+\eta tI.\]Things get more complicated when we introduce a velocity parameter \(v\in \mathbb R^3\). Viewing this from the perspective of a stochastic differential equation (SDE), the original equation was
\[\mathrm dX(t) = \eta \mathrm dW(t).\]Adding in the velocity \(Y(t)\) this then becomes
\[\begin{cases} \mathrm dX(t) = Y(t)\mathrm dt+\eta_x \mathrm dW(t) \\ \mathrm dY(t) = \eta_v \mathrm dW(t). \end{cases}\]We can integrate this by first integrating \(Y(t)\) to obtain
\[Y(t) = Y(0)+\eta_v W(t),\]which we can then substitute to get
\[\mathrm dX(t) = \left(Y(0)+\eta_vW(t)\right)\mathrm dt+\eta_x\mathrm dW(t).\]Finally, integrating this we obtain
\[X(t) = X(0)+tY(0)+\eta_xW(t)+\eta_v\int_0^t\!W(s)\,\mathrm ds\]The last term looks intimidating, but with some standard mathematical tricks (Itô’s lemma), we find:
\[\mathbb E\left[\int_0^t\!W(s)\,\mathrm ds\right] = 0,\qquad\mathrm{Var}\left[\int_0^t\!W(s)\,\mathrm ds\right]=t^3/3\]In summary, our predict step can be represented as:
\[\begin{align*} \hat x_k&\leftarrow x_{k-1}+v_{k-1}t \\ \hat v_k&\leftarrow v_{k-1} \\ \hat P_{x,k}&\leftarrow P_{x,k-1}+t^2P_y+\eta_xtI+(\eta_vt^3/3)I\\ P_y&\leftarrow P_y+\eta_vtI \end{align*}\]While the math wasn’t overly complicated, the intricacy quickly scales with more complex stochastic differential equations. Analytical integration may not always be feasible, making numerical or Monte-Carlo methods an alternative, albeit costly, approach.
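As a quick sanity check on the \(t^3/3\) term, we can estimate the variance of \(\int_0^t\!W(s)\,\mathrm ds\) with a small Monte-Carlo simulation. This is a throwaway numpy sketch, not part of any real implementation:

```python
import numpy as np

# Monte-Carlo check of Var[integral of W(s) ds over [0, t]] = t^3/3, here with t = 2.
rng = np.random.default_rng(0)
t, n_steps, n_paths = 2.0, 1000, 5000
dt = t / n_steps

# Simulate Wiener paths as cumulative sums of N(0, dt) increments.
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)

# Riemann-sum approximation of the time integral of each path.
integrals = paths.sum(axis=1) * dt

print(np.var(integrals))  # ≈ t^3/3 ≈ 2.67
```

The empirical mean sits near zero and the variance near \(t^3/3\), matching the formula above.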
Fortunately, if our prediction is linear, we can simplify things considerably. In our case, the function \((x,v)\mapsto (x+vt,v)\) is indeed linear. Let’s denote this map by \(F_t\). Now we encapsulate \((x_k,v_k)\) together into a single variable \(x_k\in\mathbb R^6\), with the covariance \(P\) being a \(6\times 6\) matrix. Ignoring the \((\eta_vt^3/3)I\) term, the prediction step simplifies to:
\[\begin{align*} \hat x_k&\leftarrow F_t\overline{x_{k-1}}\\ \hat P_k&\leftarrow F_tP_{k-1}\,F_t^\top + Q, \end{align*}\]where \(Q=\mathrm{diag}(\eta_x,\eta_x,\eta_x,\eta_v,\eta_v,\eta_v)\). We use \(\overline{x_{k-1}}\) for the estimate of the state at step \(k-1\), acknowledging that we don’t know the actual value. As long as our model function \(F_t\) is linear, this approximation will suffice. If \(F_t\) is nonlinear, then we’re in trouble, but more on that later.
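In code, this linear predict step could look something like the following numpy sketch for our constant-velocity model (function and variable names are made up for illustration):

```python
import numpy as np

def predict(x: np.ndarray, P: np.ndarray, dt: float, Q: np.ndarray):
    """Linear predict step for a constant-velocity model.

    State x = (position, velocity) in R^6; F_t maps (p, v) to (p + v*dt, v).
    """
    F = np.eye(6)
    F[:3, 3:] = dt * np.eye(3)  # position += velocity * dt
    x_hat = F @ x
    P_hat = F @ P @ F.T + Q     # uncertainty grows by the process noise Q
    return x_hat, P_hat

x = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # at the origin, moving along the x-axis
P = np.eye(6)
x_hat, P_hat = predict(x, P, dt=0.1, Q=0.01 * np.eye(6))
print(x_hat[:3])  # position advanced to [0.1, 0, 0]
```

Note how the covariance can only grow in this step; shrinking it is the job of the update step.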
After the prediction step, we move from an initial state \(x_{k-1}\) to an estimated \(\hat x_k\). This is an estimate of the true state \(x_k\) of the system at time \(t_k\), based on our estimate at time \(t_{k-1}\). At the same time, we observe the system and obtain a measurement \(z_k\).
Unfortunately the measurement \(z_k\) does not live in the same space as the state \(x_k\). For example, \(z_k\) might represent a 2D observation while \(x_k\) represents a position+velocity in 3D.
To go from the ‘model space’ to the ‘measurement space’ we introduce a matrix \(H\). In the context of tracking a moving keypoint, for instance, this is a \(3\times 6\) matrix given blockwise as \((I_3\,\mathbf 0)\), where \(\mathbf 0\) is a \(3\times 3\) matrix of zeros. This matrix simply ‘forgets’ the velocity and keeps only the position.
Our goal is now to derive an estimate \((\overline x_k, P_k)\) for the state \(x_k\) at time \(t_k\) by incorporating both the previous estimate \(\hat x_k\) and the observation \(z_k\). In order to do that, we lean on 3 assumptions:
The last assumption is not really an assumption of the model, but more a convenience to make the math work out. With this assumption we can derive an ‘optimal’ estimate of the true state \(x_k\). None of these assumptions are realistic. For instance, in our simple Bayesian analysis of the prediction step we already noted that the prediction step used by the Kalman filter is an approximation. The second point assumes that the measurement error is exactly Gaussian, which is rarely true in practice. Nevertheless, these assumptions make for a good model that can actually be used for practical computations.
Deriving the optimal value of the Kalman gain \(K\) is a little tricky, but we can relatively easily use the assumptions above to find a formula for the new error estimate \(P_k\). We can define \(P_k\) as \(\mathrm{cov}(\overline x_k - x_k)\). Then we make several observations:
The rest is a straightforward computation using properties of covariance:
\[\begin{align*} P_k &= \mathrm{cov}(\overline x_k - x_k)\\ &=\mathrm{cov}\left(\hat x_k -x_k + K(z_k-H\hat x_k) \right) \\ &=\mathrm{cov}\left(\hat x_k -x_k + K(Hx_k-H\hat x_k+v) \right) \\ &=\mathrm{cov}\left((I-KH)(\hat x_k -x_k) + Kv \right) \\ &=\mathrm{cov}((I-KH)(\hat x_k -x_k)) + \mathrm{cov}(Kv) \\ &=(I-KH)\hat P_k(I-KH)^\top + KRK^\top \end{align*}\]Now, let’s tackle the question of how to calculate the optimal value for the Kalman gain. To do this, we first need to clarify what “optimal” means in this setting: it is the value of \(K\) that minimizes the expected residual squared error \(\mathbb{E}\|\overline x_k-x_k\|^2\). Using some tricks from statistics and matrix calculus, it turns out that the optimal Kalman gain takes the following form:
\[K = \hat P_kH^\top S_k^{-1},\]where
\[S_k = H\hat P_kH^\top + R,\]which is just the covariance matrix of the residual \(z_k-H\hat x_k\).
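Putting the update equations together, a minimal linear update step could be sketched like this in numpy (the names are illustrative only; the covariance update uses the \((I-KH)\hat P_k(I-KH)^\top + KRK^\top\) form derived above):

```python
import numpy as np

def update(x_hat, P_hat, z, H, R):
    """Linear Kalman update step."""
    S = H @ P_hat @ H.T + R              # residual covariance
    K = P_hat @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x_hat + K @ (z - H @ x_hat)      # blend prediction and observation
    I_KH = np.eye(len(x_hat)) - K @ H
    P = I_KH @ P_hat @ I_KH.T + K @ R @ K.T
    return x, P

# Observe only the position of a 6-dim (position, velocity) state.
H = np.hstack([np.eye(3), np.zeros((3, 3))])
x_hat = np.zeros(6)
P_hat = np.eye(6)
z = np.array([1.0, 0.0, 0.0])
x, P = update(x_hat, P_hat, z, H, np.eye(3))
print(x[:3])  # halfway between the prediction (0) and the measurement (1)
```

With equal prior and measurement covariances, the estimate lands exactly halfway between prediction and observation, which is a nice way to see the ‘blending’ at work.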
In summary, using several assumptions we can combine the prior estimate \(\hat x_{k}\) and an observation \(z_k\) to get an improved estimate \(\overline x_k\). This is all a Kalman filter is: we alternate predict and update steps to always have an up-to-date estimate of the model’s state.
To derive the prediction and update steps of the Kalman filter we had to make one ‘stinky’ assumption. It has nothing to do with the approximation we made in the prediction step, or the 3 assumptions I listed in the update step. Rather it’s the assumption that the transition map \(F_t\) and the measurement map \(H\) are linear maps.
In the case of a moving keypoint both \(F_t\) and \(H\) were linear. However, this is not always the case in more complex systems. Take camera observations as an example. Even ignoring lens distortion, the measurement \(H\) is still a projective transformation, not a linear one. To understand why, consider the fact that if we move a point very close to a camera 1 cm to the left, it might move many pixels in the image. In contrast, if that same point is several meters away, then the same movement will result in a change of 1-2 pixels at best.
So why do we need the linearity assumption? Simply put, if \(x\) is Gaussian then so are \(F_tx\) and \(Hx\), but this is only true for linear maps. One can approximate \(F_tx\) and \(Hx\) by a Gaussian using a linear approximation of the functions \(F_t\) and \(H\). However, this has its downsides: a) the approximation may be inaccurate, and b) computing the linear approximation requires computing the Jacobian, which may be challenging and computationally expensive.
In summary, for a non-linear function \(f\) and a Gaussian \(x\sim N(\mu,\Sigma)\), we need an effective way to estimate the mean \(\mathbb E(f(x))\) and covariance \(\mathrm{cov}(f(x))\). One method that always works is Monte-Carlo estimation: we simply take lots of random samples \((x_1,\ldots,x_N)\) of \(x\), and then compute the mean and covariance of \((f(x_1),\ldots,f(x_N))\). The only issue is that we need many samples to get an accurate estimate.
Why rely on random sampling to estimate the mean and covariance of \(f(x)\) when we can use deterministic sampling and pick a robust set of samples \((s_1,\dots,s_N)\)? These points are called sigma points. Using them, we can estimate the mean of \(f(x)\) through a weighted mean \(\mu=\sum_{i=1}^Nf(s_i)W^a_i\). To estimate the covariance we use \(\mu\) and a second set of weights to get \(\Sigma = \sum_i W_i^c (f(s_i)-\mu)(f(s_i)-\mu)^\top\). This method is known as the unscented transform (UT). It takes the mean and covariance of a Gaussian \(X\) and uses sigma points to estimate the mean and covariance of the transformed variable \(f(X)\).
So, how do we pick these sigma points \(s_i\) and the associated weights \(W^a\) and \(W^c\)? Technically, any set of points could work, as long as we recover the original mean and covariance when \(f\) is the identity function. However, most people use a particular algorithm developed by Van der Merwe. This algorithm uses three parameters \(\alpha, \beta, \kappa\). Given input mean and covariance \((\mu,\Sigma)\), the algorithm defines:
\[s_i = \begin{cases} \mu & (i=0); \\ \mu + \left[\sqrt{(n+\lambda)\Sigma}\right]_i & (i=1,\ldots,n); \\ \mu - \left[\sqrt{(n+\lambda)\Sigma}\right]_{i-n} & (i=n+1,\ldots,2n). \end{cases}\]Here, \(n\) is the dimension of \(\mu\) and \(\lambda := \alpha^2(n+\kappa)-n\). Note the use of the matrix square root, which is well-defined for symmetric positive semidefinite matrices (i.e. covariance matrices). You can calculate the matrix square root using, for example, the singular value decomposition or the Cholesky decomposition. Here are the equations for the weights:
\[\begin{align*} W_0^a &=\frac{\lambda}{n+\lambda} \\ W_0^c &=\frac{\lambda}{n+\lambda} +1-\alpha^2+\beta\\ W_i^a=W_i^c &=\frac{1}{2(n+\lambda)} \qquad (i=1,\ldots,2n) \end{align*}\]Since these weights don’t depend on \(\mu,\Sigma\), we only have to compute them once.
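A direct numpy translation of the sigma point and weight formulas above might look like this (the helper name and parameter defaults are made up for illustration; libraries such as filterpy expose something similar):

```python
import numpy as np

def merwe_sigma_points(mu, Sigma, alpha=0.5, beta=2.0, kappa=0.0):
    """Van der Merwe sigma points and weights for a Gaussian (mu, Sigma)."""
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n
    # Matrix square root of (n + lam) * Sigma via Cholesky; rows are the
    # offsets used in the formula above.
    U = np.linalg.cholesky((n + lam) * Sigma).T
    points = np.vstack([mu, mu + U, mu - U])  # 2n + 1 points

    Wa = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wa.copy()
    Wa[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + 1.0 - alpha**2 + beta
    return points, Wa, Wc

# Sanity check: for the identity function, the weighted mean of the sigma
# points recovers the original mean.
mu, Sigma = np.array([1.0, 2.0]), np.diag([0.5, 2.0])
points, Wa, Wc = merwe_sigma_points(mu, Sigma)
print(Wa @ points)  # recovers mu
```

Note that \(W^a_0\) can be negative (it is whenever \(\lambda<0\)); the weights still sum to one, which is all the mean estimate needs.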
In summary, the unscented transform provides a good estimate of the mean and covariance of \(f(X)\) at the cost of:
In return:
On top of that, the implementation is not difficult.
Finally, you may wonder where the name ‘unscented’ came from. As I alluded to, this is simply because its creator, Jeffrey Uhlmann, thought the algorithm was cool and “doesn’t stink”.
Armed with the unscented transform, you only need minor modifications to the Kalman filter algorithm to account for non-linear functions. Let’s first look at the predict step as it used to be: \[\begin{align*} \hat x_k&\leftarrow F_tx_{k-1}\\ \hat P_k&\leftarrow F_tP_{k-1}\,F_t^\top + Q \end{align*}\]
What happens here is that we have as input a Gaussian \((x,P)\), we then transform it with \(F_t\) to get a Gaussian \((F_tx,\, F_tPF_t^\top)\), and then finally we add some extra noise in the shape of \(Q\).
Instead of a linear map \(F_t\), we now have a non-linear ‘process model’ \(f_t\). We just have to swap the estimates of the mean and covariance with those provided by the unscented transform:
\[\begin{align*} \hat x_k&\leftarrow\sum_{i=1}^Nf_t(s_i)W^a_i,\\ \hat P_k&\leftarrow \sum_i W_i^c (f_t(s_i)-\hat x_k)(f_t(s_i)-\hat x_k)^\top + Q. \end{align*}\]Here, \(s_i\) are the sigma points derived from \((x_{k-1},P_{k-1})\). While this seems complex, we can rewrite it using the unscented transform \(\mathrm{UT}[f_t]\) to get:
\[\begin{align*} (\hat x_{k},\hat P_{k})\leftarrow \mathrm{UT}[f_t] (x_{k-1},P_{k-1})+(0, Q)\\ \end{align*}\]That’s not too bad! This also shows you could actually swap out the unscented transform for any method of estimating the mean/covariance of \(f_t(X)\).
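As a sketch, the whole unscented transform can be written in a few lines of numpy (illustrative names only; for a linear \(f\) the result should match \(F\mu\) and \(F\Sigma F^\top\) exactly):

```python
import numpy as np

def unscented_transform(f, mu, Sigma, alpha=0.5, beta=2.0, kappa=0.0):
    """Estimate mean and covariance of f(X) for X ~ N(mu, Sigma) via sigma points."""
    n = len(mu)
    lam = alpha**2 * (n + kappa) - n
    U = np.linalg.cholesky((n + lam) * Sigma).T
    points = np.vstack([mu, mu + U, mu - U])

    Wa = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wa.copy()
    Wa[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + 1.0 - alpha**2 + beta

    fs = np.array([f(s) for s in points])   # push sigma points through f
    mean = Wa @ fs
    diff = fs - mean
    cov = (Wc[:, None] * diff).T @ diff     # weighted sum of outer products
    return mean, cov

# For a linear f the UT is exact: here, the constant-velocity predict map.
dt = 0.1
f = lambda s: np.concatenate([s[:3] + dt * s[3:], s[3:]])
mu = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])
mean, cov = unscented_transform(f, mu, np.eye(6))
print(mean[:3])  # position advanced by dt * velocity
```

The UT-based predict step is then just this transform followed by adding \(Q\) to the covariance, exactly as in the formula above.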
Then what about the update step? We have our estimate \((\hat x_k,\hat P_k)\) and an observation \((z_k, R)\). Instead of a measurement matrix \(H\), we now have a measurement function \(h\). We start by applying the unscented transform to the estimate to put it in the measurement space:
\[(\mu_z,\Sigma_z)\leftarrow \mathrm{UT}[h](\hat x_k,\hat P_k)+(0,R)\]Then we calculate the ‘cross-covariance’ between \((\mu_z,\Sigma_z)\) and \((\hat x_k,\hat P_k)\). If \(\{s_i\}\) are the sigma points associated to \(\mathrm{UT}[h](\hat x_k,\hat P_k)\), then this cross-covariance \(P_{xz}\) is defined by:
\[P_{xz} = \sum_i W_i^c(s_i-\hat x_k)(h(s_i)-\mu_z)^\top\]The cross-covariance takes on the role of \(\hat P_kH^\top\) in the original Kalman filter, while \(\Sigma_z\) plays the role of \(S\). From here we find the Kalman gain as:
\[K=P_{xz} \Sigma_{z}^{-1}\]Finally, our new estimate \((x_k,P_k)\) becomes:
\[\begin{align*} x_k&\leftarrow \hat x_k+K(z_k-\mu_z)\\ P_k&\leftarrow \hat P_k-K\Sigma_{z}K^\top \end{align*}\]And so the unscented Kalman filter is born! This concludes the mathematical part of this blog post. Coming up next, we’ll dive into my implementation.
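For completeness, here is a self-contained numpy sketch of the unscented update step, checked against the linear case (all names are made up for illustration):

```python
import numpy as np

def ukf_update(x_hat, P_hat, z, R, h, alpha=0.5, beta=2.0, kappa=0.0):
    """Unscented update step: fold measurement z (noise R) into (x_hat, P_hat)."""
    n = len(x_hat)
    lam = alpha**2 * (n + kappa) - n
    U = np.linalg.cholesky((n + lam) * P_hat).T
    sigmas = np.vstack([x_hat, x_hat + U, x_hat - U])
    Wa = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wa.copy()
    Wa[0] = lam / (n + lam)
    Wc[0] = lam / (n + lam) + 1.0 - alpha**2 + beta

    # Push the sigma points through the measurement function.
    Z = np.array([h(s) for s in sigmas])
    mu_z = Wa @ Z
    dz = Z - mu_z
    Sigma_z = (Wc[:, None] * dz).T @ dz + R   # residual covariance
    dx = sigmas - x_hat
    P_xz = (Wc[:, None] * dx).T @ dz          # cross-covariance

    K = P_xz @ np.linalg.inv(Sigma_z)         # Kalman gain
    x = x_hat + K @ (z - mu_z)
    P = P_hat - K @ Sigma_z @ K.T
    return x, P

# Toy example: observe only the position part of a (position, velocity) state.
h = lambda s: s[:3]
x, P = ukf_update(np.zeros(6), np.eye(6), np.array([1.0, 0.0, 0.0]), np.eye(3), h)
print(x[:3])  # matches the linear Kalman filter: halfway to the measurement
```

Since \(h\) here is linear, the result agrees exactly with what the ordinary Kalman update would produce, which is a handy way to test an implementation.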
</math>
<code>
Let’s put the unscented Kalman filter to work on an actual problem. Not just to see Kalman filters in action, but also to understand why I needed to speed it up.
Let’s go back to the beginning of this post. We’re dealing with multiple cameras capturing a person’s movement. A machine learning algorithm is kind enough to give us the ‘keypoints’ on the skeleton of the person (e.g. nose, left elbow, right knee, etc.). Our mission is to turn these 2D pixel positions into 3D coordinates.
To paint a clearer picture, I made a little simulation of a keypoint moving around in 3D, projected down to the viewpoints of two cameras. Below you can see two plots showing what each of the two cameras sees. Thanks to the noise I added, the cameras don’t see a smooth curve at all. This is pretty much what you would get when applying machine learning pose detection algorithms to real footage too.
To model this problem we use a model dimension of \(6\) (that’s position plus velocity) and a measurement dimension of \(2\) (the pixels on a screen).
The measurement function takes two arguments: the position and an integer indicating the camera number (something we’ll get back to later). To turn this function into an object that Rust can understand, we use the @measurement_function
decorator. The model transition function is just \((x,\,v)\mapsto (x+v\Delta t,\,v)\) (again with a decorator to tell Rust what’s going on).
dim_x = 6
dim_z = 2

@measurement_function(dim_z)
def h_py(x: np.ndarray, cam_id: int) -> np.ndarray:
    pos = x[:3]
    if cam_id == 0:
        return cam1.world_to_screen_single(pos)
    elif cam_id == 1:
        return cam2.world_to_screen_single(pos)
    else:
        return np.zeros(2, dtype=np.float32)

@transition_function
def f_py(x: np.ndarray, dt: float) -> np.ndarray:
    position = x[:3]
    velocity = x[3:]
    return np.concatenate([position + velocity * dt, velocity])
Next we need something to generate sigma points, as well as something to do the heavy lifting of the unscented Kalman filter algorithm. The Kalman filter algorithm needs matrices \(Q\), \(R\) and \(P\), which we’ll need to set as well.
We actually only changed the value of \(Q\) from its default; for \(P\) and \(R\), identity matrices are fine in this case. If you have a really good theoretical model, you might be able to derive good values for \(Q\), \(R\) and \(P\). In practice though, you’re just going to have to fiddle around with them until it works.
sigma_points = SigmaPoints.merwe(dim_x, 0.5, 2, -2)
kalman_filter = UKF(dim_x, dim_z, h_py, f_py, sigma_points)
kalman_filter.Q = np.diag([1e-2] * 3 + [3e1] * 3).astype(np.float32)
kalman_filter.R = np.eye(2).astype(np.float32)
kalman_filter.P = np.eye(6).astype(np.float32)
Now we’re going to alternatively do a predict and update step from the point of view of either camera. To tell the Kalman filter which camera is making the observation we use the update_measurement_context
method. Whatever value we pass here is what’s going to get passed to the measurement function \(h\).
predictions_list = []
for p1, p2 in zip(proj_points_obs1, proj_points_obs2):
    kalman_filter.update_measurement_context(0)
    kalman_filter.predict(dt)
    kalman_filter.update(p1)
    predictions_list.append(kalman_filter.x)

    kalman_filter.update_measurement_context(1)
    kalman_filter.predict(dt)
    kalman_filter.update(p2)
    predictions_list.append(kalman_filter.x)

predictions = np.array(predictions_list)  # type: ignore
pos_predictions = predictions[:, :3]
Finally, we plot the result below. We see that the Kalman filter tracks the keypoint quite well in this problem. After taking a bit of time to settle, it even gives an accurate estimate of the velocity of the keypoint!
The implementation above was actually already fully in Rust, and is a lot faster than the Python library filterpy
which I based my implementation on. But maybe you noticed that I still used Python to define the measurement and transition functions. My Rust implementation actually calls these Python functions directly. This is fantastic if you are in a design stage (i.e., you don’t know precisely which functions you need), but calling these Python functions comes with significant overhead.
Fortunately it’s not a lot of work to make Rust implementations of these functions and use those instead. To make it more interesting, we will be tracking not just a single keypoint, but 30 keypoints in parallel. This is much closer to the actual use case because a skeleton consists of many keypoints.
Rather than looking at the accuracy, we’re just looking at speed.
Time per keypoint (Rust): 8.1 µs
Time per keypoint (Python): 120.4 µs
Rust speedup: 14.9x
That’s a 15x speedup just from swapping a Python function for a Rust one! Here I used UKFParallel
which takes a list of unscented Kalman filters as input and allows calling the update/predict functions for each Kalman filter in a batch. This is better than using a for loop in Python, since we can limit the time spent interacting with the GIL.
Originally the idea of UKFParallel
was to actually evaluate the predict/update functions in parallel. Unfortunately, even when the measurement and transition functions are Rust native, we still end up waiting most of the time for Python to release the GIL. The design decision to make it very easy to use from Python also means that proper parallelization requires redesigning parts of the codebase.
But more importantly, how does this compare to filterpy
?
Time per keypoint (filterpy): 176.4 µs
Rust speedup: 21.8x
We see that filterpy
is around 22 times slower than my implementation using Rust-native functions. Furthermore, (on this computer) we would spend around 5 ms to process 30 keypoints using the filterpy implementation (without multithreading). In practice this means that if we’re processing data from a 30 fps camera stream, we would spend roughly 1/6th of a frame just on the Kalman filter logic. With the Rust implementation this is only 1/137th of a frame, which gives us much more time for other logic. Since we’re processing data from multiple camera streams in parallel, this is a big deal!
This was my first project using Rust and I learned a lot. Not all of my experiences were positive, and I also made some design decisions which I have come to regret. Here are some of my thoughts.
I honestly think Rust’s trait system is amazing. All ‘classes’ are just structs, which makes it extremely clear what data a ‘class’ uses in its lifetime. Abstract interaction between classes is then done by implementing certain traits.
For example in my code I defined a MeasurementFunction
trait, and the unscented Kalman filter then gets told which measurement function to use at runtime by taking a Box<dyn MeasurementFunction>
. This was very important because it let me treat measurement functions defined in Python and those defined in Rust on equal footing. This was in fact the hardest thing for me to figure out in this project.
One of the reasons we needed this is that measurement functions need context in my use case. In particular, when we run the update function we need to know for which camera to run it. There are multiple ways to make this possible, but since I wanted to avoid restrictions on what this context could be, dealing with it became a bit of a headache.
When using a Kalman filter it is normal to have different sensors (such as different cameras), which means having a different measurement function for each sensor. If the number of sensors is finite, then we don’t actually need an arbitrary object as context. We just need to keep a list of all the measurement functions, and then the only context we need is a single index. Had I made this design decision, my code would likely have been much simpler.
C/C++ together with a library like pybind11 is still the go-to for speeding up your Python code. With the PyO3 crate, Rust offers itself as a solid alternative. Defining classes and methods, calling Python code, handling errors, and dealing with the GIL are all relatively easy. I think what makes this possible is Rust’s first-class macro system. Still, you do end up with a fair bit of boilerplate. For instance, you need to define getters and setters for every single attribute that you want to expose to Python. As an example, I ended up with 143 lines of code like the following in the unscented Kalman filter class alone:
#[getter]
#[pyo3(name = "x")]
pub fn py_get_x(&self, py: Python<'_>) -> PyResult<Py<PyArray1<Float>>> {
    let array = self.x.clone().into_pyarray(py).to_owned();
    Ok(array)
}

#[setter]
#[pyo3(name = "x")]
pub fn py_set_x(&mut self, x: PyReadonlyArray1<Float>) -> PyResult<()> {
    self.x = x.as_array().to_owned();
    Ok(())
}
I understand why it’s necessary, but that doesn’t make it fun.
Furthermore, Rust has good support for generic types. There were quite a few places where defining a struct or trait with generic types would have made my life easier, but this is not supported by PyO3 (at least for now).
As the ecosystem matures a little, I have no doubt things are going to improve even further. But for now, while it isn’t difficult per se, it still can be a bit tedious to write Python modules in Rust.
One reason why I really enjoy writing numerical code in Python is because of numpy and its surrounding ecosystem. Rust is faster, but without good libraries for numerical array programming, it would not be so useful for me. The main library in this area is the ndarray
crate and, as a whole, it is intended to be quite similar to numpy
. However, with Rust’s strict typing and memory management, I did find it quite tricky at times.
For instance, in Python we might write x+=K@y
, but with ndarray we write x += &K.dot(&y)
, although this might change depending on whether K
or y
are an Array
, ArrayView
or ArrayViewMut
, and sometimes we have to call .view()
, .to_owned()
or .clone()
on the arrays, and sometimes we don’t. Certainly at first, and still occasionally now, it feels like trial and error is necessary to do simple things like adding or multiplying two arrays. Many of these things are good, though: we make it really explicit when a memory copy occurs, and we make sure that each piece of data has only one owner. This prevents bugs and improves performance. On the other hand, it can also be very frustrating at first when coming from a language like Python, where you never have to worry about that (for better or for worse).
Whenever you write code where speed matters, profiling is your best friend. In this particular project I used py-spy
, which can also inspect native code. It basically polls the state of the program 100 times per second and records which function or line of code the program was executing, even if that code was written and compiled in another language like Rust. You then look at which parts of the code take up most of the execution time, and try to improve those first. Before profiling, my code was around 3-4 times slower than it is now. Most of the performance left on the table was due to mistakes such as unnecessary memory copies or iterating over an array in a non-contiguous manner (i.e. column-by-column rather than row-by-row). There are still improvements I could make to my implementation, but I was optimizing the code right before my wife went into labor, and after that I had other priorities for a while.
Rust is a pretty compelling tool for writing fast code as part of a larger Python code base. I like the language, and I am eager to use more of it. I don’t know if the unscented Kalman filter implementation I made will actually end up getting used at my workplace, but it was a nice learning experience regardless. It was also very interesting to actually dive deeper into how (unscented) Kalman filters work, rather than just using them as a tool.
I hope you learned something about Kalman filters, and I’m eager to hear how you will use Rust to speed up some of your own Python codebase!
In the beginning, I was using GitHub Pages to host this website. However, that meant I didn’t have access to usage stats. Google did offer some analytics, but it was mostly about Google searches. Then, in September, I got myself a VPS to host my own cloud storage and have secure offsite backups, among other things.
This set the stage to finally move to a self-hosted solution around the new year and get access to the sweet, juicy data. Doing this, as well as other sysadmin things on my VPS, made me comfortable using Docker to deploy things, and has overall been a great learning experience. Honestly, it was very frustrating at times, and there were multiple times I just gave up. For example, to get familiar with AWS, I wanted to use rdiff-backup
to make backups to an S3 bucket, but after probably spending around 10 hours on this, I just gave up. I think my entire approach was just wrong; but how can you know this in advance? Nevertheless, this website has been running smooth as butter on my VPS, and I love the development workflow.
I started on this project in early March, and it took about two months of working on it on-and-off in my free time to complete. At my job I gained some experience in using Typescript to make front-ends for Python applications and also got acquainted with plotly
as an alternative plotting framework. My initial idea was therefore to do the following:
It started off pretty smoothly. While I had taken some online courses on SQL databases, I had never actually used them in a project; still, it was quite simple to use sqlalchemy
to ingest the logs into a SQLite
database. Since there are never any concurrent users, it didn’t make much sense to me to go for anything more advanced. I then wrote code using pandas
to clean the data and get interesting data out such as the web page the user connects to or the geographic location of each user. I then did data exploration and made some nice time series plots among other things, and I started on making a dashboard front-end using dash
. This is not my first data science project, so this was the easy part.
I did some benchmarking and found that even on just two months’ worth of data most of the plot functions took a couple hundred milliseconds to compute. Since I planned from the start to make the dashboard publicly accessible, I felt optimization was necessary. That’s why I decided to switch from pandas
to polars
, a dataframe library written in Rust
more geared towards performance. I found this switch to be very interesting because polars
forces you into a different way of thinking about the data transformations. While this is less intuitive, it does guide you towards solutions that are inherently more performant than what I would come up with when attempting the same thing in pandas
. Another tool in my toolkit.
One thing I really wanted to understand from the data was the differences in geographic regions, but this seems to be very difficult to capture in one or two plots. Instead, I realized that to convey this information properly we need to have some kind of interactivity. I came up with the idea of allowing a user of the dashboard to select different subsets of the dataset which are then displayed in the same plot. For example, suppose I want to know at what time users in Denmark typically connect to my website versus people in Switzerland. I could then make one subset of the dataset containing only the people connecting from Denmark, and another subset of people connecting from Switzerland. After proper normalization, I can then just make a time series of the time of day when users from both subsets of the data connect, and I’ve got my answer. (Shown below; Denmark in blue, Switzerland in orange)
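The subset-and-normalize step described here can be sketched with pandas on a made-up log sample (the column names and data are hypothetical; my actual pipeline uses polars):

```python
import pandas as pd

# Hypothetical access-log slice: one row per request, with country and hour of day.
logs = pd.DataFrame({
    "country": ["DK", "DK", "DK", "CH", "CH"],
    "hour": [8, 8, 21, 8, 20],
})

# Per-country distribution over the hour of day, normalized so each country's
# fractions sum to 1 (otherwise countries with more traffic would dominate).
counts = logs.groupby(["country", "hour"]).size()
fractions = counts / counts.groupby(level="country").transform("sum")
print(fractions.loc["DK"])  # fraction of Danish requests per hour
```

Each country’s curve can then be plotted as its own time series, which is exactly the Denmark-versus-Switzerland comparison described above.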
I thus needed to make a component that can be used to interactively select different subsets or ‘filters’ of the dataset. This turned out to be more ambitious and trickier to code than I initially expected. First of all, there is the pure front-end stuff: it took a while to figure out which components to include and how to configure all the CSS and the necessary animations for it to work. Since my experience with TypeScript and CSS is quite limited, this was probably one of the most challenging parts. I also needed to code quite a bit of interaction between the front-end and the back-end. For example, the front-end and back-end need to agree on which options are allowed in the country selector, so this has to be communicated from the Python back-end. Next, the front-end needs to send all this filter information to the back-end, which of course also requires us to specify a format for this data.
Initially, I tried to do all of this communication purely through dash, but it quickly became apparent that this falls outside the intended use of this library. Fortunately, it wasn’t hard to switch to using flask and plotly directly to get a lot more control over the communication between the front-end and back-end.
And then there were bugs. Many bugs! Why is this div not centered? Why does this plot overflow on top of another plot? Why doesn’t the size of the plot change when I resize my browser? Why doesn’t changing the date update the plots? And of course, there is a lot of nitpicking over small details: making the size of the button just right, or adding some insignificant feature that magically burned 2 hours. It turns out that coding a responsive website is not that easy, and I have a lot of respect for front-end developers.
“It works on my machine” is of course not going to cut it for a website. I first of all needed to make a good Docker container to run the website, and to test it properly. This somehow always takes more time than I expect. Building the Docker image and installing all the Python and JavaScript requirements takes upwards of 3 minutes, and if I make a mistake I need to start all over again.
While I had nice scripts that can ingest the access logs into my database, I also needed to figure out how to deploy this. In the end, I settled on using a logrotate script that is run once a day and calls a Python script (inside the Docker container hosting the dashboard). This was the first time I ever used log rotation, and getting the config to work took a few tries.
The dashboard is now running at dashboard.rikvoorhaar.com, as a kind of standalone component of my website. It would be really nice to integrate it more tightly, but this seems to be relatively tricky: the rest of my website is completely static, and I’m not sure how to embed the dashboard (other than using an iframe, which frankly looks awful). Eventually, I plan to make an entirely new website from scratch, and I plan to integrate the dashboard better as well.
This project has been a fantastic learning experience. I don’t think I ever did any other personal project where I learned so many new skills. Just to make a list, I gained experience in all of the following cool technologies:
React
dash
logrotate
npm
plotly
polars
sqlalchemy
webpack
The unsung hero that empowered this learning experience was ChatGPT. Without it, the project would have easily taken 2-3 times longer. When you’re learning something new, you often have no idea even where to start. And that is really where a tool like ChatGPT shines.
Is the dashboard finished?
No. It is not. There are many, many features that I would love to add. But I am probably not going to do it. I’m not going to call this project a time sink, but I am eager to start a new project.
Are you going to use the dashboard?
Maybe a little. Honestly, I’m not quite sure what to do with this information other than stare at it from time to time. This website is not a product, and there is no business value to be gained out of analyzing the access logs.
Can I use it for my own website?
Sure! The entire project’s source code can be found on GitHub. If your logs are in the same format as mine, then deploying it should be relatively simple. Feel free to contact me if you want to do this, and I’d be happy to help.
A few months ago, I defended my thesis and earned the title of “doctor.” I’m excited to share the contents of my thesis defense with you in this blog post, where you can get a glimpse of the fascinating research I conducted over the past few years. The post is divided into several parts, with the level of technical detail increasing gradually.
If you’re interested in reading my full thesis, you can find it here. I’ve also made my defense slides available for download here.
My thesis focuses on low-rank tensors, but to understand them, it’s important to first discuss low-rank matrices. You can learn more about low-rank matrices in this blog post. A low-rank matrix is simply the product of two smaller matrices. For example, below we write the matrix \(A\) as the product \(A=XY^\top\).
In this case, the matrix \(X\) is of size \(m\times r\) and the matrix \(Y\) is of size \(n\times r\). This usually means that the product \(A\) is a rank-\(r\) matrix, though it could have a lower rank if, for example, one of the columns of \(X\) or \(Y\) is zero.
While many matrices encountered in real-world applications are not low-rank, they can often be well approximated by low-rank matrices. Images, for example, can be represented as matrices (if we consider each color channel separately), and low-rank approximations of images can give recognizable results. In the figure below, we can see several low-rank approximations of an image, with higher ranks giving better approximations of the original image.
To determine the “best” rank-\(r\) approximation of a matrix \(A\), we can solve the following optimization problem:
\[\min_{B \text{ rank } \leq r} \|A - B\|\]There are several ways to solve this approximation problem, but luckily in this case there is a simple closed-form solution known as the truncated SVD. To apply this method using numpy, we can use the following code:
def low_rank_approx(A, r):
    U, S, Vt = np.linalg.svd(A)
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
One disadvantage of this particular function is that it hides the fact that the output has rank \(\leq r\), since we’re just returning an \(m\times n\) matrix. However, we can fix this easily as follows:
def low_rank_approx(A, r):
    U, S, Vt = np.linalg.svd(A)
    X = U[:, :r] @ np.diag(S[:r])
    Y = Vt[:r, :].T
    return X, Y
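As a quick sanity check of this factored version, the Eckart–Young theorem says that the Frobenius-norm error of the rank-\(r\) truncation equals the norm of the discarded singular values. A small sketch (assuming numpy is imported as np, as in the snippets above; the matrix sizes are arbitrary):

```python
import numpy as np

# Best rank-r approximation via truncated SVD, kept in factored form X @ Y.T.
rng = np.random.default_rng(0)
A = rng.normal(size=(40, 30))
r = 5

U, S, Vt = np.linalg.svd(A)
X = U[:, :r] @ np.diag(S[:r])
Y = Vt[:r, :].T

# Eckart-Young: the Frobenius error equals the norm of the dropped singular values.
error = np.linalg.norm(A - X @ Y.T)
expected = np.sqrt(np.sum(S[r:] ** 2))
```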
Low-rank matrices are computationally efficient because they enable fast products. If we have two large \(n\times n\) matrices, it takes \(O(n^3)\) flops to compute their product using conventional matrix multiplication algorithms. However, if we can express the matrices as low-rank products, such as \(A=X_1Y_1^\top\) and \(B=X_2Y_2^\top\), then computing their product requires only \(O(rn^2)\) flops. Even better, the product can be expressed as the product of two \(n\times r\) matrices using only \(O(r^2n)\) flops, which is potentially much less than \(O(n^3)\). Similarly, if we want to multiply a matrix with a vector, a low-rank representation can greatly reduce the computational cost.
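To make this concrete, here is a small sketch (shapes and names are my own) multiplying two rank-\(r\) matrices entirely in factored form; the \(r\times r\) core \(Y_1^\top X_2\) is the only place where the two factorizations interact, so no \(n\times n\) matrix is ever formed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 10
X1, Y1 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
X2, Y2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))

# (X1 @ Y1.T) @ (X2 @ Y2.T) == X1 @ (Y1.T @ X2) @ Y2.T,
# so the product is again rank <= r, with factors X1 @ core and Y2.
core = Y1.T @ X2            # r x r core, costs O(r^2 n) flops
X_prod = X1 @ core          # n x r
Y_prod = Y2                 # n x r
```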
The decomposition of a size \(m\times n\) matrix \(A\) into \(A=XY^\top\) uses only \(r(m+n)\) parameters, compared to the \(mn\) parameters required for the full matrix. (In fact, since the decomposition is only unique up to an invertible \(r\times r\) transformation, we only need \(r(m+n) - r^2\) parameters.) This implies that with more than \(r(m+n)\) entries of the matrix known, we can hope to infer the remaining entries of the matrix. This is called matrix completion, and we can achieve it by solving an optimization problem, such as:
\[\min_{B\text{ rank }\leq r}\|\mathcal P_\Omega{A} - \mathcal P_{\Omega}B\|,\]where \(\Omega\) is the set of known entries of \(A\), and \(\mathcal P_\Omega\) is the projection that sets all entries in a matrix not in \(\Omega\) to zero.
Matrix completion is illustrated in two examples of reconstructing an image as a rank 100 matrix, where 2.7% of the pixels were removed. The effectiveness of matrix completion depends on the distribution of the unknown pixels. When the unknown pixels are spread out, matrix completion works well, as seen in the first example. However, when the unknown pixels are clustered in specific regions, matrix completion does not work well, as seen in the second example.
There are various methods to solve the matrix completion problem, and one simple technique is alternating least squares optimization, which I also discussed in this blog post. This approach optimizes the matrices \(X\) and \(Y\) alternately, given the decomposition \(B=XY^\top\). Another interesting method is solving a slightly different optimization problem which turns out to be convex, and which can thus be solved using the machinery of convex optimization. Another effective method is Riemannian gradient descent, which is also useful for low-rank tensors. The idea is to treat the set of low-rank matrices as a Riemannian manifold, and then use gradient descent methods. The gradient is projected onto the tangent space of the manifold, and then a step is taken in the projected direction, which keeps us closer to the constraint set. The projection back onto the manifold is usually cheap to compute, and can be combined with the step into a single operation known as a retraction. The challenge of Riemannian gradient descent is to find a retraction that is both cheap to compute and effective for optimizing the objective.
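The alternating least squares idea can be sketched in a few lines (my own minimal version, not the exact algorithm from the thesis; the small regularization and iteration count are arbitrary choices of mine, and `mask` is a boolean array marking the known entries):

```python
import numpy as np

def als_completion(A, mask, r, n_iters=100, reg=1e-8):
    """Fit B = X @ Y.T to A on the known entries by alternating least squares."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    X = rng.normal(size=(m, r))
    Y = rng.normal(size=(n, r))
    eye = reg * np.eye(r)
    for _ in range(n_iters):
        # Fix Y, solve a small r x r least-squares problem for each row of X.
        for i in range(m):
            Yi = Y[mask[i]]
            X[i] = np.linalg.solve(Yi.T @ Yi + eye, Yi.T @ A[i, mask[i]])
        # Fix X, solve for each row of Y.
        for j in range(n):
            Xj = X[mask[:, j]]
            Y[j] = np.linalg.solve(Xj.T @ Xj + eye, Xj.T @ A[mask[:, j], j])
    return X, Y
```

Each inner step is only an \(r\times r\) linear solve, so individual sweeps are cheap even for large matrices.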
Let’s now move on to the basics of tensors. A tensor is a multi-dimensional array, with a vector being an order 1 tensor and a matrix being an order 2 tensor. An order 3 tensor can be thought of as a collection of matrices, or as an array whose entries fill a cube of values. This geometric way of thinking becomes hard to visualize at higher orders, but in principle it still works.
Some examples of tensors include:
Recall that a matrix is low-rank if it is the product of two smaller matrices, that is, \(A=XY^\top\). Unfortunately, this notation doesn’t generalize well to tensors. Instead, we can write down each entry of \(A\) as a sum: \(A[i,j] = \sum_{\ell=1}^rX[i,\ell]Y[j,\ell]\)
Similarly, we could write an order 3 tensor as a product of 3 matrices as follows:
\[A[i, j, k] = \sum_{\ell=1}^r X[i,\ell]Y[j,\ell]Z[k,\ell]\]However, if we’re dealing with more complicated tensors of higher order, this kind of notation can quickly become unwieldy. One way to get around this is to use a diagrammatic notation, where tensors are represented by boxes with one leg (edge) for each of the tensor’s indices. Connecting two boxes via one of these legs denotes summation over the associated index. For example, matrix multiplication is denoted as follows:
To make it clearer which legs can be contracted together, it’s helpful to label them with the dimension of the associated index; it is only possible to sum over an index belonging to two different tensors if they have the same dimension.
We can for example use the following diagram to depict the low-rank order 3 tensor described above:
In this case, we sum over the same index for three different matrices, so we connect the three legs together in the diagram. This resulting low-rank tensor is called a “CP tensor”, where “CP” stands for “canonical polyadic”. This tensor format can be easily generalized to higher order tensors. For an order-d tensor, we can represent it as:
\[A[i_1,i_2,\dots,i_d] = \sum_{\ell=1}^r X_1[i_1,\ell]X_2[i_2,\ell]\cdots X_d[i_d,\ell]\]The CP tensor format is a natural generalization of low-rank matrices and is straightforward to formulate. However, finding a good approximation of a given tensor as a CP tensor of a specific rank can be difficult, unlike matrices where we can use truncated SVD to solve this problem. To overcome this limitation, we can use a slightly more complex tensor format known as the “tensor train format” (TT).
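For concreteness, here is how an order-3 CP tensor can be materialized with numpy (a small sketch; the dimensions and names are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, r = 4, 5, 6, 3
X = rng.normal(size=(n1, r))
Y = rng.normal(size=(n2, r))
Z = rng.normal(size=(n3, r))

# A[i,j,k] = sum_l X[i,l] * Y[j,l] * Z[k,l], written as one einsum over
# the shared index l that connects the three factor matrices.
A = np.einsum('il,jl,kl->ijk', X, Y, Z)
```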
Let’s consider the formula for a low-rank matrix decomposition once again: \(A=C_1C_2\)
We can express this as follows:
\[A[i_1, i_2] = \sum_{\ell=1}^r C_1[i_1,\ell]C_2[\ell,i_2]\]Alternatively, we can rewrite this as the product of row \(i_1\) of the first matrix and column \(i_2\) of the second matrix, like so: \(A[i_1,i_2] = C_1[i_1,:]C_2[:,i_2]\) This can be visualized as follows:
To extend this to an order-3 tensor, we can represent \(A[i_1,i_2,i_3]\) as the product of 3 vectors, which is known as the CP tensor format. Alternatively, we can express \(A[i_1,i_2,i_3]\) as a vector-matrix-vector product, like so:
\[\begin{align} A[i_1,i_2,i_3] &= C_1[i_1,:]C_2[:,i_2,:]C_3[:,i_3]\\ &= \sum_{\ell_1=1}^{r_1}\sum_{\ell_2=1}^{r_2} C_1[i_1,\ell_1]C_2[\ell_1,i_2,\ell_2]C_3[\ell_2,i_3] \end{align}\]This can be represented visually as shown below:
Extending this to an arbitrary order is straightforward. For example, for an order-4 tensor, we would write each entry of the tensor as a vector-matrix-matrix-vector product, like so:
\[\begin{align} A[i_1,i_2,i_3,i_4] &= C_1[i_1,:]C_2[:,i_2,:]C_3[:,i_3,:]C_4[:,i_4]\\ &= \sum_{\ell_1=1}^{r_1}\sum_{\ell_2=1}^{r_2} \sum_{\ell_3=1}^{r_3} C_1[i_1,\ell_1]C_2[\ell_1,i_2,\ell_2]C_3[\ell_2,i_3,\ell_3] C_4[\ell_3,i_4]. \end{align}\]which can be depicted like this:
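In code, evaluating one entry of such an order-4 tensor train is just a chain of vector-matrix products (a sketch with arbitrary uniform dimensions and ranks of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 5, 3
C1 = rng.normal(size=(n, r))       # first carriage:   n x r1
C2 = rng.normal(size=(r, n, r))    # middle carriages: r x n x r
C3 = rng.normal(size=(r, n, r))
C4 = rng.normal(size=(r, n))       # last carriage:    r3 x n

def tt_entry(i1, i2, i3, i4):
    # A[i1,i2,i3,i4] = C1[i1,:] @ C2[:,i2,:] @ C3[:,i3,:] @ C4[:,i4]
    v = C1[i1]
    v = v @ C2[:, i2, :]
    v = v @ C3[:, i3, :]
    return v @ C4[:, i4]
```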
Let’s translate the formula back into diagrammatic notation. We want to represent an order 4 tensor as a box with four legs, expressed as the product of a matrix, two order 3 tensors, and another matrix. This is the resulting diagram:
An arbitrary tensor train can be denoted as follows:
This notation helps us understand why this decomposition is called a tensor train. Each box (an order-2 or order-3 tensor) represents a “carriage” of the train. We can translate the above diagram into a train shape, like this:
Although I am not skilled at drawing, we can use stable diffusion to create a more aesthetically pleasing depiction:
From what we have seen so far it is not obvious what makes the tensor train decomposition such a useful tool. Although these properties are not unique to the tensor train decomposition, here are some reasons why it is a good decomposition for many applications.
Computing entries is fast: Computing an arbitrary entry \(A[i_1,\dots,i_d]\) is very fast, requiring just a few matrix-vector products. These operations can be efficiently done in parallel using a GPU as well.
Easy to implement: Most algorithms involving tensor trains are not difficult to implement, which makes them easier to adopt. A similar tensor decomposition known as the hierarchical Tucker decomposition is much trickier to use in practical code, which is likely why it is less popular than tensor trains despite theoretically being a superior format for many purposes.
Dimensional scaling: If we keep the ranks \(r = r_1=\dots=r_{d-1}\) of an order-\(d\) tensor train fixed, then the amount of data required to store and manipulate a tensor train only scales linearly with the order of the tensor. A dense tensor format would scale exponentially with the tensor order and quickly become unmanageable, so this is an important property. Another way to phrase this is that tensor trains do not suffer from the curse of dimensionality.
Orthogonality and rounding: Tensor trains can be orthogonalized with respect to any mode. They can also be rounded, i.e. we can lower all the ranks of the tensor train. These two operations are extremely useful for many algorithms and have a reasonable computational cost of \(O(r^3nd)\) flops, and are also very simple to implement.
Nice Riemannian structure: The tensor trains of a fixed maximum rank form a Riemannian manifold. The tangent space, and orthogonal projections onto this tangent space, are relatively easy to work with and compute. The manifold is also topologically closed, which means that optimization problems on this manifold are well-posed. These properties allow for some very efficient Riemannian optimization algorithms.
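To put the scaling argument in numbers, here is a quick parameter count (a sketch assuming uniform mode size \(n\) and uniform rank \(r\)):

```python
# Parameters of a dense order-d tensor versus a tensor train with uniform rank r:
# the TT has two boundary carriages of n*r entries each, plus d-2 middle
# carriages of r*n*r entries each.
def dense_params(n, d):
    return n ** d

def tt_params(n, d, r):
    return 2 * n * r + (d - 2) * n * r * r
```

For \(n=10\), \(d=10\), \(r=5\) this gives \(10^{10}\) dense entries versus just 2,100 tensor-train parameters: linear in \(d\) instead of exponential.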
With the understanding that low-rank tensors can effectively represent discretized functions, I will demonstrate how tensor trains can be utilized to create a unique type of machine learning estimator. To avoid redundancy, I will provide a condensed summary of the topic, and I invite you to read my more detailed blog post on this subject if you would like to learn more.
Let’s consider a function \(f(x,y)\colon I^2\to \mathbb R\) and plot its values on a square. For instance, we can use the following function:
\[f(x,y) = 3\cos(10(x^2 + y^2/2)) -\sin(20(2x-y))/2\]Note that grayscale images can be represented as matrices, so if we use \(m\times n\) pixels to plot the function, we get an \(m\times n\) matrix. Surprisingly, this matrix is always rank-4, irrespective of its size. We illustrate the columns of the matrices \(X\) and \(Y\) of the low-rank decomposition \(A=XY^\top\) below. Notice that increasing the matrix size doesn’t visibly alter the low-rank decomposition.
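The rank-4 claim follows from the angle-addition formulas: each of the two terms splits into two products of a function of \(x\) with a function of \(y\). We can also check it numerically by discretizing \(f\) on a grid and counting the singular values above a small threshold (the grid sizes and the domain \([0,1]^2\) are my own arbitrary choices):

```python
import numpy as np

def f(x, y):
    return 3 * np.cos(10 * (x**2 + y**2 / 2)) - np.sin(20 * (2 * x - y)) / 2

# Discretize on a 200 x 300 grid; broadcasting builds the full matrix of values.
x = np.linspace(0, 1, 200)
y = np.linspace(0, 1, 300)
A = f(x[:, None], y[None, :])

# Count singular values above a relative threshold to get the numerical rank.
S = np.linalg.svd(A, compute_uv=False)
numerical_rank = int(np.sum(S > 1e-10 * S[0]))
```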
This suggests that low-rank matrices can potentially capture complex 2D functions. We can extend this to higher dimensions by using low-rank tensors to represent intricate functions.
We can use low-rank tensors to parametrize complicated functions using relatively few parameters, which makes them suitable as a supervised learning model. Suppose we have a few samples \(y_j = \hat f(x_j)\) with \(j=1,\dots, N\) for \(x_j\in\mathbb R^{d}\), where \(\hat f\) is an unknown function. Let \(f_A\) be the discretized function obtained from a tensor or matrix \(A\). We can formulate supervised learning as the following least-squares problem:
\[\min_{A} \sum_{j=1}^N (f_A(x_j) - y_j)^2\]Each data point \(x_j\) corresponds to an entry \(A[i_1(x_j),\dots,i_d(x_j)]\), which allows us to rephrase the least-squares problem as a matrix/tensor completion problem.
Let’s see this in action for the 2D/matrix case to gain some intuition. First, let’s generate some random points in a square and sample the function \(f(x,y)\) defined above. On the left, we see a scatterplot of the random value samples, and next, we see what this looks like as a discretized function/matrix.
If we now apply matrix completion to this, we get the following. First, we see the completed matrix using a rank-8 matrix, and then the matrices \(X, Y\) of the decomposition \(A=XY^\top\).
What we have so far is already a usable supervised learning model; we plug in data, and as output, it can make reasonably accurate predictions on data points it hasn’t seen so far. However, the data used to train this model is uniformly distributed across the domain. Real data is rarely like that, and if we make the same plots for less uniformly distributed data, the result is less impressive:
How can we get around this? Well, if the data is not uniform, then why should we use a uniform discretization? For technical reasons, the discretization is still required to be a grid, but we can adjust the spacing of the grid points to better match the data. If we do this, we get something like this:
While the final function (3rd plot) may look odd, it does achieve two important things. First, it makes the matrix-completion problem easier because we start with a matrix where a larger percentage of entries are known. And secondly, the resulting function is accurate in the vicinity of the data points. So long as the distribution of the training data is reasonably similar to the test data, this means that the model is accurate on most test data points. The model is potentially not very accurate in some regions, but this may simply not matter in practice.
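One simple way to adapt the grid spacing (a 1D sketch of my own, not necessarily the scheme used in the thesis) is to place the grid edges at empirical quantiles of the data, so that every cell receives roughly the same number of samples:

```python
import numpy as np

# Place grid edges at empirical quantiles of the data, so each cell
# receives roughly the same number of training samples.
rng = np.random.default_rng(0)
samples = rng.beta(2, 5, size=1000)      # some non-uniform 1D data
n_cells = 16

edges = np.quantile(samples, np.linspace(0, 1, n_cells + 1))
cell_idx = np.clip(np.searchsorted(edges, samples, side="right") - 1, 0, n_cells - 1)
counts = np.bincount(cell_idx, minlength=n_cells)
```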
Now, let’s dive into the high-dimensional case and examine how tensor trains can be employed for supervised learning. In contrast to low-rank matrices, we will utilize tensor trains to parameterize the discretized functions. This involves solving an optimization problem of the form:
\[\min_{A\in \mathscr M}\sum_{j=1}^N\left(A[i_1(\mathbf x_j),\dots, i_d(\mathbf x_j)]-y_j\right)^2,\tag{$\star$}\]
where \(\mathscr M\) denotes the manifold of all tensor trains with a given maximum rank. To tackle this optimization problem effectively, we can use the Riemannian structure of the tensor train manifold. This approach results in an optimization algorithm similar to gradient descent (with line search) but utilizing Riemannian gradients instead.
Unfortunately, the problem \((\star)\) is very non-linear and the objective has many local minima. As a result, any gradient-based method will only produce good results if it has a good initialization. The ideal initialization is a tensor that describes a discretized function with low training/test loss. Fortunately, we can easily obtain such tensors by training a different machine learning model (such as a random forest or neural network) on the same data and then discretizing the resulting function. However, this gives us a dense tensor, which is impractical to store in memory. We can still compute any particular entry of this tensor cheaply, equivalent to evaluating the model at a point. Using a technique called TT-cross, we can efficiently obtain a tensor-train approximation of the discretized machine learning model, which we can then use as initialization for the optimization routine.
But why go through all this trouble? Why not just use the initialization model instead of the tensor train? The answer lies in the speed and size of the resulting model. The tensor train model is much smaller and faster than the model it is based on, and accessing any entry in a tensor train is extremely fast and easy to parallelize. Moreover, low-rank tensor trains can still parameterize complicated functions.
To summarize the potential advantages of TTML, consider the following three graphs (for more details, please refer to my thesis):
Based on the results shown in the graphs, it is clear that the TTML model has a significant advantage over other models in terms of size and speed. It is much smaller and faster than other models while maintaining a similar level of test error. However, it is important to note that the performance of the TTML model may depend on the specific dataset used, and in many practical machine learning problems, its test error may not be as impressive as in the experiment shown. That being said, if speed is a crucial factor in a particular application, the TTML model can be a very competitive option.
As we have seen above, the singular value decomposition (SVD) can be used to find the best low-rank approximation of any matrix. Unfortunately, the SVD is rather expensive to compute, costing \(O(mn^2)\) flops for an \(m\times n\) matrix. Moreover, while SVD can also be used to compute good low-rank TT approximations of any tensor, the cost of the SVD can become prohibitively expensive in this context. Therefore, we need a faster way to compute low-rank matrix approximations.
In my blog post I discussed some iterative methods to compute low-rank approximations using only matrix-vector products. However, there are even faster non-iterative methods that are based on multiplying the matrix of interest with a random matrix.
Specifically, if \(A\) is a rank-\(\hat r\) matrix of size \(m\times n\) and \(X\) is a random matrix of size \(n\times r\) with \(r>\hat r\), then it turns out that the product \(AX\) almost always has the same range as \(A\). This is because multiplying by \(X\) like this doesn’t change the rank of \(A\) unless \(X\) is chosen adversarially. However, since we assume \(X\) is chosen randomly this almost never happens. And here ‘almost never’ is meant in the mathematical sense – i.e., with probability zero. As a result, we have the identity \(\mathcal P_{AX}A =A\), where \(\mathcal P_{AX}\) denotes the orthogonal projection onto the range of \(AX\). This projection can be computed using the QR decomposition of \(AX\), whose \(Q\) factor is a matrix whose columns form an orthonormal basis of the range of \(AX\).
If \(A\) has rank \(\hat r>r\) however, then \(\mathcal P_{AX}A\neq A\). Nevertheless, we might hope that these two matrices are close, i.e. we may hope that (with probability 1) we have
\[\|\mathcal P_{AX}A - A\| \leq C\|A_{\hat r}-A\|,\]for some constant \(C\) that depends only on \(\hat r\) and the dimensions of the problem. Recall that here \(A_{\hat r}\) denotes the best rank-\(\hat r\) approximation of \(A\) (which we can compute using the SVD). It turns out that this is true, and it gives a very simple algorithm for computing a low-rank approximation of a matrix using only \(O(mnr)\) flops – a huge gain if \(r\) is much smaller than the size of the matrix. This is known as the Halko-Martinsson-Tropp (HMT) method, and can be implemented in Python like this:
def hmt_approximation(A, r):
    m, n = A.shape
    X = np.random.normal(size=(n, r))
    AX = A @ X
    Q, _ = np.linalg.qr(AX)
    return Q, Q.T @ A
Since \(Q(Q^\top A) = \mathcal P_{AX}A\), this gives a low-rank approximation. It can also be used to obtain an approximate truncated SVD with a minor modification: if we take the SVD \(U\Sigma V^\top = Q^\top A\), then \((QU)\Sigma V^\top\) is an approximate truncated SVD of \(A\). In Python we could implement this like this:
def hmt_truncated_svd(A, r):
    Q, QtA = hmt_approximation(A, r)
    U, S, Vt = np.linalg.svd(QtA, full_matrices=False)
    return Q @ U, S, Vt
The HMT method, while efficient, has some drawbacks compared to other randomized methods. For instance, it cannot compute a low-rank decomposition of the sum \(A+B\) of two matrices in parallel since the QR decomposition of \((A+B)X\) requires the computation of \((A+B)X\) first. Additionally, if a low-rank approximation \(Q(Q^\top A)\) has already been computed and a small change \(B\) is made to \(A\) to obtain \(A' = A + B\), it is not possible to compute an approximation of the same rank for \(A'\) without redoing most of the work.
The issue arises from the fact that computing a QR decomposition of the product \(AX\) is nonlinear and expensive. One way to address this is by introducing a second random matrix \(Y\) of size \(m\times r\) and computing a decomposition of \(Y^\top AX\). This matrix has a much smaller size of \(r\times r\), allowing for efficient computations if \(r\) is small. Furthermore, computing \(Y^\top(A+B)X\) can be performed entirely in parallel. This results in a low-rank decomposition of the form:
\[A\approx AX(Y^\top AX)^\dagger Y^\top A,\]where \(\dagger\) denotes the pseudo-inverse. This is a generalization of the matrix inverse to non-invertible or rectangular matrices, and it can be computed using the SVD of a matrix. If \(A=U\Sigma V^\top\), then \(A^\dagger = V\Sigma^{-1} U^\top\), where we set \(\Sigma^{-1}[i,i] = 1/\Sigma[i,i]\) for every nonzero diagonal entry, and zero diagonal entries are left at zero.
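In code, this SVD recipe for the pseudo-inverse looks as follows (a sketch; the relative tolerance for treating singular values as zero is an arbitrary choice of mine):

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    # A = U @ diag(S) @ Vt, so pinv(A) = V @ diag(1/S) @ U.T,
    # where zero (or numerically tiny) singular values stay at zero.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    S_inv = np.array([1.0 / s if s > tol * S[0] else 0.0 for s in S])
    return Vt.T @ (S_inv[:, None] * U.T)
```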
The randomized decomposition we discussed is known by different names and has appeared in slightly different forms many times in the literature. In my thesis, I refer to it as the “generalized Nyström” (GN) method. Like the HMT method, it is “quasi-optimal”, which means that it satisfies:
\[\|AX(Y^\top AX)^\dagger Y^\top A - A\|\leq C\|A_{\hat r}-A\|.\]However, there are two technical caveats that need to be discussed. The first is that we need to choose \(X\) and \(Y\) to be of different sizes; that is, we must have \(X\) of size \(n \times r_R\) and \(Y\) of size \(m \times r_L\) with \(r_L \neq r_R\). This is because otherwise \((Y^\top AX)^\dagger\) can have strange behavior. For example, the expected spectral norm \(\mathbb{E} \|(Y^\top AX)^\dagger\|_2\) is infinite if \(r_L=r_R\).
The second caveat is that explicitly computing \((Y^\top AX)^\dagger\) and then multiplying it by \(AX\) and \(Y^\top A\) can lead to numerical instability. However, a product of the form \(A^\dagger B\) is equivalent to the least-squares solution of a linear problem of the form \(AX=B\). As a result, we could implement this method in Python as follows:
def generalized_nystrom(A, rank_left, rank_right):
    m, n = A.shape
    X = np.random.normal(size=(n, rank_right))
    Y = np.random.normal(size=(m, rank_left))
    AX = A @ X
    YtA = Y.T @ A
    YtAX = Y.T @ AX
    L = AX
    R = np.linalg.lstsq(YtAX, YtA, rcond=None)[0]
    return L, R
Note that this method computes the decomposition implicitly by solving a linear system of equations, which is more stable and efficient than explicitly computing the pseudo-inverse.
Next, we will see how to generalize the GN method to a method for tensor trains. Unfortunately this will get a little technical. Recall that for the GN decomposition we had a decomposition of form
\[AX(Y^\top AX)^\dagger Y^\top A\]
\[\Omega_\mu := Y_\mu^\top \mathcal T^{\leq \mu} X_\mu.\]We call this product a ‘sketch’, and this particular sketch corresponds to the product \(Y^\top AX\) in the matrix case. We visualize the computation of this sketch below:
If we think of a matrix as a tensor with only two modes (legs), then the products \(AX\) and \(Y^\top A\) correspond to multiplying the tensor by matrices such that ‘one mode is left alone’. From this perspective, we can generalize these products to the sketch
\[\Psi_\mu := (Y_{\mu-1}\otimes I_{n_\mu})^\top \mathcal T^{\leq \mu} X_\mu,\]where \(Y_{\mu-1}\) is now a matrix of size \((n_1\dots n_{\mu-1})\times r_L\), also depicted below:
By extension, we can define \(Y_0=X_d=1\), and then the definition of \(\Psi_\mu\) reduces to \(\Psi_1=AX\) and \(\Psi_2=Y^\top A\) in the matrix case. We can therefore rewrite the GN method as
\[AX(Y^\top AX)^\dagger Y^\top A = \Psi_1\Omega_1^\dagger \Psi_2\]More generally, we can chain the sketches \(\Omega_\mu\) and \(\Psi_\mu\) together to form a tensor network of the following form:
With only minor work, we can turn this tensor network into a tensor train. It turns out that this defines a very useful approximation method for tensor trains. However, it may not be immediately clear why this method gives an approximation to the original tensor. To gain some insight, we can rewrite this decomposition into a different form. We can then see that this approximation boils down to successively applying a series of projections to the original tensor. However, the proof of this fact, as well as the error analysis, is outside the scope of this blog post.
We call this decomposition the streaming tensor train approximation, and it has several nice properties. First of all, as its name suggests, it is a streaming method. This means that if we have a tensor that decomposes as \(\mathcal T = \mathcal T_1+\mathcal T_2\), then we can compute the approximation for \(\mathcal T_1\) and \(\mathcal T_2\) completely independently, and only spend a small amount of effort at the end of the procedure to combine the results. This is because all the sketches \(\Omega_\mu\) and \(\Psi_\mu\) are linear in the input tensor, and the final step of computing a tensor train from these sketches is very cheap (and in fact even optional).
The decomposition is also quasi-optimal. This means that the error of this approximation will (with high probability) lie within a constant factor of the error of the best possible approximation. Unlike the matrix case, however, it is not possible in general to compute the best possible approximation itself in a reasonable time.
The cost of computing this decomposition varies depending on the type of tensor being used. It is easy to derive the cost for the case where \(\mathcal T\) is a ‘dense’ tensor, i.e. just a multidimensional array. However, it rarely makes sense to apply a method like this to such a tensor; usually, we apply it to tensors that are far too big to even store in memory. Instead, we assume that the tensor already has some structure. For example, \(\mathcal T\) could be a CP tensor, a sparse tensor, or even a tensor train itself. For each type of tensor, we can then derive a fast way to compute this decomposition, especially if we allow the matrices \(X_\mu\) and \(Y_\mu\) to be structured tensors. The exact implementation is a little too technical to discuss here, but suffice it to say that the method is quite fast in practice.
In addition to this decomposition, we also invented a second kind of decomposition (called OTTS) that is somewhat of a hybrid generalization of the GN and HMT methods. It is no longer a streaming method, but it is in certain cases significantly more accurate than the previous method, and it can be applied in almost all of the same cases. Finally, there is also a generalization (called TT-HMT) of the HMT method to tensor trains that already existed in the literature for a few years that also works in most of the same situations but is also not a streaming method.
Below we compare these three methods – STTA, OTTS and TT-HMT – to a generalization of the truncated SVD (TT-SVD). The latter method is generally expensive to compute but has a very good approximation error, making it an excellent benchmark.
In the plot above we have taken a 10x10x10x10x10 CP tensor and computed TT approximations of different ranks. What we see is that all methods have similar behavior, and are ordered from best to worst approximation as TT-SVD > OTTS > TT-HMT > STTA. This order is something we observe in general across many different experiments, and also in terms of theoretical approximation error. Furthermore, even though these last three methods are all randomized, the variation in approximation error is relatively small, especially for larger ranks. Next, we consider another experiment below:
Here we compare the scaling of the error of the different approximation methods as the order of the tensor increases (so that the number of entries grows exponentially). The tensor that we’re approximating is in this case always a tensor train with a fast decay in its singular values, and the approximation error is always reported relative to the TT-SVD method; for such a tensor it is possible to compute the TT-SVD approximation in a reasonable time. While all methods have similar error scaling, we see that OTTS is closest in performance to TT-SVD. We can thus conclude that all these methods have their merits: we can use STTA if we work with a stream of data, OTTS if we want the best approximation (especially for very big tensors), and TT-HMT if we care more about speed than quality.
This sums up, on a high level, what I did in my thesis. There are plenty of things I could still talk about and plenty of details left out, but this blog post is already quite long and technical. If you’re interested to learn more, you’re welcome to read my thesis or send me a message.
My PhD was a long and fun ride, and I’m now looking back on it with nothing but fondness. I enjoyed my time in Geneva, and I’m going to miss some aspects of the PhD life. I have now started a ‘real’ job, and the things I’m working on have very little overlap with the contents of my thesis. However, I hope I will be able to discuss some of the cool things I’m working on now, as well as some new personal projects.
The linear least-squares problem is one of the most common minimization problems we encounter. It takes the following form:
\[\min_x \|Ax-b\|^2\]Here \(A\) is an \(n\times n\) matrix, and \(x,b\in\mathbb R^{n}\) are vectors. If \(A\) is invertible, then this problem has a simple, unique solution: \(x = A^{-1}b\). However, there are two big reasons why we should almost never use \(A^{-1}\) to solve the least-squares problem in practice:
Assuming \(A\) doesn’t have any useful structure, point 1. is not that bad. Solving the least-squares problem in a smart way costs \(O(n^3)\), and doing it using matrix-inversion also costs \(O(n^3)\), just with a larger hidden constant. The real killer is the instability. To see this in action, let’s take a matrix that is almost singular, and see what happens when we solve the least-squares problem.
import numpy as np
np.random.seed(179)
n = 20
# Create almost singular matrix
A = np.eye(n)
A[0, 0] = 1e-20
A = A @ np.random.normal(size=A.shape)
# Random vector b
b = A @ np.random.normal(size=(n,)) + 1e-3 * np.random.normal(size=n)
# Solve least-squares with inverse
A_inv = np.linalg.inv(A)
x = A_inv @ b
error = np.linalg.norm(A @ x - b) ** 2
print(f"error for matrix inversion method: {error:.4e}")
# Solve least-squares with dedicated routine
x = np.linalg.lstsq(A, b, rcond=None)[0]
error = np.linalg.norm(A @ x - b) ** 2
print(f"error for dedicated method: {error:.4e}")
error for matrix inversion method: 3.6223e+02
error for dedicated method: 2.8275e-08
In this case we took a 20x20 matrix \(A\) with ones on the diagonal, except for one entry where it has value 1e-20, and then we shuffled everything around by multiplying by a random matrix. The entries of \(A\) are not so big, but the entries of \(A^{-1}\) are gigantic. As a result, the solution obtained as \(x=A^{-1}b\) does not satisfy \(Ax=b\) in practice. The solution found by the np.linalg.lstsq routine is much better.
The reason that the inverse-matrix method fails so badly in this case can be summarized using the condition number \(\kappa(A)\). It expresses, in the worst case, how much the solution \(x\) of \(Ax=b\) changes relative to a small change in \(b\). The condition number thus gives a notion of how much numerical errors get amplified when we solve the linear system. We can compute it as the ratio between the largest and smallest singular values of the matrix \(A\):
\[\kappa(A) = \sigma_1(A) / \sigma_n(A)\]In the case above the condition number is really big:
np.linalg.cond(A)
1.1807555508404976e+16
Large condition numbers mean that any numerical method is going to struggle to give a good solution, but for numerically unstable methods the problem is a lot worse.
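To make the amplification concrete, here is a minimal sketch (with a hypothetical 2x2 matrix) showing a tiny relative change in \(b\) being amplified by exactly \(\kappa(A)\) in the worst case:

```python
import numpy as np

# Hypothetical 2x2 matrix with condition number 1e8
A = np.diag([1.0, 1e-8])
b = np.array([1.0, 0.0])
x = np.linalg.solve(A, b)

# Tiny perturbation of b along the smallest singular direction
delta = np.array([0.0, 1e-6])
x_pert = np.linalg.solve(A, b + delta)

rel_change_b = np.linalg.norm(delta) / np.linalg.norm(b)
rel_change_x = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
amplification = rel_change_x / rel_change_b  # close to kappa(A) = 1e8
```

A relative change of 1e-6 in \(b\) produces a relative change of order 1e2 in \(x\): the perturbation gets multiplied by the full condition number.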
While the numerical stability of algorithms is a fascinating topic, it is not what we came here for today. Instead, let’s revisit the first reason why using matrix inversion for solving linear problems is bad. I mentioned that matrix inversion and better alternatives take \(O(n^3)\) operations to solve the least-squares problem \(\min_x\|Ax-b\|^2\) if there is no extra structure on \(A\) that we can exploit.
What if there is such structure? For example, what if \(A\) is a huge sparse matrix? The Netflix dataset we considered in this blog post has size 480,189 x 17,769. Putting aside the fact that it is not square, inverting matrices of that size is infeasible. Moreover, the inverse of a sparse matrix isn’t necessarily sparse, so we would lose that valuable structure as well.
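The loss of sparsity is easy to demonstrate. In the sketch below (using a hypothetical tridiagonal matrix), the matrix has about \(3n\) non-zero entries, but its inverse turns out to be completely dense:

```python
import numpy as np
import scipy.sparse

n = 200
# Hypothetical sparse matrix: tridiagonal, so about 3n non-zero entries
A = scipy.sparse.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
density_A = A.nnz / (n * n)  # about 1.5% of entries are non-zero

# The inverse, however, has no zero entries at all
A_inv = np.linalg.inv(A.toarray())
density_inv = np.count_nonzero(np.abs(A_inv) > 1e-12) / (n * n)
```

For this matrix, `density_A` is about 0.015 while `density_inv` is 1; storing and applying the inverse costs \(O(n^2)\) where the original matrix only needed \(O(n)\).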
Another example arose in my first post on deconvolution. There we tried to solve the linear problem
\[\min_x \|k * x -y\|^2\]where \(k * x\) denotes convolution. Convolution is a linear operation, but requires only \(O(n\log n)\) to compute, whereas writing it out as a matrix would require \(n\times n\) entries, which can quickly become too large.
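The \(O(n\log n)\) cost comes from the convolution theorem: convolution in the signal domain is pointwise multiplication in the Fourier domain, and the FFT costs \(O(n\log n)\). A small 1D sketch of this equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
k = rng.normal(size=n)  # hypothetical 1D kernel
x = rng.normal(size=n)  # hypothetical 1D signal

# Direct (full) convolution: O(n^2) operations
direct = np.convolve(k, x, mode="full")

# Convolution theorem: multiply in the Fourier domain, O(n log n) operations
m = 2 * n - 1  # length of the full convolution
via_fft = np.fft.irfft(np.fft.rfft(k, m) * np.fft.rfft(x, m), m)
```

Both routes produce the same result up to floating-point accuracy; only the cost differs.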
In situations like this, we have no choice but to devise an algorithm that makes use of the structure of \(A\). What the two situations above have in common is that storing \(A\) as a dense matrix is expensive, but computing matrix-vector products \(Ax\) is cheap. The algorithm we are going to come up with is going to be iterative; we start with some initial guess \(x_0\), and then improve it until we find a solution of the desired accuracy.
We don’t have much to work with; we have a vector \(x_0\) and the ability to compute matrix-vector products. Crucially, we assumed our matrix \(A\) is square. This means that \(x_0\) and \(Ax_0\) have the same shape, and therefore we can also compute \(A^2x_0\), or in fact \(A^rx_0\) for any \(r\). The idea is then to try to express the solution to the least-squares problem as a linear combination of the vectors
\[\mathcal K_r(A,x_0):=\{x_0, Ax_0,A^2x_0,\ldots,A^{r-1}x_0\}.\]This results in a class of algorithms known as Krylov subspace methods. Before diving further into how they work, let’s see one in action. We take a 2500 x 2500 sparse matrix with 5000 non-zero entries (which includes the entire diagonal).
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt
from time import perf_counter_ns
np.random.seed(179)
n = 2500
N = n
shape = (n, n)
# Create random sparse (n, n) matrix with N non-zero entries
coords = np.random.choice(n * n, size=N, replace=False)
coords = np.unravel_index(coords, shape)
values = np.random.normal(size=N)
A_sparse = scipy.sparse.coo_matrix((values, coords), shape=shape)
A_sparse = A_sparse.tocsr()
A_sparse += scipy.sparse.eye(n)
A_dense = A_sparse.toarray()
b = np.random.normal(size=n)
b = A_sparse @ b
# Solve using np.linalg.lstsq
time_before = perf_counter_ns()
x = np.linalg.lstsq(A_dense, b, rcond=None)[0]
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_dense @ x - b) ** 2
print(f"Using dense solver: error: {error:.4e} in time {time_taken:.1f}ms")
# Solve using inverse matrix
time_before = perf_counter_ns()
x = np.linalg.inv(A_dense) @ b
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_dense @ x - b) ** 2
print(f"Using matrix inversion: error: {error:.4e} in time {time_taken:.1f}ms")
# Solve using GMRES
time_before = perf_counter_ns()
x = scipy.sparse.linalg.gmres(A_sparse, b, tol=1e-8)[0]
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_sparse @ x - b) ** 2
print(f"Using sparse solver: error: {error:.4e} in time {time_taken:.1f}ms")
Using dense solver: error: 1.4449e-25 in time 2941.5ms
Using matrix inversion: error: 2.4763e+03 in time 507.0ms
Using sparse solver: error: 2.5325e-13 in time 6.4ms
As we see above, the sparse matrix solver solves this problem in a fraction of the time, and the difference will only get bigger with larger matrices. Above we used the GMRES routine, and the algorithm itself is quite simple: it constructs an orthonormal basis of the Krylov subspace \(\mathcal K_m(A,r_0)\), where \(r_0 = b - Ax_0\), and then finds the best solution in this subspace by solving a small \((m+1)\times m\) linear system. Before figuring out the details, below is a simple implementation:
def gmres(linear_map, b, x0, n_iter):
    # Initialization
    n = x0.shape[0]
    H = np.zeros((n_iter + 1, n_iter))
    r0 = b - linear_map(x0)
    beta = np.linalg.norm(r0)
    V = np.zeros((n_iter + 1, n))
    V[0] = r0 / beta
    for j in range(n_iter):
        # Compute next Krylov vector
        w = linear_map(V[j])
        # Gram-Schmidt orthogonalization
        for i in range(j + 1):
            H[i, j] = np.dot(w, V[i])
            w -= H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        # Add new vector to basis
        V[j + 1] = w / H[j + 1, j]
    # Find best approximation in the basis V
    e1 = np.zeros(n_iter + 1)
    e1[0] = beta
    y = np.linalg.lstsq(H, e1, rcond=None)[0]
    # Convert result back to full basis and return
    x_new = x0 + V[:-1].T @ y
    return x_new
# Try out the GMRES routine
time_before = perf_counter_ns()
x0 = np.zeros(n)
linear_map = lambda x: A_sparse @ x
x = gmres(linear_map, b, x0, 50)
time_taken = (perf_counter_ns() - time_before) * 1e-6
error = np.linalg.norm(A_sparse @ x - b) ** 2
print(f"Using GMRES: error: {error:.4e} in time {time_taken:.1f}ms")
Using GMRES: error: 1.1039e-15 in time 12.9ms
This clearly works; it’s not as fast as the scipy implementation of the same algorithm, but we’ll do something about that soon.
Let’s take a more detailed look at what the GMRES algorithm is doing. We iteratively build an orthonormal basis \(V_m = \{v_0,v_1,\dots,v_{m-1}\}\). We start with \(v_0 = r_0 / \|r_0\|\), where \(r_0 = b-Ax_0\) is the residual of the initial guess \(x_0\). In each iteration we then set \(w = A v_j\), orthogonalize it against all previous vectors by computing \(\tilde w = w - \sum_{i\le j} (w^\top v_{i})v_i\), and normalize to obtain \(v_{j+1} = \tilde w/\|\tilde w\|\). Therefore \(V_m\) is an orthonormal basis of the Krylov subspace \(\mathcal K_m(A,r_0)\).
Once we have this basis, we want to solve the minimization problem:
\[\min_{x\in \mathcal K_m(A,r_0)} \|A(x_0+x)-b\|\]Since \(V_m\) is a basis, we can write \(x = V_m y\) for some \(y\in \mathbb R^m\). Also note that in this basis \(b-Ax_0 = r_0 = \beta v_0 = \beta V_m e_1\) where \(\beta = \|r_0\|\) and \(e_1= (1,0,\dots,0)\). This allows us to rewrite the minimization problem:
\[\min_{y\in\mathbb R^m} \|AV_my - \beta V_me_1\|\]To solve this minimization problem we need one more trick. In the algorithm we computed a matrix \(H\); its entries are defined as
\[H_{ij} = v_i^\top \Big(Av_j-\sum_{k<i} H_{kj}v_k\Big) = v_i^\top A v_j\]These are precisely the coefficients of the Gram-Schmidt orthogonalization, and hence \(A v_j = \sum_{i=0}^{j+1} H_{ij}v_i\), giving the matrix equality \(AV_m = V_{m+1}H\), where \(V_{m+1}\) contains the basis vectors \(v_0,\dots,v_m\) as columns. Now we can rewrite the minimization problem even further and get
\[\min_{y\in\mathbb R^m} \|V_{m+1} (Hy - \beta e_1)\| = \min_{y\in\mathbb R^m} \|Hy - \beta e_1\|\]The minimization problem is therefore reduced to an \((m+1)\times m\) least-squares problem! The cost of solving it is \(O(m^3)\), and as long as we don’t use too many steps \(m\), this cost is very reasonable. After solving for \(y\), we get the estimate \(x = x_0 + V_m y\).
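We can verify the Arnoldi relation \(AV_m = V_{m+1}H\) numerically with a small, self-contained sketch (a random matrix with hypothetical sizes \(n=30\), \(m=10\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 30, 10
A = rng.normal(size=(n, n))
r0 = rng.normal(size=n)

# Arnoldi iteration; the rows of V are the orthonormal basis vectors
V = np.zeros((m + 1, n))
H = np.zeros((m + 1, m))
V[0] = r0 / np.linalg.norm(r0)
for j in range(m):
    w = A @ V[j]
    for i in range(j + 1):
        H[i, j] = w @ V[i]
        w -= H[i, j] * V[i]
    H[j + 1, j] = np.linalg.norm(w)
    V[j + 1] = w / H[j + 1, j]

# The Arnoldi relation A V_m = V_{m+1} H (V is stored row-wise, hence the transposes)
lhs = A @ V[:m].T
rhs = V.T @ H
```

Both sides agree up to floating-point accuracy, and the basis vectors are orthonormal.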
In the current implementation of GMRES we specify the number of steps in advance, which is not ideal. If we converge to the right solution in fewer steps, we do unnecessary work; if we don’t get a satisfactory solution after the specified number of steps, we have to start over. This is however not a big problem; we can use the output \(x=x_0+V_my\) as the new initialization when we restart.
This gives a nice recipe for GMRES with restarting. We run GMRES for \(m\) steps with \(x_i\) as initialization to get a new estimate \(x_{i+1}\). We then check if \(x_{i+1}\) is good enough; if not, we run the GMRES procedure for another \(m\) steps.
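The restarting recipe can be sketched as follows (the inner single-cycle solver is a compact stand-in for the gmres function above; the tolerance and cycle limit are assumptions):

```python
import numpy as np

def gmres_cycle(matvec, b, x0, m):
    # One GMRES cycle of m steps (compact stand-in for the gmres function above)
    n = x0.shape[0]
    r0 = b - matvec(x0)
    beta = np.linalg.norm(r0)
    V = np.zeros((m + 1, n))
    H = np.zeros((m + 1, m))
    V[0] = r0 / beta
    for j in range(m):
        w = matvec(V[j])
        for i in range(j + 1):
            H[i, j] = w @ V[i]
            w -= H[i, j] * V[i]
        H[j + 1, j] = np.linalg.norm(w)
        V[j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m + 1)
    e1[0] = beta
    y = np.linalg.lstsq(H, e1, rcond=None)[0]
    return x0 + V[:-1].T @ y

def gmres_restarted(matvec, b, x0, m=10, tol=1e-10, max_cycles=200):
    # Restart recipe: run m steps, check the residual, repeat if needed
    x = x0
    for _ in range(max_cycles):
        x = gmres_cycle(matvec, b, x, m)
        if np.linalg.norm(matvec(x) - b) <= tol * np.linalg.norm(b):
            break
    return x

# Hypothetical well-conditioned test system
rng = np.random.default_rng(0)
n = 50
A = np.eye(n) + 0.1 * rng.normal(size=(n, n))
b = rng.normal(size=n)
x = gmres_restarted(lambda v: A @ v, b, np.zeros(n))
rel_residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

Restarting keeps the memory bounded by \(m\) basis vectors, at the cost of throwing away the Krylov space built so far at every restart.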
It is possible to get a good estimate of the residual norm after each step of GMRES, not just every \(m\) steps. However, this is relatively technical to implement, so we will just consider the variation of GMRES with restarting.
How often should we restart? This really depends on the problem we’re trying to solve, since there is a trade-off. More steps in between each restart will typically result in convergence in fewer steps, but it is more expensive and also requires more memory. The computational cost scales as \(O(m^3)\), and the memory cost scales linearly in \(m\) (if the matrix size \(n\) is much bigger than \(m\)). Let’s see this trade-off in action on a model problem.
Recall that the deconvolution problem is of the following form:
\[\min_x \|k * x -y\|^2\]for a fixed kernel \(k\) and signal \(y\). The convolution operation \(k*x\) is linear in \(x\), and we can therefore treat this as a linear least-squares problem and solve it using GMRES. The operation \(k*x\) can be written in matrix form as \(Kx\), where \(K\) is a matrix. For large images or signals, the matrix \(K\) can be gigantic, and we never want to explicitly store \(K\) in memory. Fortunately, GMRES only cares about matrix-vector products \(Kx\), making this a very good candidate to solve with GMRES.
Let’s consider the problem of sharpening (deconvolving) a 128x128 picture blurred using Gaussian blur. To make the problem more interesting, the kernel \(k\) used for deconvolution will be slightly different from the kernel used for blurring. This is inspired by the blind deconvolution problem, where we not only have to find \(x\), but also the kernel \(k\) itself.
We solve this problem with GMRES using different numbers of steps between restarts, and plot how the error evolves over time.
from matplotlib import image
from utils import random_motion_blur
from scipy.signal import convolve2d
# Define the Gaussian blur kernel
def gaussian_psf(sigma=1, N=9):
    gauss_psf = np.arange(-N // 2 + 1, N // 2 + 1)
    gauss_psf = np.exp(-(gauss_psf ** 2) / (2 * sigma ** 2))
    gauss_psf = np.einsum("i,j->ij", gauss_psf, gauss_psf)
    gauss_psf = gauss_psf / np.sum(gauss_psf)
    return gauss_psf
# Load the image and blur it
img = image.imread("imgs/vitus128.png")
gauss_psf_true = gaussian_psf(sigma=1, N=11)
gauss_psf_almost = gaussian_psf(sigma=1.05, N=11)
img_blur = convolve2d(img, gauss_psf_true, mode="same")
# Define the convolution linear map
linear_map = lambda x: convolve2d(
    x.reshape(img.shape), gauss_psf_almost, mode="same"
).reshape(-1)
# Apply GMRES for different restart frequencies and measure time taken
total_its = 2000
n_restart_list = [20, 50, 200, 500]
losses_dict = dict()
for n_restart in n_restart_list:
    time_before = perf_counter_ns()
    b = img_blur.reshape(-1)
    x0 = np.zeros_like(b)
    x = x0
    losses = []
    for _ in range(total_its // n_restart):
        x = gmres(linear_map, b, x, n_restart)
        error = np.linalg.norm(linear_map(x) - b) ** 2
        losses.append(error)
    time_taken = (perf_counter_ns() - time_before) / 1e9
    print(f"Best loss for {n_restart} restart frequency is {error:.4e} in {time_taken:.2f}s")
    losses_dict[n_restart] = losses
Best loss for 20 restart frequency is 9.3595e-16 in 11.32s
Best loss for 50 restart frequency is 2.4392e-22 in 11.71s
Best loss for 200 restart frequency is 6.3063e-28 in 17.34s
Best loss for 500 restart frequency is 6.9367e-28 in 30.50s
We observe that all restart frequencies converge to a result with very low error, and that the larger the number of steps between restarts, the fewer total steps we need. Remember however that the cost of GMRES rises as \(O(m^3)\) with the number of steps \(m\) between restarts, so a larger number of steps is not always better. For example, \(m=20\) and \(m=50\) produced almost identical runtimes, but for \(m=200\) the runtime for 2000 total steps is already significantly bigger, and the effect is even bigger for \(m=500\). This means that if we want to converge as fast as possible in terms of runtime, we’re best off with somewhere between \(m=50\) and \(m=200\) steps between each restart.
If we do simple profiling, we see that almost all of the time in this function is spent on the 2D convolution. Indeed, this is why the runtime does not seem to scale as \(O(m^3)\) for the values of \(m\) we tried above: it simply takes a while before the \(O(m^3)\) factor becomes dominant over the time spent on matrix-vector products.
This also means that it should be straightforward to speed up – we just need to do the convolution on a GPU. It is not as simple as that however; if we just do the convolution on GPU and the rest of the operations on CPU, then the bottleneck quickly becomes moving the data between CPU and GPU (unless we are working on a system where CPU and GPU share memory).
Fortunately the entire GMRES algorithm is not so complex, and we can use hardware acceleration by simply translating the algorithm to use a fast computational library. There are several such libraries available for Python, such as CuPy, TensorFlow, DASK, PyTorch, Numba, and JAX.
In this context CuPy might be the easiest to use; its syntax is very similar to numpy. However, I would also like to make use of JIT (just-in-time) compilation, particularly since it can limit unnecessary data movement. Furthermore, which low-level CUDA functions are best to call really depends on the situation (especially for something like convolution), and JIT compilation can offer significant optimizations here.
TensorFlow, DASK and PyTorch are really focused on machine learning and neural networks, and the way we interact with these libraries might not be the best fit for this kind of algorithm. In fact, I tried to make an efficient GMRES implementation using these libraries, and I really struggled; I feel these libraries simply aren’t the right tool for this job.
Numba is also great; I could basically feed it the code I already wrote and it would probably compile the function and make it several times faster on CPU. Unfortunately, GPU support in Numba is still lacking quite a bit, and we would therefore leave quite a bit of performance on the table.
In the end we will implement it in JAX. Like CuPy, it has an API very similar to numpy, which makes the translation easy. It also supports JIT compilation, meaning we can potentially get much faster functions. Without further ado, let’s implement the GMRES algorithm in JAX and see what kind of speedup we can get.
import jax.numpy as jnp
import jax
# Define the linear operator
img_shape = img.shape
def do_convolution(x):
    return jax.scipy.signal.convolve2d(
        x.reshape(img_shape), gauss_psf_almost, mode="same"
    ).reshape(-1)
def gmres_jax(linear_map, b, x0, n_iter):
    # Initialization
    n = x0.shape[0]
    r0 = b - linear_map(x0)
    beta = jnp.linalg.norm(r0)
    V = jnp.zeros((n_iter + 1, n))
    V = V.at[0].set(r0 / beta)
    H = jnp.zeros((n_iter + 1, n_iter))

    def loop_body(j, pair):
        """
        One basic step of GMRES; compute new Krylov vector and orthogonalize.
        """
        H, V = pair
        w = linear_map(V[j])
        h = V @ w
        v = w - (V.T) @ h
        v_norm = jnp.linalg.norm(v)
        H = H.at[:, j].set(h)
        H = H.at[j + 1, j].set(v_norm)
        V = V.at[j + 1].set(v / v_norm)
        return H, V

    # Do n_iter iterations of basic GMRES step
    H, V = jax.lax.fori_loop(0, n_iter, loop_body, (H, V))
    # Solve the linear system in the basis V
    e1 = jnp.zeros(n_iter + 1)
    e1 = e1.at[0].set(beta)
    y = jnp.linalg.lstsq(H, e1, rcond=None)[0]
    # Convert result back to full basis and return
    x_new = x0 + V[:-1].T @ y
    return x_new
b = img_blur.reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 50
# Declare JIT compiled version of gmres_jax
gmres_jit = jax.jit(gmres_jax, static_argnums=[0, 3])
print("Compiling function:")
%time x = gmres_jit(do_convolution, b, x0, n_restart).block_until_ready()
print("\nProfiling functions. numpy version:")
%timeit x = gmres(linear_map, b, x0, n_restart)
print("\nProfiling functions. JAX version:")
%timeit x = gmres_jit(do_convolution, b, x0, n_restart).block_until_ready()
Compiling function:
CPU times: user 1.94 s, sys: 578 ms, total: 2.51 s
Wall time: 2.01 s
Profiling functions. numpy version:
263 ms ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Profiling functions. JAX version:
9.16 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With the JAX version running on my GPU, we get a 30x speedup! Not bad, if you ask me. If we run the same code on CPU, we still get a 4x speedup, which means the version compiled by JAX is already faster in its own right.
The code above may look a bit strange, and there are definitely some things that might need some explanation.
First of all, note that the first time we call gmres_jit it takes much longer than the subsequent calls. This is because the function is JIT (just-in-time) compiled. On the first call, JAX runs through the entire function and builds a big graph of all the operations that need to be done; it then optimizes (simplifies) this graph and compiles it to create a very fast function. This compilation step obviously takes some time, but the great thing is that we only need to do it once.
Note the way we create the function gmres_jit:
gmres_jit = jax.jit(gmres_jax, static_argnums=[0, 3])
Here we tell JAX that if the first or the fourth argument changes, the function needs to be recompiled. This is because both these arguments are python literals (the first is a function, the fourth is the number of iterations), whereas the other two arguments are arrays.
The shapes of the arrays V and H depend on the last argument n_iter. However, the compiler needs to know the shape of these arrays at compile time. Therefore, we need to recompile the function every time n_iter changes. The same is true for the linear_map argument; the shape of the vector w depends on linear_map in principle.
Next, consider the fact that there is no more for loop in the code; it is instead replaced by
H, V = jax.lax.fori_loop(0, n_iter, loop_body, (H, V))
We could in fact use a for loop here as well, and it would give an identical result, but it would take much longer to compile. The reason is that, as mentioned, JAX runs through the entire function and makes a graph of all the operations that need to be done. If we leave in the for loop, then each iteration of the loop adds more and more operations to the graph (the loop is ‘unrolled’), making a really big graph. By using jax.lax.fori_loop we can skip this, and end up with a much smaller graph to compile.
One disadvantage of this approach is that the size of all arrays needs to be known at compile time. In the original algorithm we did not compute (V.T) @ h, for example, but rather (V[:j+1].T) @ h. Now we can’t do that, because the size of V[:j+1] is not known at compile time. The end result is the same, because at iteration j we have V[j+1:] = 0. This does mean that over all the iterations we end up doing about double the work for this particular operation, but because the operation is so much faster on a GPU this is not a big problem.
As we can see, writing code for GPUs requires a bit more thought than writing code for CPUs. Sometimes we even end up with less efficient code, but this can be entirely offset by the improved speed of the GPU.
We see above that GMRES provides a very fast and accurate solution to the deconvolution problem. This has a lot to do with the fact that the convolution matrix is very well-conditioned. We can see this by looking at the singular values of this matrix. The convolution matrix for a 128x128 image is a bit too big to work with, but we can see what happens for 32x32 images.
N = 11
psf = gaussian_psf(sigma=1, N=N)
img_shape = (32, 32)
def create_conv_mat(psf, img_shape):
    tot_dim = np.prod(img_shape)
    def apply_psf(signal):
        signal = signal.reshape(img_shape)
        return convolve2d(signal, psf, mode="same").reshape(-1)
    conv_mat = np.zeros((tot_dim, tot_dim))
    for i in range(tot_dim):
        signal = np.zeros(tot_dim)
        signal[i] = 1
        conv_mat[i] = apply_psf(signal)
    return conv_mat
conv_mat = create_conv_mat(psf, img_shape)
svdvals = scipy.linalg.svdvals(conv_mat)
plt.plot(svdvals)
plt.yscale('log')
cond_num = svdvals[0]/svdvals[-1]
plt.title(f"Singular values. Condition number: {cond_num:.0f}")
As we can see, the condition number is only 4409, which makes the matrix very well-conditioned. Moreover, the singular values decay somewhat gradually. What’s more, the convolution matrix is actually symmetric and positive definite. This makes the linear system relatively easy to solve, and explains why it works so well.
This is because the kernel we use – the Gaussian kernel – is itself symmetric. For a non-symmetric kernel, the situation is more complicated. Below we show what happens for a non-symmetric kernel, the same type as we used before in the blind deconvolution series of blog posts.
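Before moving to 2D, a quick 1D sanity check of this claim (a hypothetical analogue of create_conv_mat): the ‘same’-mode convolution matrix of a symmetric kernel is symmetric, while a non-symmetric kernel gives a non-symmetric matrix.

```python
import numpy as np

def conv_matrix_1d(k, n):
    # Hypothetical 1D analogue of create_conv_mat: column i is the kernel
    # applied to the i-th unit vector
    M = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        M[:, i] = np.convolve(e, k, mode="same")
    return M

n = 16
k_sym = np.array([1.0, 2.0, 4.0, 2.0, 1.0])    # symmetric kernel
k_asym = np.array([4.0, 2.0, 1.0, 0.5, 0.25])  # non-symmetric kernel
M_sym = conv_matrix_1d(k_sym, n)
M_asym = conv_matrix_1d(k_asym, n)
```

Here `M_sym` equals its own transpose, while `M_asym` does not.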
from utils import random_motion_blur
N = 11
psf_gaussian = gaussian_psf(sigma=1, N=N)
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
img_shape = (32, 32)
# plot the kernels
plt.figure(figsize=(8, 4.5))
plt.subplot(1, 2, 1)
plt.imshow(psf_gaussian)
plt.title("Gaussian kernel")
plt.subplot(1, 2, 2)
plt.imshow(psf)
plt.title("Non-symmetric kernel")
plt.show()
# study convolution matrix
conv_mat = create_conv_mat(psf, img_shape)
plt.show()
eigs = scipy.linalg.eigvals(conv_mat)
plt.title(f"Eigenvalues")
plt.ylabel("Imaginary part")
plt.xlabel("Real part")
plt.scatter(np.real(eigs), np.imag(eigs), marker=".")
We see that the eigenvalues of this convolution matrix are distributed around zero. The convolution matrix for the gaussian kernel is symmetric and positive definite – all eigenvalues are positive real numbers. GMRES works really well when almost all eigenvalues lie in an ellipse not containing zero. That is clearly not the case here, and we in fact also see that GMRES doesn’t work well for this particular problem. (Note that we now switch to 256x256 images instead of 128x128, since our new implementation of GMRES is much faster)
img = image.imread("imgs/vitus256.png")
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
img_blur = convolve2d(img, psf, mode="same")
img_shape = img.shape
def do_convolution(x):
    res = jax.scipy.signal.convolve2d(
        x.reshape(img_shape), psf, mode="same"
    ).reshape(-1)
    return res
b = img_blur.reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 1000
n_its = 10
losses = []
for _ in range(n_its):
    x = gmres_jit(do_convolution, b, x, n_restart)
    error = np.linalg.norm(do_convolution(x) - b) ** 2
    losses.append(error)
Not only does it take many more iterations to converge, but the final result is unsatisfactory at best. Clearly, without further modifications, the GMRES method doesn’t work well for deconvolution with non-symmetric kernels.
As mentioned, GMRES works best when the eigenvalues of the matrix \(A\) are in an ellipse not including zero, which is not the case for our convolution matrix. There is fortunately a very simple solution to this: instead of solving the linear least-squares problem
\[\min_x \|Ax - b\|_2^2\]we solve the linear least-squares problem
\[\min_x \|A^\top A x - A^\top b\|^2\]This has the same solution, but the eigenvalues of \(A^\top A\) are much better behaved: a matrix of this form is always positive semi-definite, so all its eigenvalues are real and non-negative. They therefore fit inside an ellipse that doesn’t include zero, and we get much better convergence with GMRES. In general, we could multiply by any matrix \(B\) to obtain the linear least-squares problem
\[\min_x \|BAx-Bb\|^2\]If we choose \(B\) such that the spectrum (eigenvalues) of \(BA\) is nicer, then we can improve the convergence of GMRES. This trick is called preconditioning. Choosing a good preconditioner depends a lot on the problem at hand, and is the subject of a lot of research. In this context, \(B = A^\top\) turns out to function as an excellent preconditioner, as we shall see.
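A quick numerical check of the claim about \(A^\top A\) (using a hypothetical random matrix): the eigenvalues of a generic non-symmetric matrix are scattered around zero in the complex plane, while those of the normal-equations matrix are all real and non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.normal(size=(n, n))  # hypothetical generic non-symmetric matrix

eigs_A = np.linalg.eigvals(A)          # scattered around zero in the complex plane
eigs_AtA = np.linalg.eigvals(A.T @ A)  # eigenvalues of the normal-equations matrix
```

Some eigenvalues of `A` have negative real part, so no ellipse avoiding zero can contain them all; the eigenvalues of `A.T @ A` have no such problem.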
To apply this trick to the deconvolution problem, we need to be able to take the transpose of the convolution operation. Fortunately, this is equivalent to convolution with a reflected version \(\overline k\) of the kernel \(k\). That is, we will apply GMRES to the linear least-squares problem
\[\min_x \|\overline k *(k*x) - \overline k * y\|^2\]Let’s see this in action below.
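As a quick sanity check before using this trick: for an odd-sized kernel, ‘same’-mode convolution with the flipped kernel satisfies the adjoint identity \(\langle k*x,\, y\rangle = \langle x,\, \overline k * y\rangle\). A small sketch with a hypothetical random kernel:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
k = rng.normal(size=(5, 5))   # hypothetical odd-sized kernel
k_flipped = k[::-1, ::-1]     # reflected kernel
x = rng.normal(size=(32, 32))
y = rng.normal(size=(32, 32))

# <k * x, y> should equal <x, k_flipped * y>
lhs = np.sum(convolve2d(x, k, mode="same") * y)
rhs = np.sum(x * convolve2d(y, k_flipped, mode="same"))
```

The two inner products agree up to floating-point accuracy, confirming that flipping the kernel implements the transpose of the convolution operator.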
img = image.imread("imgs/vitus256.png")
psf = random_motion_blur(
    N=N, num_steps=20, beta=0.98, vel_scale=0.1, sigma=0.5, seed=42
)
psf_reversed = psf[::-1, ::-1]
img_blur = convolve2d(img, psf, mode="same")
img_shape = img.shape
def do_convolution(x):
    res = jax.scipy.signal.convolve2d(x.reshape(img_shape), psf, mode="same")
    res = jax.scipy.signal.convolve2d(res, psf_reversed, mode="same")
    return res.reshape(-1)
b = jax.scipy.signal.convolve2d(img_blur, psf_reversed, mode="same").reshape(-1)
x0 = jnp.zeros_like(b)
x = x0
n_restart = 100
n_its = 20
# run once to compile
gmres_jit(do_convolution, b, x, n_restart)
time_start = perf_counter_ns()
losses = []
for _ in range(n_its):
    x = gmres_jit(do_convolution, b, x, n_restart)
    error = np.linalg.norm(do_convolution(x) - b) ** 2
    losses.append(error)
time_taken = (perf_counter_ns() - time_start) / 1e9
print(f"Deconvolution in {time_taken:.2f} s")
Deconvolution in 1.40 s
Except for some ringing around the edges, this produces a very good result. Compared to other methods of deconvolution (as discussed in this blog post), it in fact shows far fewer ringing artifacts. It’s pretty fast as well. Even though it takes around 2000 iterations to converge, the difference between the image after 50 steps and after 2000 steps is not that big, visually speaking. Let’s see how the solution develops with different numbers of iterations:
x0 = jnp.zeros_like(b)
x = x0
results_dict = {}
for n_its in [1, 5, 10, 20, 50, 100]:
    x0 = jnp.zeros_like(b)
    # run once to compile
    gmres_jit(do_convolution, b, x0, n_its)
    time_start = perf_counter_ns()
    for _ in range(10):
        x = gmres_jit(do_convolution, b, x0, n_its)
    time_taken = (perf_counter_ns() - time_start) / 1e7
    results_dict[n_its] = (x, time_taken)
After just 100 iterations the result is pretty good, and this takes just 64ms. This makes it a viable method for deconvolution: roughly as fast as Richardson-Lucy deconvolution, but suffering less from boundary artifacts. The regularization methods we discussed in the deconvolution blog posts also work in this setting, and are good to use when there is noise, or when we don’t precisely know the convolution kernel. That is however out of the scope of this blog post.
GMRES is an easy-to-implement, fast, and robust method for solving structured linear systems where we only have access to matrix-vector products \(Ax\). It is often used for solving sparse systems, but as we have demonstrated, it can also be used for solving the deconvolution problem in a way that is competitive with existing methods. Sometimes a preconditioner is needed to get good performance out of GMRES, but choosing a good preconditioner can be difficult. And if we implement GMRES on a GPU, it can reach much higher speeds than on a CPU.
Before we dive into the details of our new type of machine learning model, let’s sit back for a moment and think: what is machine learning in the first place? Machine learning is all about learning from data. More specifically in supervised machine learning we are given some data points \(X = (x_1,\dots,x_N)\), all lying in \(\mathbb R^d\), together with labels \(y=(y_1,\dots,y_N)\) which are just numbers. We then want to find some function \(f\colon \mathbb R^d\to \mathbb R\) such that \(f(x_i)\approx y_i\) for all \(i\), and such that \(f\) generalizes well to new data. Or rather, we want to minimize a loss function, for example the least-squares loss
\[L(f) = \sum_{i=1}^N (f(x_i)-y_i)^2.\]This is obviously an ill-posed problem, and there are two main issues with it:
- Which class of functions \(f\) should we consider? Without restrictions, there are many functions that fit the training data perfectly.
- How do we make sure that \(f\) generalizes well to new data, rather than just memorizing the training data?
The first issue has no general solution. We choose some class of functions, usually that depend on some set of parameters \(\theta\). For example, if we want to fit a quadratic function to our data we only look at quadratic functions
\[f_{(a,b,c)}(x) = a + bx +cx^2,\]and our set of parameters is \(\theta=\{a,b,c\}\). Then we minimize the loss over this set of parameters, i.e. we solve the minimization problem:
\[\min_{a,b,c} \sum_{i=1}^N (a+ bx_i+cx_i^2-y_i)^2.\]There are many parametric families \(f_\theta\) of functions we can choose from, and many different ways to solve the corresponding minimization problem. For example, we can choose \(f_\theta\) to be neural networks with some specified layer sizes, or a random forest with a fixed number of trees and fixed maximum tree depth. Note that we should strictly speaking always specify hyperparameters like the size of the layers of a neural network, since those hyperparameters determine what kind of parameters \(\theta\) we are going to optimize. That is, hyperparameters affect the parametric family of functions that we are going to optimize.
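For the quadratic family above, this minimization is a linear least-squares problem in \((a,b,c)\) and can be solved directly. A minimal sketch with synthetic data (all numbers here are illustrative):

```python
import numpy as np

# Hypothetical 1D data generated from a known quadratic plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.01, size=50)

# Design matrix with columns [1, x, x^2]; minimizing the loss over (a, b, c)
# is then an ordinary linear least-squares problem
V = np.stack([np.ones_like(x), x, x**2], axis=1)
(a, b, c), *_ = np.linalg.lstsq(V, y, rcond=None)
print(a, b, c)  # close to (1, 2, -3)
```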
The second issue, generalization, is typically solved through cross-validation. If we want to know whether the function \(f_\theta\) we learned generalizes well to new data points, we should just keep part of the data “hidden” during the training (the test data). After training we then evaluate our trained function on this hidden data, and we record the loss function on this test data to obtain the test loss. The test loss is then a good measure of how well the function can generalize to new data, and it is very useful if we want to compare several different functions trained on the same data. Typically we use a third set of data, the validation dataset, for things like optimizing hyperparameters; see my blog post on the topic.
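A minimal sketch of this train/test protocol with scikit-learn (synthetic data and illustrative names, not this post's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Hypothetical 2D regression data
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = X[:, 0] + 2 * X[:, 1]

# Keep 20% of the data "hidden" during training
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# The loss on the hidden data is the test loss
test_loss = np.mean((model.predict(X_te) - y_te) ** 2)
print(f"{test_loss=:.4f}")
```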
Keeping the general problem of machine learning in mind, let’s consider a particular class of parametric functions: discretized functions on a grid. To understand this class of functions, we first look at the 1D case. Let’s take the interval \([0,1]\), and chop it up into \(n\) equal pieces:
\[[0,1] = [0,1/n]\cup[1/n,2/n]\cup\dots\cup[(n-1)/n,1]\]A discretized function is then one that takes a constant value on each subinterval. For example, below is a discretized version of a sine function:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

DEFAULT_FIGSIZE = (10, 6)
plt.figure(figsize=DEFAULT_FIGSIZE)
num_intervals = 10
num_plotpoints = 1000
x = np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints)

def f(x):
    return np.sin(x * 2 * np.pi)

plt.plot(x, f(x), label="original function")
plt.plot(
    x,
    f((np.floor(x * num_intervals) + 0.5) / num_intervals),
    label="discretized function",
)
plt.legend();
Note that if we divide the interval into \(n\) pieces, then we need \(n\) parameters to describe the discretized function \(f_\theta\).
In the 2D case we instead divide the square \([0,1]^2\) into a grid, and demand that a discretized function is constant on each grid cell. If we use \(n\) grid cells for each axis, this gives us \(n^2\) parameters. Let’s see what a discretized function looks like in a 3D plot:
fig = plt.figure(figsize=DEFAULT_FIGSIZE)
num_plotpoints = 200
num_intervals = 5

def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)

X_plotpoints, Y_plotpoints = np.meshgrid(
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
    np.linspace(0, 1 - 1 / num_plotpoints, num_plotpoints),
)

# Smooth plot
Z_smooth = f(X_plotpoints, Y_plotpoints)
ax = fig.add_subplot(121, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_smooth, cmap="inferno")
plt.title("original function")

# Discrete plot
X_discrete = (np.floor(X_plotpoints * num_intervals) + 0.5) / num_intervals
Y_discrete = (np.floor(Y_plotpoints * num_intervals) + 0.5) / num_intervals
Z_discrete = f(X_discrete, Y_discrete)
ax = fig.add_subplot(122, projection="3d")
ax.plot_surface(X_plotpoints, Y_plotpoints, Z_discrete, cmap="inferno")
plt.title("discretized function");
Before diving into higher-dimensional versions of discretized functions, let’s think about how we would solve the learning problem. As mentioned, we have \(n^2\) parameters, and we can encode this using an \(n\times n\) matrix \(\Theta\). We are doing supervised machine learning, so we have data points \(((x_1,y_1),\dots,(x_N,y_N))\) and corresponding labels \((z_1,\dots,z_N)\). Each data point \((x_i,y_i)\) corresponds to some entry \((j,k)\) in the matrix \(\Theta\); this is simply determined by the specific grid cell the data point happens to fall in.
If the points \(((x_{i_1},y_{i_1}),\dots,(x_{i_m},y_{i_m}))\) all fall into the grid cell \((j,k)\), then we can define \(\Theta[j,k]\) simply by the mean value of the labels for these points;
\[\Theta[j,k] = \frac{1}{m} \sum_{a=1}^m z_{i_a}.\]But what do we do if we have no training data corresponding to some entry \(\Theta[j,k]\)? Then the only thing we can do is make an educated guess based on the entries of the matrix we do know. This is the matrix completion problem; we are presented with a matrix with some known entries, and we are tasked to find good values for the unknown entries. We described this problem in some detail in the previous blog post.
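A minimal sketch of this cell-averaging step, with illustrative names and synthetic data (not the ttml implementation):

```python
import numpy as np

# Estimate Theta[j, k] as the mean label of training points falling in
# grid cell (j, k) of a uniform n x n grid on [0, 1]^2
rng = np.random.default_rng(0)
n = 4
X = rng.uniform(size=(500, 2))
z = X[:, 0] + X[:, 1]  # labels

# Cell index of each data point
j = np.minimum((X[:, 0] * n).astype(int), n - 1)
k = np.minimum((X[:, 1] * n).astype(int), n - 1)

sums = np.zeros((n, n))
counts = np.zeros((n, n))
np.add.at(sums, (j, k), z)
np.add.at(counts, (j, k), 1)

# Cells with no data stay NaN: these are exactly the entries that
# matrix completion has to fill in
Theta = np.divide(sums, counts, out=np.full((n, n), np.nan), where=counts > 0)
print(Theta)
```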
The main takeaway is this: to solve the matrix completion problem, we need to assume that the matrix has some extra structure. We typically assume that the matrix is of low rank \(r\), that is, we can write \(\Theta\) as a product \(\Theta=A B\) where \(A,B\) are of size \(n\times r\) and \(r\times n\) respectively. Intuitively, this is a useful assumption because now we only have to learn \(2nr\) parameters instead of \(n^2\). If \(r\) is much smaller than \(n\), then this is a clear gain.
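The factorization \(\Theta=AB\) for a (near-)low-rank matrix can be read off from the truncated SVD; a small sketch (sizes illustrative):

```python
import numpy as np

# Build an 8x8 matrix of exact rank 2
rng = np.random.default_rng(0)
n, r = 8, 2
A_true = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))

U, s, Vt = np.linalg.svd(A_true)
print(np.round(s, 6))  # only the first 2 singular values are nonzero

# Factor Theta = A @ B with A of shape (n, r) and B of shape (r, n):
# 2nr numbers instead of n^2
A = U[:, :r] * s[:r]
B = Vt[:r]
print(np.max(np.abs(A @ B - A_true)))  # essentially zero
```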
From the perspective of machine learning, this changes the class of functions we are considering. Instead of all discretized functions on our \(n\times n\) grid inside \([0,1]^2\), we now consider only those functions described by a matrix \(\Theta=AB\) that has rank at most \(r\). This also changes the parameters; instead of \(n^2\) parameters, we now only have the \(2nr\) parameters describing the two matrices \(A,B\).
Real data is often not uniform, so unless we use a very coarse grid, some entries of \(\Theta[j,k]\) are always going to be unknown. For example below we show some more realistic data, with the same function as before plus some noise. The color indicates the value of the function \(f\) we’re trying to learn.
num_intervals = 8
N = 50

# A function to make somewhat realistic looking 2D data
def non_uniform_data(N):
    np.random.seed(179)
    X = np.random.uniform(size=N)
    X = (X + 0.5) ** 2
    X = np.mod(X ** 5 + 0.2, 1)
    Y = np.random.uniform(size=N)
    Y = (Y + 0.5) ** 3
    Y = np.sin(Y * 0.2 * np.pi + 1) + 1
    Y = np.mod(Y + 0.6, 1)
    X = np.mod(X + 3 * Y + 0.5, 1)
    Y = np.mod(0.3 * X + 1.3 * Y + 0.5, 1)
    X = X ** 2 + 0.4
    X = np.mod(X, 1)
    Y = Y ** 2 + 0.5
    Y = np.mod(Y + X + 0.4, 1)
    return X, Y

# The function we want to model
def f(X, Y):
    return X + 2 * Y + 1.5 * ((X - 0.5) ** 2 + (Y - 0.5) ** 2)

X_train, Y_train = non_uniform_data(N)
X_test, Y_test = non_uniform_data(N)
Z_train = f(X_train, Y_train) + np.random.normal(size=X_train.shape) * 0.2
Z_test = f(X_test, Y_test) + np.random.normal(size=X_test.shape) * 0.2

plt.figure(figsize=(7, 6))
plt.scatter(X_train, Y_train, c=Z_train, s=50, cmap="inferno", zorder=3)
plt.colorbar()

# Plot a grid
X_grid = np.linspace(1 / num_intervals, 1, num_intervals)
Y_grid = np.linspace(1 / num_intervals, 1, num_intervals)
plt.xlim(0, 1)
plt.ylim(0, 1)
for perc in X_grid:
    plt.axvline(perc, c="gray")
for perc in Y_grid:
    plt.axhline(perc, c="gray")
We plotted an 8x8 grid on top of the data. We can see that in some grid squares we have a lot of data points, whereas in other squares there’s no data at all. Let’s try to fit a discretized function described by an 8x8 matrix of rank 3 to this data. We can do this using the ttml package I developed.
from ttml.tensor_train import TensorTrain
from ttml.tt_rlinesearch import TTLS

rank = 3

# Indices of the matrix Theta for each data point
idx_train = np.stack(
    [np.searchsorted(X_grid, X_train), np.searchsorted(Y_grid, Y_train)], axis=1
)
idx_test = np.stack(
    [np.searchsorted(X_grid, X_test), np.searchsorted(Y_grid, Y_test)], axis=1
)

# Initialize random rank 3 matrix
np.random.seed(179)
low_rank_matrix = TensorTrain.random((num_intervals, num_intervals), rank)

# Optimize the matrix using iterative method
optimizer = TTLS(low_rank_matrix, Z_train, idx_train)
train_losses = []
test_losses = []
for i in range(50):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=Z_test, idx=idx_test)
    test_losses.append(test_loss)

plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
Final training loss: 0.0252
Final test loss: 0.0424
Above we see how the train and test loss develops during training. At first both train and test loss decrease rapidly. Then both train and test loss start to decrease much more slowly, and training loss is less than test loss. This means that the model overfits on the training data, but this is not necessarily a problem; the question is how much it overfits compared to other models. To see how good this model is, let’s compare it to a random forest.
from sklearn.ensemble import RandomForestRegressor
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(np.stack([X_train, Y_train], axis=1), Z_train)
Z_pred = forest.predict(np.stack([X_test, Y_test], axis=1))
test_loss = np.mean((Z_pred - Z_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
Random forest test loss: 0.0369
We see that the random forest is a little better than the discretized function. And in fact, most standard machine learning estimators will beat a discretized function like this. This is essentially because the discretized function is very simple, and more complicated estimators can do a better job describing the data.
Does this mean that we should stop caring about the discretized function? Test loss is not the only criterion we should use to compare different estimators. Discretized functions like these have two big advantages: they need very few parameters, and they are very fast at making predictions.
This makes them excellent candidates for low-memory applications. For example, we may want to implement a machine learning model for a very cheap consumer device. If we don’t need extreme accuracy, and we pre-train the model on a more powerful device, discretized functions can be a very attractive option.
The generalization to \(d\)-dimensions is now straightforward; we take a \(d\)-dimensional grid on \([0,1]^d\), with \(n\) subdivisions in each axis. Then we specify the value of our function \(f_\Theta\) on each of the \(n^d\) grid cells. These \(n^d\) values form a tensor \(\Theta\), i.e. a \(d\)-dimensional array. We access the entries of \(\Theta\) with a \(d\)-tuple of indices \(\Theta[i_1,i_2,\dots,i_d]\).
This suffers from the same problems as in the 2D case; the tensor \(\Theta\) is really big, and during training we would need at least one data point for each entry of the tensor. But the situation is even worse: even storing \(\Theta\) can be prohibitively expensive. For example, if \(d=10\) and \(n=20\), then we would need about 82 TB just to store the tensor! In fact, \(n=20\) grid points in each direction is not even that much, so in practice we might need a much bigger tensor still.
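The arithmetic behind that figure, assuming 8-byte floats:

```python
# A dense d-dimensional tensor with n grid points per axis has n**d entries,
# each taking 8 bytes as a float64
d, n = 10, 20
bytes_full = n**d * 8
print(f"{bytes_full / 1e12:.1f} TB")  # 81.9 TB, i.e. about 82 TB
```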
In the 2D case we solved this problem by storing the matrix as the product of two smaller matrices. In the 2D case this doesn’t actually save that much on memory, and we mainly did it so that we can solve the matrix completion problem; that is, so that we can actually fit the discretized function to data. In higher dimensions however, storing the tensor in the right way can save immense amounts of space.
In the 2D case, we store matrices as a low rank matrix; as a product of two smaller matrices. But what is the correct analogue of ‘low rank’ for tensors? Unfortunately (or fortunately), there are many answers to this question. There are many ‘low rank tensor formats’, all with very different properties. We will be focusing on tensor trains. A tensor train decomposition of an \(n_1\times n_2\times \dots \times n_d\) tensor \(\Theta\) consists of a set of \(d\) cores \(C_k\) of shape \(r_{k-1}\times n_k \times r_k\), where \((r_1,\dots,r_{d-1})\) are the ranks of the tensor train. Using these cores we can then express the entries of \(\Theta\) using the following formula:
\[\Theta[i_1,\dots,i_d] = \sum_{k_1,\dots,k_{d-1}}C_1[1,i_1,k_1]C_2[k_1,i_2,k_2]\cdots C_{d-1}[k_{d-2},i_{d-1},k_{d-1}]C_d[k_{d-1},i_{d},1]\]This may look intimidating, but the idea is actually quite simple. We should think of the core \(C_{k}\) as a collection of \(n_k\) matrices \((C_k[1],\dots,C_k[n_k])\), each of shape \(r_{k-1}\times r_k\). The index \(i_k\) then selects which of these matrices to use. The first and last cores are special, by convention \(r_0=r_d=1\), this means that \(C_1\) is a collection of \(1\times r_1\) matrices, i.e. (row) vectors. Similarly, \(C_d\) is a collection of \(r_{d-1}\times 1\) matrices, i.e. (column) vectors. Thus each entry of \(\Theta\) is determined by a product like this:
row vector * matrix * matrix * … * matrix * column vector
The result is a number, since a row/column vector times a matrix is a row/column vector, and the product of a row and column vector is just a number. In fact, if we think about it, this is exactly how a low-rank matrix decomposition works as well. If we write a matrix \(\Theta = AB\), then
\[\Theta[i,j]=\sum_k A[i,k] B[k,j] = A[i,:]\cdot B[:,j].\]Here \(A[i,:]\) is a row of \(A\), and \(B[:,j]\) is a column of \(B\). In other words, \(A\) is just a collection of row vectors, and \(B\) is just a collection of column vectors. Then to obtain an entry \(\Theta[i,j]\), we select the \(i\text{th}\) row of \(A\) and the \(j\text{th}\) column of \(B\) and take the product.
In summary, a tensor train is a way to cheaply store large tensors. Assuming all ranks \((r_1,\dots,r_{d-1})\) are the same, a tensor train requires \(O(dr^2n)\) entries to store a tensor with \(O(n^d)\) entries; a huge gain if \(d\) and \(n\) are big. For context, if \(d=10\), \(n=20\), and \(r=10\) then instead of 82 TB we just need 131 KB to store the tensor; that’s about 9 orders of magnitude cheaper! Furthermore, computing entries of this tensor is cheap; it’s just a couple matrix-vector products.
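A sketch of this “select one matrix from each core and multiply” evaluation, checked against the dense tensor on a tiny example (names and sizes are illustrative, not from ttml):

```python
import numpy as np

# Evaluate one entry of a tensor train with cores of shape (r_{k-1}, n_k, r_k),
# where r_0 = r_d = 1
def tt_entry(cores, idx):
    v = cores[0][:, idx[0], :]  # 1 x r_1 row vector
    for core, i in zip(cores[1:], idx[1:]):
        v = v @ core[:, i, :]   # row vector times matrix (finally a column vector)
    return v[0, 0]              # the 1 x 1 result

rng = np.random.default_rng(0)
rank = 2
cores = [
    rng.normal(size=(1, 3, rank)),
    rng.normal(size=(rank, 4, rank)),
    rng.normal(size=(rank, 5, 1)),
]

# Dense 3x4x5 tensor obtained by contracting all cores
dense = np.einsum("aib,bjc,ckd->aijkd", *cores)[0, ..., 0]
print(np.isclose(tt_entry(cores, (1, 2, 3)), dense[1, 2, 3]))

# Storage count behind the "131 KB" figure: d = 10, n = 20, r = 10
d, n, r = 10, 20, 10
num_entries = 1 * n * r + (d - 2) * r * n * r + r * n * 1
print(num_entries * 8)  # 131200 bytes
```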
There is obviously a catch to this. Just like not every matrix is low-rank, not every tensor can be represented by a low-rank tensor train. The point, however, is that tensor trains can efficiently represent many tensors that we do care about. In particular, they are good at representing the tensors required for discretized functions.
How can we learn a discretized function \([0,1]^d\to \mathbb R\) represented by a tensor train? Like in the matrix case, many entries of the tensor are unobserved, and we have to complete these entries based on the entries that we can estimate. In my post on matrix completion we have seen that even the matrix case is tricky, and there are many algorithms to solve the problem. One thing these algorithms have in common is that they are iterative algorithms minimizing some loss function. Let’s derive such an algorithm for tensor train completion.
First of all, what is the loss function we want to minimize during training? It’s simply the least squares loss:
\[L(\Theta) = \sum_{j=1}^N(f_\Theta(x_j) - y_j)^2\]Each data point \(x_j\in [0,1]^d\) fits into some grid cell given by index \((i_1[j],i_2[j],\dots,i_d[j])\), so using the definition of the tensor train the loss \(L(\Theta)\) becomes
\[\begin{align*} L(\Theta) &= \sum_{j=1}^N (\Theta[i_1[j],i_2[j],\dots,i_d[j]] - y_j)^2\\ &= \sum_{j=1}^N(C_1[1,i_1[j],:]C_2[:,i_2[j],:]\cdots C_d[:,i_d[j],1] - y_j)^2 \end{align*}\]A straightforward approach to minimizing \(L(\Theta)\) is to just use gradient descent. We could compute the derivatives with respect to each of the cores \(C_i\) and update the cores using these derivatives. This is, however, very slow, for two reasons that are a bit subtle.
To efficiently optimize \(L(\Theta)\), we can’t just use gradient descent as-is, and we are forced to take a different route. While \(L(\Theta)\) is very non-linear as a function of the tensor train cores \(C_i\), it is only quadratic in the entries of \(\Theta\), and we can easily compute its derivative:
\[\nabla_{\Theta}L(\Theta) = 2\sum_{j=1}^N (\Theta[i_1[j],i_2[j],\dots,i_d[j]] - y_j)E(i_1[j],i_2[j],\dots,i_d[j]),\]where \(E(i_1,i_2,\dots,i_d)\) denotes a sparse tensor that’s zero in all entries except \((i_1,\dots,i_d)\) where it takes value \(1\). In other words, \(\nabla_{\Theta}L(\Theta)\) is a sparse tensor that is both simple and cheap to compute; it just requires sampling at most \(N\) entries of \(\Theta\). For gradient descent we would then update \(\Theta\) by \(\Theta-\alpha \nabla_{\Theta}L(\Theta)\) with \(\alpha\) the stepsize. Unfortunately, this expression is not a tensor train. However, we can try to approximate \(\Theta-\alpha \nabla_{\Theta}L(\Theta)\) by a tensor train of the same rank as \(\Theta\).
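To make the structure of this gradient concrete, here is a dense stand-in sketch (illustrative sizes; a real implementation would keep the gradient sparse rather than allocating the full tensor):

```python
import numpy as np

# The gradient of the least-squares loss with respect to the full tensor
# Theta is zero everywhere except in the (at most N) cells hit by data points
rng = np.random.default_rng(0)
shape = (4, 4, 4)
Theta = rng.normal(size=shape)
idx = rng.integers(0, 4, size=(20, 3))  # grid cell index of each data point
y = rng.normal(size=20)                 # labels

cells = tuple(idx.T)                    # index arrays for fancy indexing
residuals = Theta[cells] - y
grad = np.zeros(shape)
np.add.at(grad, cells, 2 * residuals)   # accumulate, handling repeated cells
print(np.count_nonzero(grad))           # at most N = 20 nonzero entries
```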
Recall that we can approximate a matrix \(A\) by a rank \(r\) matrix by using the truncated SVD of \(A\). In fact this is the best-possible approximation of \(A\) by a rank \(\leq r\) matrix. There is a similar procedure for tensor trains; we can approximate a tensor \(\Theta\) by a rank \((r_1,\dots,r_{d-1})\) tensor train using the TT-SVD procedure. While this is not the best approximation of \(\Theta\) by such a tensor train, it is ‘quasi-optimal’ and pretty good in practice. The details of the TT-SVD procedure are a little involved, so let’s leave it as a black box. We now have the following iterative procedure for optimizing \(L(\Theta)\):
\[\Theta_{k+1} \leftarrow \operatorname{TT-SVD}(\Theta_{k}-\alpha \nabla_{\Theta}L(\Theta_k) )\]If you’re familiar with optimizing neural networks, you might notice that this procedure could work very well with stochastic gradient descent. Indeed \(\nabla_{\Theta}L(\Theta)\) is a sum over all the data points, so we can just pick a subset of data points (a minibatch) to obtain a stochastic gradient. The reason we would want to do this is if we have so many data points that the cost of each step is dominated by computing the gradient. In our situation, however, this is not true: the cost is dominated by the TT-SVD procedure. We therefore stick to more classical gradient descent methods. In particular, the function \(L(\Theta)\) can be optimized well with conjugate gradient descent using Armijo backtracking line search.
Let’s now see all of this in practice. Let’s train a discretized function \(f_\Theta\) represented by a tensor train on some data using the technique described above. We will do this on a real dataset: the airfoil self-noise dataset. This NASA dataset contains experimental data about the self-noise of airfoils in a wind tunnel, originally used to optimize wing shapes. We can do the fitting and optimization using my ttml package. Let’s use a rank 5 tensor train with 10 grid points for each feature.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Load the data
airfoil_data = pd.read_csv(
    "airfoil_self_noise.dat", sep="\t", header=None
).to_numpy()
y = airfoil_data[:, 5]
X = airfoil_data[:, :5]
N, d = X.shape
print(f"Dataset has {N=} samples and {d=} features.")

# Do train-test split, and scale data to interval [0,1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=179
)
scaler = MinMaxScaler(clip=True)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define grid, and find associated indices for each data point
num_intervals = 10
grids = [np.linspace(1 / num_intervals, 1, num_intervals) for _ in range(d)]
tensor_shape = tuple(len(grid) for grid in grids)
idx_train = np.stack(
    [np.searchsorted(grid, X_train[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)
idx_test = np.stack(
    [np.searchsorted(grid, X_test[:, i]) for i, grid in enumerate(grids)],
    axis=1,
)

# Initialize the tensor train
np.random.seed(179)
rank = 5
tensor_train = TensorTrain.random(tensor_shape, rank)

# Optimize the tensor train using iterative method
optimizer = TTLS(tensor_train, y_train, idx_train)
train_losses = []
test_losses = []
for i in range(100):
    train_loss, _, _ = optimizer.step()
    train_losses.append(train_loss)
    test_loss = optimizer.loss(y=y_test, idx=idx_test)
    test_losses.append(test_loss)

plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(train_losses, label="Training loss")
plt.plot(test_losses, label="Test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
print(f"Final training loss: {train_loss:.4f}")
print(f"Final test loss: {test_loss:.4f}")
Dataset has N=1503 samples and d=5 features.
Final training loss: 15.3521
Final test loss: 54.4698
We see a similar training profile to the matrix completion case. Let’s see now how this estimator compares to a random forest trained on the same data:
np.random.seed(179)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"Random forest test loss: {test_loss:.4f}")
Random forest test loss: 3.2568
The random forest has a loss of around 3.3, but the discretized function has a loss of around 54.5! That gap in performance is completely unacceptable. We could try to improve it by increasing the number of grid points, and by tweaking the rank of the tensor train. However, it will still come nowhere close to the performance of a random forest, even with its default parameters. Even the training error of the discretized function is much worse than the test error of the random forest.
Why is it so bad? Bad initialization!
Recall that a gradient descent method converges to a local minimum of the function. Usually we hope that whatever local minimum we converge to is ‘good’. Indeed for neural networks we see that, especially if we use a lot of parameters, most local minima found by stochastic gradient descent are quite good, and give a low train and test error. This is not true for our discretized function. We converge to local minima that have both bad train and test error.
The solution? Better initialization!
Instead of initializing the tensor trains randomly, we can learn from other machine learning estimators. We fit our favorite machine learning estimator (e.g. a neural network) to the training data. This gives a function \(g\colon [0,1]^d\to \mathbb R\). This function is defined for any input, not just for the training/test data points. Therefore we can try to first fit our discretized function \(f_\Theta\) to match \(g\), i.e. we solve the following minimization problem:
\[\min_\Theta \|f_\Theta - g\|^2\]One way to solve this minimization problem is by first (randomly) sampling a lot of new data points \((x_1,\dots,x_N)\in [0,1]^d\) and then fitting \(f_\Theta\) to these data points with labels \((g(x_1),\dots,g(x_N))\). This is essentially data augmentation, and can drastically increase the number of data points available for training. With more training data, the function \(f_\Theta\) will indeed converge to a better local minimum.
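A sketch of this augmentation step, using a random forest as the stand-in for \(g\) (all names and sizes here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a helper model g on the (synthetic) training data
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(200, 3))
y_train = X_train.sum(axis=1)
g = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Sample many new points in [0, 1]^d and label them with g's predictions;
# (X_aug, y_aug) can then be used to fit f_Theta
X_aug = rng.uniform(size=(10_000, 3))
y_aug = g.predict(X_aug)
print(X_aug.shape, y_aug.shape)
```

Instead of sampling uniformly, one can also choose the sample points adaptively, which is where the cross-approximation idea discussed next comes in.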
While data augmentation does improve performance, we can do better. We don’t need to randomly sample data points \((x_1,\dots,x_N)\in[0,1]^d\). Instead we can choose good points to sample; points that give us the most information on how to efficiently update the tensor train. This is essentially the idea behind the tensor train cross approximation algorithm, or TT-Cross for short. Using TT-Cross we can quickly and efficiently get a good approximation to the minimization problem \(\min_\Theta \|f_\Theta - g\|^2\).
We could stop here. If \(g\) models our data really well, and \(f_\Theta\) approximates \(g\) really well, then we should be happy. Like the matrix completion model, discretized functions based on tensor trains are fast and are memory efficient. Therefore we can make an approximation of \(g\) that uses less memory and can make faster predictions! However, the model \(g\) really should be used for initialization only. Usually \(f_\Theta\) actually doesn’t do a great job of approximating \(g\), but if we first approximate \(g\), and then use a gradient descent algorithm to improve \(f_\Theta\) even further, we end up with something much more competitive.
Let’s see this in action. This is actually much easier than what we did before, because I wrote the ttml
package specifically for this use case.
from ttml.ttml import TTMLRegressor
# Use random forest as base estimator
forest = RandomForestRegressor()
# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None)
tt.fit(X_train, y_train, X_val=X_test, y_val=y_test)
y_pred = tt.predict(X_test)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss: {test_loss:.4f}")
# Forest is fit on same data during fitting of tt
# Let's also report how good the forest does
y_pred_forest = forest.predict(X_test)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss: {test_loss_forest:.4f}")
# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend()
plt.yscale("log")
TTML test loss: 2.8970
Random forest test loss: 3.2568
We see that using a random forest for initialization gives a huge improvement to both training and test loss. In fact, the final test loss is better than that of the random forest itself! On top of that, this estimator doesn’t use many parameters:
print(f"TT uses {tt.ttml_.num_params} parameters")
TT uses 1356 parameters
Let’s compare that to the random forest. If we look under the hood, the scikit-learn implementation of random forests stores 8 parameters per node in each tree in the forest. This is inefficient, and you really only need 2 parameters per node, so let’s use that.
num_params_forest = sum(
    len(tree.tree_.__getstate__()["nodes"]) * 2 for tree in forest.estimators_
)
print(f"Forest uses {num_params_forest} parameters")
Forest uses 303180 parameters
That’s 1356 parameters vs. more than 300,000 parameters! What about my claim of prediction speed? Let’s compare the amount of time it takes both estimators to predict 1 million samples. We do this by just concatenating the training data until we get 1 million samples.
from time import perf_counter_ns

target_num = int(1e6)
n_copies = target_num // len(X_train) + 1
X_one_million = np.repeat(X_train, n_copies, axis=0)[:target_num]
print(f"{X_one_million.shape=}")

time_before = perf_counter_ns()
tt.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by TT: {time_taken:.0f}ms")

time_before = perf_counter_ns()
forest.predict(X_one_million)
time_taken = (perf_counter_ns() - time_before) / 1e6
print(f"Time taken by Forest: {time_taken:.0f}ms")
X_one_million.shape=(1000000, 5)
Time taken by TT: 430ms
Time taken by Forest: 2328ms
While not by orders of magnitude, we see that the tensor train model is faster. You might be thinking that this is just because the tensor train has fewer parameters, but this is not the case. Even if we use a very high-rank tensor train with high-dimensional data, it is still going to be fast. The speed scales really well, and will beat most conventional machine learning estimators.
With good initialization, the model based on discretized functions performs really well. On our test dataset the model is fast, uses few parameters, and beats a random forest in test loss (in fact, it is the best estimator I have found so far for this problem). This is great! I should publish a paper at NeurIPS and get a job at Google! Well… let’s not get ahead of ourselves. It performs well on this particular dataset, yes, but how does it fare on other data?
As we shall see, it doesn’t do all that well actually. The airfoil self-noise dataset is a very particular dataset on which this algorithm excels. The model seems to perform well on data that can be described by a somewhat smooth function, and doesn’t deal well with the noisy and stochastic nature of most data we encounter in the real world. As an example let’s repeat the experiment, but let’s first add some noise:
from ttml.ttml import TTMLRegressor
X_noise_std = 1e-6
X_train_noisy = X_train + np.random.normal(0, X_noise_std, size=X_train.shape)
X_test_noisy = X_test + np.random.normal(scale=X_noise_std, size=X_test.shape)
# Use random forest as base estimator
forest = RandomForestRegressor()
# Fit tt on random forest, and then optimize further on training data
np.random.seed(179)
tt = TTMLRegressor(forest, max_rank=5, opt_tol=None, opt_steps=50)
tt.fit(X_train_noisy, y_train, X_val=X_test_noisy, y_val=y_test)
y_pred = tt.predict(X_test_noisy)
test_loss = np.mean((y_pred - y_test) ** 2)
print(f"TTML test loss (noisy): {test_loss:.4f}")
# Forest is fit on same data during fitting of tt
# Let's also report how good the forest does
y_pred_forest = forest.predict(X_test_noisy)
test_loss_forest = np.mean((y_pred_forest - y_test) ** 2)
print(f"Random forest test loss (noisy): {test_loss_forest:.4f}")
# Training and test loss are also recorded during optimization, let's plot them
plt.figure(figsize=(DEFAULT_FIGSIZE))
plt.plot(tt.history_["train_loss"], label="Training loss")
plt.plot(tt.history_["val_loss"], label="Test loss")
plt.axhline(test_loss_forest, c="g", ls="--", label="Random forest test loss")
plt.xlabel("Number of iterations")
plt.ylabel("Loss")
plt.legend();
TTML test loss (noisy): 7.1980
Random forest test loss (noisy): 5.1036
Even a tiny bit of noise in the training data can severely degrade the model. We see that it starts to overfit a lot. This is because my algorithm tries to automatically find a ‘good’ discretization of the data, not just a uniform discretization as we have discussed in our 2D example (i.e. equally spacing all the grid cells). Some of the variables in this dataset are however categorical, and a small amount of noise makes it much more difficult to automatically detect a good way to discretize them.
The model has a lot of hyperparameters we won’t go into now, and playing with them does help with overfitting. Furthermore, the noisy data we show here is perhaps not very realistic. However, the fact remains that the model (at least the way it’s currently implemented) is not very robust to noise. In particular, the model is very sensitive to the discretization of the feature space used.
Right now we don’t have anything better than simple heuristics for discretizing the feature space. Since the loss function depends on the discretization in a highly discontinuous way, optimizing the discretization directly is difficult. Perhaps we could adaptively split and merge the thresholds used in the discretization, or use some kind of clustering algorithm. I have tried things along those lines, but getting them to work well is hard. I think that with more study the problem of finding a good discretization can be solved, but it’s not easy.
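To make the idea of a discretization heuristic concrete, here is a minimal sketch of one such heuristic: quantile-based thresholds, which give bins of roughly equal occupancy. The function name and parameters are my own for illustration; this is not the actual TTML implementation.

```python
import numpy as np

def quantile_thresholds(x, n_cells):
    """Thresholds splitting a 1D feature into n_cells bins of roughly
    equal occupancy (a simple heuristic, not the actual TTML code)."""
    qs = np.linspace(0, 1, n_cells + 1)[1:-1]
    return np.quantile(x, qs)

rng = np.random.default_rng(0)
x = rng.exponential(size=10_000)  # a skewed continuous feature
thresholds = quantile_thresholds(x, 8)
counts = np.bincount(np.searchsorted(thresholds, x), minlength=8)
print(counts)
```

Even this simple heuristic would struggle with categorical features disguised as numbers, which is exactly the failure mode described above.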
We looked at discretized functions and their use in supervised machine learning. In higher dimensions discretized functions are parametrized by tensors, which we can represent efficiently using tensor trains. The tensor train can be optimized directly on the data to produce a potentially useful machine learning model. It is both very fast, and doesn’t use many parameters. In order to initialize it well, we can first fit an auxiliary machine learning model on the same data, and then sample predictions from that model to effectively increase the amount of training data. This model performs really well on some datasets, but in general it is not very robust to noise. As a result, without further improvements, the model will only be useful in a select number of cases. On the other hand, I really think that the model does have a lot of potential, once some of its drawbacks are fixed.
Often if we have an \(m\times n\) matrix, we can write it as the product of two smaller matrices. If such a matrix has rank \(r\), then we can write it as the product of an \(m\times r\) and \(r\times n\) matrix. Equivalently, this is the number of linearly independent columns or rows the matrix has, or if we see the matrix as a linear map \(\mathbb R^m\to \mathbb R^n\), then it is the dimension of the image of this linear map.
In practice we can figure out the rank of a matrix by computing its singular value decomposition (SVD). If you studied data science or statistics, then you have probably seen principal component analysis (PCA); this is very closely related to the SVD. Using the SVD we can write a matrix \(X\) as a product
\[X = U S V\]where \(U\) and \(V\) are orthogonal matrices, and \(S\) is a diagonal matrix. The values on the diagonal of \(S\) are known as the singular values of \(X\). The matrices \(U\) and \(V\) also have nice interpretations: the columns of \(U\) form an orthonormal basis of the column space of \(X\), and the rows of \(V\) form an orthonormal basis of the row space of \(X\).
In numpy we can compute the SVD of a matrix using np.linalg.svd. Below we compute it and verify that indeed \(X = U S V\):
import numpy as np
# Generate a random 10x20 matrix of rank 5
m, n, r = (10, 20, 5)
A = np.random.normal(size=(m, r))
B = np.random.normal(size=(r, n))
X = A @ B
# Compute the SVD
U, S, V = np.linalg.svd(X, full_matrices=False)
# Confirm U S V = X
np.allclose(U @ np.diag(S) @ V, X)
True
Note that we called np.linalg.svd with the keyword full_matrices=False. If left to the default value True, then in this case V would be a \(20\times 20\) matrix, as opposed to the \(10\times 20\) matrix it is now. Also, S is returned as a 1D array, and we can convert it to a diagonal matrix using np.diag. Finally, the function np.allclose checks whether all the entries of two matrices are almost the same; they will never be exactly the same due to numerical error.
As mentioned before, we can use the singular values S to determine the rank of the matrix X. This becomes obvious if we plot the singular values:
import matplotlib.pyplot as plt
DEFAULT_FIGSIZE = (8, 5)
plt.figure(figsize=DEFAULT_FIGSIZE)
plt.plot(np.arange(1, len(S) + 1), S, "o")
plt.xticks(np.arange(1, len(S) + 1))
plt.yscale("log")
plt.title("Plot of singular values")
We see that the first five singular values are roughly the same size, but the last five are much smaller: on the order of machine epsilon.
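In fact numpy automates this check for us: np.linalg.matrix_rank computes the SVD internally and counts the singular values above a tolerance.

```python
import numpy as np

# Build another random rank-5 matrix and let numpy count the rank:
# np.linalg.matrix_rank computes the SVD and counts singular values
# above a tolerance based on the largest one and machine epsilon.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5)) @ rng.normal(size=(5, 20))
print(np.linalg.matrix_rank(X))  # 5
```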
Knowing the matrix has rank 5, can we write it as the product of two rank-5 matrices? Absolutely! We do this using the SVD, or rather the truncated singular value decomposition. Since the last 5 values of S are very close to zero, we can simply ignore them. This means dropping the last 5 columns of U and the last 5 rows of V. Finally, we just need to ‘absorb’ the singular values into one of the two matrices U or V. This way we write X as the product of a \(10\times 5\) and a \(5\times 20\) matrix.
A = U[:, :r] * S[:r]
B = V[:r, :]
print(A.shape, B.shape)
np.allclose(A @ B, X)
(10, 5) (5, 20)
True
We rarely encounter real-world data that can be exactly represented by a low rank matrix using the truncated SVD. But we can still use the truncated SVD to get a good approximation of the data.
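How good is that approximation? The error is controlled entirely by the discarded singular values: for the rank-\(k\) truncation, the Frobenius-norm error is exactly \(\sqrt{\sum_{i>k}\sigma_i^2}\), and by the Eckart–Young theorem no other rank-\(k\) matrix does better. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 40))
U, S, V = np.linalg.svd(X, full_matrices=False)

k = 10
X_k = U[:, :k] @ np.diag(S[:k]) @ V[:k, :]
# Frobenius error of the truncation equals the norm of the
# dropped singular values
err = np.linalg.norm(X - X_k)
print(err, np.sqrt(np.sum(S[k:] ** 2)))
```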
Let us look at the singular values of an image of the St. Vitus church in my hometown. Note that a black-and-white image is really just a matrix.
from matplotlib import image
# Load and plot the St. Vitus image
plt.figure(figsize=(14, 5))
plt.subplot(1, 2, 1)
img = image.imread("vitus512.png")
img = img / np.max(img) # make entries lie in range [0,1]
plt.imshow(img, cmap="gray")
plt.axis("off")
# Compute and plot the singular values
plt.subplot(1, 2, 2)
plt.title("Singular values")
U, S, V = np.linalg.svd(img)
plt.yscale("log")
plt.plot(S)
We see here that the first few singular values are much larger than the rest, followed by a slow decay, and then finally a sharp drop at the very end. Note that there are 512 singular values, because this is a 512x512 image.
Let’s now see what happens if we compress this image as a low-rank matrix using the truncated singular value decomposition. We will look at what the image becomes as a rank 10, 20, 50, or 100 matrix.
plt.figure(figsize=(12, 12))
# Compute the SVD once; each truncation reuses it
U, S, V = np.linalg.svd(img)
for i, rank in enumerate([10, 20, 50, 100]):
    # Compute the truncated SVD of the given rank
    img_compressed = U[:, :rank] @ np.diag(S[:rank]) @ V[:rank, :]
    # Plot the image
    plt.subplot(2, 2, i + 1)
    plt.title(f"Rank {rank}")
    plt.imshow(img_compressed, cmap="gray")
    plt.axis("off")
We see that even the rank 10 and 20 images are recognizable, but with heavy artifacts. The rank 50 image looks pretty good, though not as good as the original, while the rank 100 image looks really close to the original.
How big is the compression if we do this? Well, if we write the image as a rank 10 matrix, we need two 512x10 matrices to store the image, which adds up to 10240 parameters, as opposed to the original 262144 parameters; a decrease in storage of more than 25 times! On the other hand, the rank 100 image is only about 2.6 times smaller than the original. Note that this is not a good image compression algorithm; the SVD is relatively expensive to compute, and other compression algorithms can achieve higher compression ratios with less image degradation.
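The arithmetic above generalizes: a rank-\(r\) factorization of an \(m\times n\) matrix stores \(r(m+n)\) numbers instead of \(mn\), so the compression ratio is \(mn/r(m+n)\):

```python
def compression_ratio(m, n, r):
    """How many times fewer parameters the rank-r factorization stores."""
    return (m * n) / (r * (m + n))

print(compression_ratio(512, 512, 10))   # 25.6
print(compression_ratio(512, 512, 100))  # 2.56
```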
The conclusion we can draw from this is that we can use truncated SVD to compress data. However, not all data can be compressed as efficiently by this method. It depends on the distribution of singular values; the faster the singular values decay, the better a low rank decomposition is going to approximate our data. Images are not good examples of data that can be compressed efficiently as a low rank matrix.
One reason why it’s difficult to compress images is because they contain many sharp edges and transitions. Low rank matrices are especially bad at representing diagonal lines. For example, the identity matrix is a diagonal line seen as an image, and it is also impossible to compress using an SVD since all singular values are equal.
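We can check the claim about the identity matrix directly:

```python
import numpy as np

# All singular values of the identity are equal, so a truncated SVD
# cannot drop any of them without losing a lot of information
S = np.linalg.svd(np.eye(100), compute_uv=False)
print(S.min(), S.max())  # both 1.0
```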
On the other hand, images without any sharp transitions can be approximated quite well using low-rank matrices. Such images rarely occur as natural images; rather, they arise as discrete representations of smooth functions \([0,1]^2 \to\mathbb R\). For example, below we show a two-dimensional discretized sum of trigonometric functions and its singular value decomposition.
# Make a grid of 100 x 100 values between [0,1]
x = np.linspace(0, 1, 100)
y = np.linspace(0, 1, 100)
x, y = np.meshgrid(x, y)
# A smooth trigonometric function
def f(x, y):
    return np.sin(200 * x + 75 * y) + np.sin(50 * x) + np.cos(100 * y)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
X = f(x, y)
plt.imshow(X)
plt.subplot(1, 2, 2)
U, S, V = np.linalg.svd(X)
plt.plot(S)
plt.yscale("log")
plt.title("Singular values")
print(f"The matrix is approximately of rank: {np.sum(S>1e-12)}")
The matrix is approximately of rank: 4
We see that this particular function can be represented by a rank 4 matrix! This is not obvious from looking at the image. In this kind of situation a low-rank matrix decomposition is much better than many image compression algorithms; here we can reconstruct the image using only 8% of the parameters. (Although more advanced image compression algorithms based on wavelets will actually compress this very well too.)
Recall that a low-rank matrix approximation can require far fewer parameters than the dense matrix it approximates. One powerful consequence is that we can recover the dense matrix even when we only observe a small part of it, that is, when we have many missing values.
In the case above we can represent the 100x100 matrix \(X\) as the product of a 100x4 matrix \(A\) and a 4x100 matrix \(B\), which have in total 800 parameters instead of 10,000. We can actually recover this low-rank decomposition from a small subset of the entries of the dense matrix. Suppose that we observe the entries \(X_{ij}\) for \((i,j)\) in an index set \(\Omega\). We can recover \(A\) and \(B\) by solving the following least-squares problem:
\[\min_{A,B}\sum_{(i,j)\in \Omega}((AB)_{ij}-X_{ij})^2\]This problem is, however, non-convex and not straightforward to solve. Fortunately there is a trick: we can alternately fix \(A\) and optimize \(B\), and vice versa. This is known as Alternating Least Squares (ALS) optimization, and it works well in this case. If we fix \(A\), observe that the minimization problem decouples into a separate linear least-squares problem for each column of \(B\):
\[\min_{B_{\bullet k}} \sum_{(i,j)\in \Omega,\,j=k} (\langle A_{i\bullet},B_{\bullet k}\rangle-X_{ik})^2\]Below we use this approach to recover the same matrix as before from 2000 data points, and we see that it does so with very low error:
N = 2000
n = 100
r = 4
# Sample N=2000 random indices
Omega = np.random.choice(n * n, size=N, replace=False)
Omega = np.unravel_index(Omega, X.shape)
y = X[Omega]
# Use random initialization for matrices A,B
A = np.random.normal(size=(n, r))
B = np.random.normal(size=(r, n))
def linsolve_regular(A, b, lam=1e-4):
    """Solve the linear problem A @ x = b with Tikhonov regularization /
    ridge regression."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

losses = []
for it in range(40):
    loss = np.mean(((A @ B)[Omega] - y) ** 2)
    losses.append(loss)
    # Update B column by column
    for j in range(n):
        B[:, j] = linsolve_regular(A[Omega[0][Omega[1] == j]], y[Omega[1] == j])
    # Update A row by row
    for i in range(n):
        A[i, :] = linsolve_regular(B[:, Omega[1][Omega[0] == i]].T, y[Omega[0] == i])
# Plot the input image
plt.figure(figsize=(12, 12))
plt.subplot(2, 2, 1)
plt.title("Input image")
S = np.zeros((n, n))
S[Omega] = y
plt.imshow(S)
# Plot reconstructed image
plt.subplot(2, 2, 2)
plt.title("Reconstructed image")
plt.imshow(A @ B)
# Plot training loss
plt.subplot(2, 1, 2)
plt.title("Mean square error loss during training")
plt.plot(losses)
plt.yscale("log")
plt.xlabel("steps")
plt.ylabel("Mean squared error")
Let’s consider a particularly interesting use of matrix completion – collaborative filtering. Think about how services like Netflix may recommend new shows or movies to watch. They know which movies you like, and they know which movies other people like. Netflix then recommends movies that are liked by people with a similar taste to yours. This is called collaborative filtering, because different people collaborate to filter out movies so that we can make a recommendation.
But can we do this in practice? Well, for every user we can put their personal ratings of every movie they watched in a big matrix. In this matrix each row represents a movie, and each column a user. Most users have only seen a small fraction of all the movies on the platform, so the overwhelming majority of the entries of this matrix are unknown. Then we apply matrix completion to this matrix. Each entry of the completed matrix then represents the rating we think the user would give to a movie, even if they have never watched it.
In 2006 Netflix opened a competition with a grand prize of $1,000,000 (!!) to solve precisely this problem. The data consists of more than 100 million ratings by 480,189 users of 17,769 different movies. The size of this dataset immediately poses a practical problem: if we put it in a matrix with 64-bit floating-point entries, it would require about 68 gigabytes of RAM. Fortunately we can avoid this problem by using sparse matrices. This makes the implementation a little harder, but certainly still feasible.
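A back-of-the-envelope check of those storage requirements, using the dataset sizes quoted above (and assuming three numbers per rating in coordinate format for the sparse version):

```python
n_users, n_movies = 480_189, 17_769
n_ratings = 100_000_000  # "more than 100 million"

# Dense float64 matrix: 8 bytes per entry
dense_gb = n_users * n_movies * 8 / 1e9
# Sparse COO triplets: two int32 indices plus one float32 value per rating
sparse_gb = n_ratings * (4 + 4 + 4) / 1e9
print(f"dense: {dense_gb:.1f} GB, sparse: {sparse_gb:.1f} GB")
```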
We will also need to upgrade our matrix completion algorithm. The algorithm described above is slow for very large matrices, and suffers from numerical stability problems due to the way it decouples into many smaller linear problems. Recall that we complete a matrix \(X\) by solving the following optimization problem:
\[\min_{A,B}\sum_{(i,j)\in \Omega}((AB)_{ij}-X_{ij})^2.\]We will first rewrite the problem as follows:
\[\min_{A,B}\|P_\Omega(AB) -X\|.\]Here \(P_\Omega\) denotes the operation of setting all entries \((AB)_{ij}\) to zero if \((i,j)\notin \Omega\). In other words, \(P_\Omega\) turns \(AB\) into a sparse matrix with the same sparsity pattern as \(X\). In some sense, the issue with this optimization problem is that only a small fraction of the entries of \(AB\) affect the objective. We can solve this by adding a new matrix \(Z\) such that \(P_\Omega(Z)=X\), and then using \(A,B\) to approximate \(Z\) instead:
\[\min_{A,B,Z}\|AB-Z\|\quad \text{such that } P_\Omega(Z) = X\]This problem can then be solved using the same alternating least-squares approach as before. For example, if we fix \(A,B\), then the optimal value of \(Z\) is given by \(Z = AB+X-P_\Omega(AB)\), and at each iteration we can update \(A\) and \(B\) by solving a linear least-squares problem. It is important to note that this way \(Z\) is a sum of a low-rank and a sparse matrix at every step, which allows us to still store and manipulate it efficiently.
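A minimal dense sketch of this scheme (small enough that we can keep \(Z\) as a full matrix; a real implementation would keep it in low-rank-plus-sparse form, and the sizes and ranks below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 30, 40, 3
X_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
mask = rng.random((m, n)) < 0.5        # the observed index set Omega
X_obs = np.where(mask, X_true, 0.0)

A = rng.normal(size=(m, r))
B = rng.normal(size=(r, n))
losses = []
for _ in range(100):
    losses.append(np.mean((A @ B - X_true)[mask] ** 2))
    # Z-update: agree with the data on Omega, with AB elsewhere
    Z = np.where(mask, X_obs, A @ B)
    # With Z fixed, A and B are ordinary dense least-squares updates
    A = np.linalg.lstsq(B.T, Z.T, rcond=None)[0].T
    B = np.linalg.lstsq(A, Z, rcond=None)[0]
print(losses[0], losses[-1])
```

The loss on the observed entries drops by orders of magnitude over the iterations, even though each step only solves two small dense least-squares problems.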
Although not very difficult, the implementation of this algorithm is a little too technical for this blog post. Instead we can just look at the results. I used this algorithm to fit matrices \(A\) and \(B\) of rank 5 and of rank 10 to the Netflix prize dataset. I used 3000 iterations of training, taking the better part of a day to train on my computer. I could probably do more, but I’m too impatient. The progress of training is shown below.
import os.path
plt.figure(figsize=DEFAULT_FIGSIZE)
DATASET_PATH = "/mnt/games/datasets/netflix/"
for r in [10, 5]:
    model = np.load(os.path.join(DATASET_PATH, f"rank-{r}-model.npz"))
    A = model["X"]
    B = model["Y"]
    train_errors = model["train_errors"]
    test_errors = model["test_errors"]
    plt.plot(np.sqrt(train_errors), label=f"Train rank {r}")
    plt.plot(np.sqrt(test_errors), label=f"Test rank {r}")
plt.legend()
plt.ylim(0.8, 1.5)
plt.xlabel("Training iterations")
plt.ylabel("Root mean squared error (RMSE)");
We see that the training errors for the rank 5 and rank 10 models are virtually identical, but the test error is lower for the rank 5 model. We can interpret this as the rank 10 model overfitting more, which is often the case for more complex models.
Next, how can we use this model? Well, the rows of the matrix \(A\) correspond to movies, and the columns of matrix \(B\) correspond to users. So if we want to know how much user #179 likes movie #2451 (Lord of the Rings: The Fellowship of the Ring), then we compute \(A[2451]\cdot B[:, 179]\):
A[2451] @ B[:, 179]
4.411312294862265
We see that the expected rating (out of 5) for this user and movie is about 4.41. So we expect that this user will like this movie, and we may choose to recommend it.
But we want to find the best recommendation for this user. To do this we can simply compute the product \(A \cdot B[:,179]\), which gives a vector with the expected rating for every single movie, and then sort. Below are the 5 movies with the highest and lowest expected ratings for this user.
import pandas as pd
movies = pd.read_csv(
    os.path.join(DATASET_PATH, "movie_titles.csv"),
    names=["index", "year", "name"],
    usecols=[2],
)
movies["ratings-179"] = A @ B[:, 179]
movies.sort_values("ratings-179", ascending=False)
| | name | ratings-179 |
|---|---|---|
| 10755 | Kirby: A Dark & Stormy Knight | 9.645918 |
| 15833 | Paternal Instinct | 7.712654 |
| 15355 | Last Hero In China | 7.689984 |
| 14902 | Warren Miller's: Ride | 7.624472 |
| 2082 | Blood Alley | 7.317524 |
| ... | ... | ... |
| 463 | The Return of Ruben Blades | -6.037189 |
| 12923 | Where the Red Fern Grows 2 | -6.153577 |
| 7067 | Eric Idle's Personal Best | -6.441100 |
| 538 | Rumpole of the Bailey: Series 4 | -6.740144 |
| 4331 | Sugar: Howling of Angel | -7.015818 |
17769 rows × 2 columns
Note that the expected ratings are not between 0 and 5, but can take on any value (in particular non-integer ones). This is not necessarily a problem, because we only care about the relative rating of the movies.
To me, all these movies sound quite obscure. And this makes sense: the model does not take factors such as a movie’s popularity into account. It also ignores a lot of other data we may know about the user, such as their age, gender, and location. It ignores when each movie was released, and it doesn’t take into account the dates of the users’ ratings. These are all important factors that could significantly improve the quality of the recommendation system.
We could try to modify our matrix completion model to take these factors into account, but it’s not obvious how. There is no need to, however: we can use the matrices \(A\) and \(B\) to augment whatever data we have about the movies and the users, and then train a new model on top of that to create something even better.
We can think of the movies as lying in a really high-dimensional space, and of the matrix \(A\) as mapping this space onto a much smaller one. The same is true for \(B\) and the ‘space’ of users. We can then use this embedding into a lower-dimensional space as the input of another model.
Unfortunately we don’t have access to more information about the users (due to obvious privacy concerns), so this is difficult to demonstrate. But the point is this: the decomposition \(X\approx AB\) is both interpretable, and can be used as a building block for more advanced machine learning models.
In summary we have seen that low-rank matrix decompositions have many useful applications in machine learning. They are powerful because they can be learned using relatively little data, and have the ability to complete missing data. Unlike many other machine learning models, computing low-rank matrix decompositions of data can be done quickly.
Even though they come with some limitations, they can always be used as a building block for more advanced machine learning models, because they give an interpretable, low-dimensional representation of very high-dimensional data. We also didn’t come close to discussing all their applications, or the algorithms used to find and optimize them.
In the next post I will look at a generalization of low-rank matrix decompositions: tensor decompositions. While more complicated, these decompositions are even more powerful at reducing the dimensionality of very high-dimensional data.
Keeping a web, Word, and PDF version all up-to-date and easy to edit seemed like an annoying task. I have plenty of experience with automatically generating PDF documents using LaTeX and Python, so I figured: why should a Word document be any different? Let’s dive into the world of editing Word documents in Python!
Fortunately there is a library for this: python-docx. It can be used to create Word documents from scratch, but stylizing a document that way is a bit tricky. Instead, its real power lies in editing pre-made documents. I went ahead and made a nice-looking CV in Word; now let’s open this document in python-docx. A Word document is stored as XML under the hood, and a document can have a complicated tree structure. However, we can load a document and use the .paragraphs attribute to get a complete list of all the paragraphs in the document. Let’s take a paragraph and print its text content.
from docx import Document
document = Document("resume.docx")
paragraph = document.paragraphs[0]
print(paragraph.text)
Rik Voorhaar
It turns out the first paragraph contains my name! Editing this text is very easy; we just need to assign a new value to the .text attribute. Let’s do this and save the document.
paragraph.text = "Willem Hendrik"
document.save("resume_edited.docx")
Below is a picture of the resulting change; it unfortunately seems like two additional things happened when editing this paragraph: the font of the edited paragraph changed, and the bar / text box on the right-hand side disappeared completely!
This is no good, but to understand what happened to the text box we need to dig into the XML of the document. We can turn the document into an XML file like so:
document = Document("resume.docx")
with open('resume.xml', 'w') as f:
    f.write(document._element.xml)
It seems the problem was that the text box on the right was nested inside another object, which is apparently not handled properly. This issue was easy to fix by modifying the Word document. However, the bar on the right consists of two text boxes, and the top box with my contact information still disappears if I change the first paragraph. It does not disappear if I change the second paragraph; it only happens when I change paragraph 1 or 3 (the latter being empty). I tried inserting two paragraphs before this particular paragraph, and changing its style, but the issue remains.
Looking at the XML, the issue is clear: the text box element lies nested inside this paragraph! It turned out to be a bit tricky to avoid this, so for now let us try changing the second paragraph, swapping the word “Resume” for “Curriculum Vitae”.
document = Document("resume.docx")
paragraph = document.paragraphs[1]
print(paragraph.text)
paragraph.text = "Curriculum Vitae"
document.save("CV.docx")
Resume
If we do this there are no problems with text boxes disappearing, but unfortunately the style of the paragraph is still reset. Let’s have a look at how the XML changes when we edit this paragraph. Ignoring irrelevant information, before the change it looks like this:
<w:p>
<w:r>
<w:t>R</w:t>
</w:r>
<w:r>
<w:t>esume</w:t>
</w:r>
</w:p>
And afterwards it looks like this:
<w:p>
<w:r>
<w:t>Curriculum Vitae</w:t>
</w:r>
</w:p>
In Word, each paragraph (<p>) is split up into multiple runs (<r>). What we see here is that the paragraph originally consisted of two runs, and after modification it became a single run. However, the style information seems to be exactly the same in both cases, so I don’t understand why the style changes after modification. If I retype the word ‘Resume’ in the original Word document, the paragraph becomes a single run, but the style still changes after editing, and I still don’t see why when looking at the XML.
Looking at the source code of python-docx, I noticed that when we call paragraph.text = ..., the contents of the paragraph are deleted, and then a new run is added with the desired text. It is not clear to me where exactly the style information is stored, but either way there is a simple workaround: we can modify the text of an existing run in the paragraph, rather than clearing the entire paragraph and adding a new one. This in fact also works for editing the first paragraph, where before we had problems with disappearing text boxes:
document = Document("resume.docx")
with open('resume.xml', 'w') as f:
    f.write(document._element.xml)
# Change 'Rik Voorhaar' for 'Willem Hendrik Voorhaar'
paragraph = document.paragraphs[0]
run = paragraph.runs[1]
run.text = 'Willem Hendrik Voorhaar'
# Change 'Resume' for 'Curriculum Vitae'
paragraph = document.paragraphs[1]
run = paragraph.runs[0]
run.text = 'Curriculum Vitae'
document.save('CV.docx')
Doing this changes the text but leaves all the style information intact. Alright, now we know how to edit text. It’s trickier than one might expect, but it works!
Let’s say that next we want to edit the text box on the right-hand side of the document and add a skill to our list of skills. We’ve been diving into the inner workings of Word documents, so it’s fair to say we know how to use Microsoft Word. Let’s therefore add the skill “Microsoft Word” to the list.
To do this we first want to figure out in which paragraph this information is stored. We can do this by going through all the paragraphs in the document and looking for the text “Skills”.
import re
pattern = re.compile("Skills")
for p in document.paragraphs:
    if pattern.search(p.text):
        print("Found the paragraph!")
        break
else:
    print("Did not find the paragraph :(")
Did not find the paragraph :(
It seems there is unfortunately no matching paragraph! This is because the paragraph we want is inside a text box, and modifying text boxes is not supported in python-docx. This is a known issue, but instead of giving up I decided to add support for modifying text boxes to python-docx myself! It turned out not to be too difficult to implement, despite my limited knowledge of both the package and the inner structure of Word documents.
The first step is understanding how text boxes are encoded in the XML. It turns out that the structure is something like this:
<mc:AlternateContent>
  <mc:Choice Requires="wps">
    <w:drawing>
      <wp:anchor>
        <a:graphic>
          <a:graphicData>
            <wps:txbx>
              <w:txbxContent>
                ...
              </w:txbxContent>
            </wps:txbx>
          </a:graphicData>
        </a:graphic>
      </wp:anchor>
    </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict>
      <v:textbox>
        <w:txbxContent>
          ...
        </w:txbxContent>
      </v:textbox>
    </w:pict>
  </mc:Fallback>
</mc:AlternateContent>
The insides of the two <w:txbxContent> elements are exactly identical; the information is stored twice, probably for legacy reasons. A quick Google search reveals that wps is an XML namespace introduced in Office 2010, and WPS is short for Word Processing Shape. The text box is therefore stored twice to maintain backwards compatibility with older Word versions. I’m not sure many people still use Office 2007… Either way, this means that if we want to update the contents of the text box, we need to do it in two places.
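This double bookkeeping is easy to get wrong. As a toy illustration of what “update both copies” looks like (using only the standard library, with made-up namespace URIs instead of the real Office ones):

```python
import xml.etree.ElementTree as ET

# A toy element mimicking the structure above; the real namespace URIs
# are much longer, but the shape of the problem is the same
xml = """
<mc:AlternateContent xmlns:mc="urn:mc" xmlns:w="urn:w">
  <mc:Choice><w:txbxContent><w:t>Resume</w:t></w:txbxContent></mc:Choice>
  <mc:Fallback><w:txbxContent><w:t>Resume</w:t></w:txbxContent></mc:Fallback>
</mc:AlternateContent>
"""
ns = {"mc": "urn:mc", "w": "urn:w"}
root = ET.fromstring(xml)

# Edit BOTH <w:txbxContent> copies, or older Word versions see stale text
for t in root.findall(".//w:txbxContent/w:t", ns):
    t.text = "Curriculum Vitae"

print([t.text for t in root.findall(".//w:t", ns)])
```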
Next we need to figure out how to manipulate these Word objects. My idea is to create a TextBox class associated to an <mc:AlternateContent> element, which ensures that both <w:txbxContent> elements are always updated at the same time. First we make a class encoding a <w:txbxContent> element. For this we can build on the BlockItemContainer class already implemented in python-docx; mixing in this class gives automatic support for manipulating the paragraphs inside the container.
class TextBoxContent(BlockItemContainer):
    """A ``<w:txbxContent>`` element; behaves like a paragraph container."""
Given an <mc:AlternateContent> element, we can access the two <w:txbxContent> elements using the following XPath specifications:
XPATH_CHOICE = "./mc:Choice/w:drawing/wp:anchor/a:graphic/a:graphicData//wps:txbx/w:txbxContent"
XPATH_FALLBACK = "./mc:Fallback/w:pict//v:textbox/w:txbxContent"
Then making a rudimentary TextBox
class is very simple. We base it on the
ElementProxy
class in python-docx
. This class is meant for storing and
manipulating the children of an XML element.
class TextBox(ElementProxy):
    """Implements text boxes. Requires an ``<mc:AlternateContent>`` element."""

    def __init__(self, element, parent):
        super(TextBox, self).__init__(element, parent)
        try:
            (tbox1,) = element.xpath(XPATH_CHOICE)
            (tbox2,) = element.xpath(XPATH_FALLBACK)
        except ValueError:
            raise ValueError(
                "This element is not a text box; it should contain precisely "
                "two ``<w:txbxContent>`` objects"
            )
        self.tbox1 = TextBoxContent(tbox1, self)
        self.tbox2 = TextBoxContent(tbox2, self)
So far this is just good for storing the text box; we still need some code to actually manipulate it. It would also be great to have a way to find all the text boxes in a document. This is as simple as finding all the <mc:AlternateContent> elements with precisely two <w:txbxContent> elements. We can use the following function:
def find_textboxes(element, parent):
    """
    List all text box objects in the document.

    Looks for all ``<mc:AlternateContent>`` elements, and selects those
    which contain a text box.
    """
    alt_cont_elems = element.xpath(".//mc:AlternateContent")
    text_boxes = []
    for elem in alt_cont_elems:
        tbox1 = elem.xpath(XPATH_CHOICE)
        tbox2 = elem.xpath(XPATH_FALLBACK)
        if len(tbox1) == 1 and len(tbox2) == 1:
            text_boxes.append(TextBox(elem, parent))
    return text_boxes
We then update the Document class with a new textboxes attribute:
@property
def textboxes(self):
    """List all text box objects in the document."""
    return find_textboxes(self._element, self)
Now let’s test this out:
document = Document("resume.docx")
document.textboxes
[<docx.oxml.textbox.TextBox at 0x7faf395c3bc0>,
<docx.oxml.textbox.TextBox at 0x7faf395c3100>]
Now, to manipulate the “Skills” section as we initially wanted, we first find the right paragraph. Since the two <w:txbxContent> objects have the same paragraphs, we need to find both the text box containing the text and the index of the matching paragraph:
import re

def find_paragraph(pattern):
    for textbox in document.textboxes:
        for i, p in enumerate(textbox.paragraphs):
            if pattern.search(p.text):
                return textbox, i
pattern = re.compile("Skills")
textbox, i = find_paragraph(pattern)
print(textbox.paragraphs[i].text)
Skills
Now, to insert a new skill, we need to create a new paragraph with the text “Microsoft Word”. For this we can find the paragraph right after it, and use that paragraph’s insert_paragraph_before method with the appropriate text and style information. The paragraph in question is the one containing the word “Research”. I wanted to copy the style of this paragraph to the new paragraph, but for some reason the style information is empty for this paragraph. However, I know that its style should be 'Skillsentries', so I can just use that directly.
style = document.styles['Skillsentries']
pattern = re.compile("Research")
textbox, i = find_paragraph(pattern)
p1 = textbox.tbox1.paragraphs[i]
p2 = textbox.tbox2.paragraphs[i]
for p in (p1, p2):
    p.insert_paragraph_before("Microsoft Word", style)
document.save("CV.docx")
When now opening the Word document, we see the item “Microsoft Word” in my list of skills, with the right style and everything. I did cheat a little; I needed to make some additional technical changes to the code for this all to work, but the details are not super important. If you want to use this feature, you can use my fork of python-docx. My solution is still a little hacky, so I don’t think it will be added to the main repository, but it does work fine for my purposes.
In summary, we can use Python to edit Word documents. However, the python-docx package is not fully mature, and using it to edit highly-stylized Word documents is a bit painful (but possible!). It is, however, quite easy to extend with new functionality in case you need to. On the other hand, Visual Basic has quite extensive functionality for editing Word documents, and the whole Word API is built around it.
While I now have all the tools available to automatically update my CV using Python, I will actually refrain from doing it. It is a lot of work to set up properly, and needs active maintenance every time I want to change the styling of my CV. It’s probably a better idea to just manually edit it every time I need to. Automation isn’t always worth it. But I wouldn’t be surprised if this newfound skill turns out to be useful for me at some point in the future.
What constitutes ‘realistic’ blur obviously depends on context, but in the case of taking pictures with a hand-held camera or smartphone, it includes both motion blur and a form of lens blur. Generating lens blur is easy; we can just use a Gaussian blur. For motion blur we previously looked only at straight lines, but this isn’t very realistic. Natural motion is rarely just in a straight line, but is more erratic.
To model this we can take inspiration from physical processes such as Brownian motion: we can model motion blur as the path taken by a particle with an initial velocity, which is constantly perturbed during the motion. We want to add Gaussian blur on top of that, which can simply be done by taking the image of such a path and convolving it with a Gaussian point spread function. However, we should also take into account the speed of the particle; if we move a camera very fast, then the camera spends less exposure time at any particular point. Therefore we should make the intensity of the blur inversely proportional to the speed at any point. The end result looks something like this:
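As a sketch of this idea (not the exact code behind the figures; the step size, perturbation strength and other parameters are my own choices), such a kernel could be generated like this:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_motion_kernel(size=15, n_steps=200, dt=0.05, perturb=0.3,
                         lens_sigma=0.5, seed=0):
    """Blur kernel from a randomly perturbed particle path, with intensity
    inversely proportional to speed, plus Gaussian lens blur on top."""
    rng = np.random.default_rng(seed)
    pos = np.zeros(2)
    vel = rng.normal(size=2)                     # initial velocity
    kernel = np.zeros((size, size))
    for _ in range(n_steps):
        i, j = np.floor(pos + size / 2).astype(int)
        if 0 <= i < size and 0 <= j < size:
            speed = np.linalg.norm(vel)
            kernel[i, j] += 1.0 / max(speed, 1e-3)   # fast motion -> less exposure
        vel += perturb * rng.normal(size=2)          # Brownian-like perturbation
        pos += dt * vel
    kernel = gaussian_filter(kernel, lens_sigma)     # lens blur on top
    return kernel / kernel.sum()
```

Convolving an image with such a kernel then produces the combined motion-plus-lens blur.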
In practice we will consider this kind of blur at a much smaller resolution, for example of size 15x15. Below we show how such a kernel affects, for example, the St. Vitus image.
Recall that in the Richardson-Lucy algorithm we try to solve the deconvolution problem \(y=x*k\) by using an iteration of form
\[x_{i+1} = x_i\odot \left(\frac{y}{x_i*k}*k^*\right)\]This method is completely symmetric in \(k\) and \(x\), so given an estimate \(x_i\) of \(x\) we can recover the kernel \(k\) by the same method:
\[k_{j+1} = k_j\odot \left(\frac{y}{x*k_j}*x^*\right)\]A simple idea for blind deconvolution is therefore to alternatingly estimate \(k\) from \(x\) and vice-versa. We can see the result of this procedure below:
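A minimal sketch of this alternating scheme, using FFT-based convolution and (for simplicity) embedding the kernel in an image-sized array; the initialization and iteration counts are my own choices:

```python
import numpy as np
from scipy.signal import fftconvolve

def rl_step(estimate, fixed, y):
    """One Richardson-Lucy update: estimate <- estimate ⊙ ((y / (estimate*fixed)) * fixed^*),
    where fixed^* is the flipped array. All arrays share the image's shape."""
    conv = fftconvolve(estimate, fixed, mode='same')
    ratio = y / np.maximum(conv, 1e-12)
    return estimate * fftconvolve(ratio, fixed[::-1, ::-1], mode='same')

def blind_rl(y, kernel_size=15, n_outer=10, n_inner=2):
    """Alternately estimate the kernel and the image. The kernel lives in an
    image-sized array, initialized as a flat centered block."""
    x = y.copy()
    k = np.zeros_like(y)
    ci, cj = y.shape[0] // 2, y.shape[1] // 2
    h = kernel_size // 2
    k[ci - h:ci + h + 1, cj - h:cj + h + 1] = 1.0
    k /= k.sum()
    for _ in range(n_outer):
        for _ in range(n_inner):
            k = rl_step(k, x, y)          # kernel update, image fixed
        k = np.clip(k, 0, None)           # guard against FFT round-off
        k /= k.sum()
        for _ in range(n_inner):
            x = rl_step(x, k, y)          # image update, kernel fixed
    return x, k
```

Running this on a blurred image typically reproduces the failure mode discussed next: the kernel estimate drifts toward a (shifted) delta function.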
The problem with this Richardson-Lucy-based algorithm is that the point spread function tends to converge to a (shifted) delta function. This is an inherent problem with many blind deconvolution algorithms, especially those based on finding a maximum a posteriori (MAP) estimate of the kernel and image combined. For this particular algorithm it isn’t immediately obvious why this happens, since its analysis is relatively complicated; somehow the kernel update step tends to promote sparsity. This happens irrespective of how we initialize the point spread function, or of the relative number of steps spent estimating the PSF versus the image.
There are heuristic ways to get around this, but overall it is difficult to make a technique like this work well. It also doesn’t use the wonderful things we learned about image priors in part 2. We need a method that can actively avoid converging to extreme points such as this delta function.
In part 2 we discussed different image priors, of which the most promising prior is based on non-local self-similarity. This assigns to an image \(x\) a score \(L(x)\) signifying how ‘natural’ this image is. We saw that it indeed gave higher scores for images that are appropriately sharpened. A simple idea is then to try different point spread functions, and use the one with the highest score. If we denote \(x(k)\) the result of applying deconvolution with kernel \(k\), then we want to solve the maximization problem: \(\max_{k}L(x(k))\)
If we naively try to maximize this function, we run into the problem that the space of all kernels is quite large; a \(15\times 15\) kernel needs \(15^2=225\) parameters. Since computing the image prior is relatively expensive (as is the deconvolution), exploring this large space is not feasible. Moreover, the function is relatively noisy, and has the problem that it can give large scores to oversharpened images.
We therefore need a way to describe the point spread functions using only a few parameters. Moreover, this description should actively avoid points that are not interesting, such as a delta function or a point spread function that would result in heavy oversharpening of the image.
There are many ways to describe a point spread function using only a couple of parameters. One way that I propose is to write it as a sum of a small number of Gaussian point spread functions. However, instead of the centered, symmetric Gaussians we have considered so far, we will allow an arbitrary mean and covariance matrix, which change respectively the center and the shape of the point spread function. That is, the kernel depends on the parameters \(\mu=(\mu_1,\mu_2)\) and a 2x2 (symmetric, positive definite) matrix \(\Sigma\). The point spread function is then given by
\[k[i,j]\propto \exp\left(-\tfrac{1}{2}(i-\mu_1,j-\mu_2)\Sigma^{-1}(i-\mu_1,j-\mu_2)^\top\right),\qquad\sum_{i,j}k[i,j]=1\]To be precise, we can describe the covariance matrix \(\Sigma\) using three parameters \(\lambda_1,\lambda_2>0\) and \(\theta\in[0,\pi)\) using the decomposition
\[\Sigma = \begin{pmatrix}\cos\theta &\sin\theta\\-\sin\theta&\cos\theta\end{pmatrix} \begin{pmatrix}\lambda_1&0\\0&\lambda_2\end{pmatrix} \begin{pmatrix}\cos\theta &-\sin\theta\\\sin\theta&\cos\theta\end{pmatrix}\]We then use an additional magnitude parameter \(t_i>0\) per kernel to combine different kernels of this type, taking the mixture \(t_1k_1+t_2k_2+\dots+t_nk_n\) (normalized to sum to 1).
This gives a total of 6 parameters per mixture component, but for the first component we can fix the mean \(\mu\) to \(0\) and the magnitude \(t_1\) to 1, reducing it to 3 parameters. For now we will use just two mixture components (\(n=2\)), and focus our attention on how to optimize this.
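A sketch of this parameterization (the grid layout and normalization details are my own assumptions):

```python
import numpy as np

def gaussian_psf(size, mu, lam1, lam2, theta):
    """Anisotropic Gaussian PSF with mean `mu` (offset from the kernel center)
    and covariance R(theta) diag(lam1, lam2) R(theta)^T, as in the text."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, s], [-s, c]])
    Sigma = R @ np.diag([lam1, lam2]) @ R.T
    inv = np.linalg.inv(Sigma)
    grid = np.stack(np.meshgrid(np.arange(size), np.arange(size), indexing='ij'), -1)
    d = grid - (size // 2 + np.asarray(mu))          # offsets from shifted center
    k = np.exp(-0.5 * np.einsum('ijk,kl,ijl->ij', d, inv, d))
    return k / k.sum()

def mixture_psf(size, components):
    """components: list of (t, mu, lam1, lam2, theta), 6 parameters each;
    the first component can fix t=1 and mu=(0, 0)."""
    k = sum(t * gaussian_psf(size, mu, l1, l2, th)
            for t, mu, l1, l2, th in components)
    return k / k.sum()
```

For example, `mixture_psf(15, [(1.0, (0, 0), 4.0, 1.0, 0.3), (0.5, (2, -1), 2.0, 2.0, 0.0)])` builds a 15x15 two-component kernel from 9 effective parameters.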
We now know how to parameterize the point spread functions, and what function we want to optimize (the image prior). Next is deciding how to optimize this. In this case we have a complicated, noisy function that is expensive to compute, and with no easy way to compute its derivatives. In situations like these Bayesian optimization or other methods of ‘black-box’ optimization make the most sense.
How this works is that we sample our function \(L\colon \Omega\to \mathbb R\) in several points \((z_1,\dots,z_n)\in\Omega\), where \(\Omega\) is our parameter searchspace. Based on these samples, we build a surrogate model \(\widetilde L\colon\Omega\to \mathbb R\) for the function \(L\). We can then optimize the surrogate model \(\widetilde L\) to obtain a new point \(z_{n+1}\). We then compute \(L(z_{n+1})\), and update the surrogate model with this new information. This is repeated a number of times, or until convergence. So long as the surrogate model is good, this can find an optimal point of the function \(L\) of interest much faster than many other optimization methods.
The key property of this surrogate model is that it should be easy to compute, yet still model the true function reasonably well. In addition, we want to incorporate uncertainty into the surrogate model. Uncertainty enters in two ways: the function \(L\) may be noisy, and the surrogate model will be more accurate closer to previously evaluated points. This leads to Bayesian optimization. The surrogate model is probabilistic in nature, and during optimization we can sample points both to reduce the variance (explore regions where the model is unsure), and to reduce the expectation (explore regions of the searchspace where the model thinks the optimal point should lie).
One type of surrogate model that is popular for this purpose is the Gaussian process (GP) (also known as ‘kriging’ in this context). We will give a brief description of Gaussian processes. We model the function values of the surrogate model \(\widetilde L\) as random variables. More specifically we model the function value at a point \(z\) to depend on the samples:
\[\widetilde L(z) | z_1,\dots,z_n \sim N(\mu,\sigma^2),\]where the mean \(\mu\) is a weighted average of the values at the sampled points \((z_1,\dots, z_n)\), weighted according to the distances \(\|z-z_i\|\). The covariance structure is determined by a function \(K(z,z') = K(\|z-z'\|)\), which depends only on the distance between two points and decreases as the points get more distant. At the sampled points \((z_1,\dots,z_n)\) we know the function value to high accuracy, so the posterior variance \(\sigma^2\) is small there, but as we move further away from all of the sampled points the variance increases.
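A minimal one-dimensional sketch of these formulas, using a squared-exponential covariance \(K(z,z')=\exp(-\|z-z'\|^2/2\ell^2)\) (a common choice; the exact covariance function is an assumption here):

```python
import numpy as np

def gp_posterior(z_train, y_train, z_query, length=1.0, noise=1e-6):
    """Posterior mean and variance of a GP surrogate at the query points,
    conditioned on noisy function evaluations at the training points."""
    def K(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-d**2 / (2 * length**2))
    Ktt = K(z_train, z_train) + noise * np.eye(len(z_train))
    Kqt = K(z_query, z_train)
    mean = Kqt @ np.linalg.solve(Ktt, y_train)       # weighted average of samples
    # prior variance K(0)=1 minus the reduction from nearby observations
    var = 1.0 - np.einsum('ij,ji->i', Kqt, np.linalg.solve(Ktt, Kqt.T))
    return mean, var
```

The variance is (near) zero at sampled points and approaches the prior variance far away from all samples, which is exactly the behavior a Bayesian optimizer exploits.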
Because of the specific structure of the Gaussian process model, it is easy to fit to data and make predictions at new points. As a result, an optimal value for this surrogate model is easy to compute. We will use an implementation of GP-based Bayesian optimization from scikit-optimize. All in all this gives us the results shown below.
As we can see, the estimated point spread function is still far from perfect, but nevertheless the deblurred image looks better than the blurred image. If we blur the image with larger kernels, or stronger blur overall recovery becomes even harder with this method. If we apply it to a different image the result is comparable. One problem that is apparent is the fact that the point spread function tends to shift the image. This can fortunately be corrected, by either changing the point spread function or shifting the image after deconvolution.
There are probably several reasons why this model doesn’t give perfect results. First, the image prior isn’t perfect: most image priors seem to give quite noisy outputs, or give high scores due to artifacts created by the deconvolution algorithm. Second, the parameter space of this model is still quite big, especially if the prior function depends in complicated ways on these parameters. That said, many methods in the literature use even larger searchspaces for the kernels, some using no compression of the searchspace at all, and still claim good results.
While I knew from the get-go that blind deconvolution is hard, it turned out to be even harder to do right than I expected. I read a lot of literature on the subject, and I learned a lot. Many papers give interesting algorithms and ideas for blind deconvolution methods. What I found, however, is that most papers were quite vague in their descriptions and almost never included code. This makes doing research in this field quite difficult, since it can be very hard to estimate whether or not a method is actually useful. Moreover, if a method does look promising, implementing it can become very difficult without adequate details.
We will explore two methods to improve the deconvolution method. First is a simple modification to our current method, and second is a more expensive iterative method for deconvolution that works better for sparse kernels.
Recall that deconvolution comes down to solving the equation \(y = k*x,\) where \(y\) is the observed (blurred) image, \(k\) is the point-spread function, and \(x\) is the unobserved (sharp) image. If we take a discrete Fourier transform (DFT), then this equation becomes \(Y = K\odot X,\) where capital letters denote the Fourier-transformed variables, and \(\odot\) is the pointwise multiplication. To solve the deconvolution problem we can then do pointwise division by \(K\) and apply the inverse Fourier transform. Because \(K\) may have zero or near-zero entries, we can run into numerical instability. A quick fix is to instead multiply by \(K^* / (|K|^2+\epsilon)\), giving solution
\[x = \mathcal F^{-1}\left(Y \odot \frac{K^*}{|K|^2+\epsilon}\right)\]This is fast to compute, and gives decent results. This simple method of deconvolution is known as the Wiener filter. In the situation where there is some noise \(n\) such that \(y=k*x+n\), this corresponds (for a certain value of \(\epsilon\)) to the \(x^*\) minimizing the expected square error \(E(\|x-x^*\|^2)\). Instead of minimizing the error, we can accept an error \(\|k*x^*-y\|^2\approx \|n\|^2\), and then find the smoothest \(x^*\) with that error, to avoid ringing artifacts. Smoothness can be modeled by the Laplacian \(\Delta x^*\). This leads to the problem
\[\begin{array}{ll} \text{minimize} & \|\Delta x\|^2 \\ \text{subject to} & \|k*x-y\|^2\leq \|n\|^2 \end{array}\]If \(L\) is the Fourier transform of the Laplacian kernel, then the solution to this problem has the form
\[x = \mathcal F^{-1}\left(Y \odot \frac{K^*}{|K|^2+\gamma |L|^2}\right)\]where the parameter \(\gamma>0\) is determined by the noise level. In the end this is a simple modification to the Wiener filter, that should give less ringing effects. Let’s see what this does in practice.
In the picture above we tried to deblur a motion blur consisting of a diagonal strip of 10 pixels. The deblurring is done with a kernel of 9.6 pixels (the last pixel on either end is dimmed). We do this both with and without the Laplacian, with amounts of regularization chosen so that the two methods have a similar amount of ringing artifacts. The two methods look very similar, and if anything the method without the Laplacian may look a little sharper. The reason the methods behave so similarly is probably that the Fourier transform of the Laplacian (shown below) has a fairly spread-out distribution and is therefore not too different from the uniform distribution we use in the Wiener filter.
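Both variants of the filter fit in a few lines. This sketch assumes periodic boundaries and a kernel zero-padded at the origin (a different convention may shift the result):

```python
import numpy as np

def wiener_deconvolve(y, k, eps=1e-2, laplacian=False):
    """Wiener filter: x = F^{-1}(Y · K* / (|K|^2 + reg)), where reg is either
    a constant eps or the Laplacian term gamma|L|^2 from the text."""
    K = np.fft.fft2(k, s=y.shape)          # zero-pad the kernel to image size
    Y = np.fft.fft2(y)
    if laplacian:
        lap = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float)
        L = np.fft.fft2(lap, s=y.shape)
        reg = eps * np.abs(L)**2           # eps plays the role of gamma
    else:
        reg = eps
    X = Y * np.conj(K) / (np.abs(K)**2 + reg)
    return np.real(np.fft.ifft2(X))
```

With noiseless periodic blur and a tiny `eps` this inverts the blur almost exactly; with real noise, `eps` trades sharpness against ringing.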
There are many iterative deconvolution methods; one often-used method in particular is Richardson-Lucy deconvolution. The iteration step is given by
\[x_{k+1} = x_k\odot \left(\frac{y}{x_k*k}*k^*\right)\]Here \(k^*\) is the flipped point spread function, its Fourier transform is the complex conjugate of the Fourier transform of \(k\). As first iteration we typically pick \(x_0=y\). Note that if \(\sum_{i,j}k_{ij} = 1\), then \(\mathbf 1*k = \mathbf 1\), with \(\mathbf 1\) a constant 1 signal. Therefore if we plug in \(x_k = \lambda x\) we obtain
\[x_{k+1} = x_k \odot \left(\frac{y}{\lambda y}*k^*\right) = x_k\odot \frac1\lambda \mathbf 1*k^* = \frac{x_k}\lambda\]This both shows that \(x\) is a fixed point of the Richardson-Lucy algorithm, and at the same time it shows that the algorithm doesn’t necessarily converge, since it could alternate between \(2x\) and \(x/2\) for example. In practice on natural images, if initialized with \(x_0=y\), it does seem to converge. Below we try this algorithm for different numbers of iterations, considering the same image and point spread function as before.
We see very similar ringing artifacts as with the Wiener filter. The number of iterations of the algorithm is related to the size of the regularization constant. The more iterations, the sharper the image is, but also the more pronounced the ringing artifacts are.
Like with the Wiener filter, we need to add a small positive constant when dividing, to avoid division-by-zero errors. Unlike the Wiener filter however, Richardson-Lucy deconvolution is very insensitive to the amount of regularization used.
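A minimal sketch of the iteration using scipy's fftconvolve (the small clipping constant is my own guard against division by zero):

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(y, k, n_iter=100, eps=1e-12):
    """Richardson-Lucy deconvolution: x <- x ⊙ ((y / (x * k)) * k^*),
    initialized with x_0 = y."""
    x = y.copy()
    k_flip = k[::-1, ::-1]                     # k^*: the flipped PSF
    for _ in range(n_iter):
        conv = fftconvolve(x, k, mode='same')
        x *= fftconvolve(y / np.maximum(conv, eps), k_flip, mode='same')
    return x
```

Each iteration costs two FFT convolutions, which is where the roughly one-Wiener-filter-per-iteration cost mentioned below comes from.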
Richardson-Lucy deconvolution is much slower than the Wiener filter, requiring perhaps 100 iterations to reach a good result. Each iteration takes roughly as long as applying the Wiener filter. Fortunately the algorithm is easy to implement on a GPU, and each iteration on the (426, 640) image above takes only about 1ms on my computer with a simple GPU implementation using cupy.
One issue that I have so far swept under the rug is the problem of boundary effects. If we convolve an \((n,m)\) image by a \((\ell,\ell)\) kernel, then the result is an image of size \((n+\ell-1, m+\ell-1)\), and not \((n,m)\). There is typically a ‘fuzzy border’ around the image, which we crop away when displaying, but not when deconvolving. In real life we don’t have the luxury of including this fuzzy border around the image, and this can lead to heavy artifacts when deconvolving an image. Below is the St. Vitus church image blurred with \(\sigma=3\) Gaussian blur, and subsequently deblurred using a Wiener filter with and without using the border around the image.
The ringing at the boundary is known as Gibbs oscillation. The reason it occurs is because the deconvolution method implicitly assumes the image is periodic. This is because the convolution theorem (stating that convolution becomes multiplication after a (discrete) Fourier transform) needs the assumption that the signal is periodic. If we would periodically stack a natural image we would find a sudden sharp transition at the boundary, and this contributes to high-frequency components in the Fourier transform, giving the sharp oscillations at the boundary.
The more we regularize the deconvolution, the smaller the boundary effects. This is because regularization essentially acts as a low-pass filter, getting rid of high-frequency effects. However, this also blurs the image considerably. For Richardson-Lucy deconvolution we essentially have the same problem.
The straightforward way to deal with this problem is to extend the image to mimic the ‘fuzzy’ border introduced by convolution. Or better yet, we should pad the image in such a way that the image is as regular as possible when stacked periodically. This is the strategy employed by Liu and Jia: they extend the image to be periodic by using three different ‘tiles’ stacked in the pattern shown below. The image is then cropped to the dotted line, and this gives a periodic image. The tiles are optimized such that the image is continuous along each boundary, and such that the total Laplacian is minimized.
There are many similar methods in the literature. Unfortunately, all of these methods are complicated, and very few include a reference implementation. If there is one, it is almost always in Matlab. This seems to be a general problem when reading literature about (de)convolution and image processing; for some reason it is not standard practice in this scientific community to include code with papers, and descriptions of algorithms are often vague and require significant work to translate to working code. I found a Python implementation of Liu-Jia’s algorithm at this github.
Below we see the Laplacian of the image extended using Liu-Jia’s method, using zero padding, and by reflecting the image. We see that both in the reflected image and in the one using Liu-Jia’s method there are no large values of the Laplacian around the border, because of the soft transition at the border.
Next we can check whether these periodic extensions of the images actually reduce boundary artifacts when deconvolving. Below we see the three methods for both the Wiener and Richardson-Lucy (RL) deconvolution in action on an image distorted with \(\sigma=3\) Gaussian blur.
We can see that Liu-Jia’s method gives a significant improvement, especially for the Wiener filter. More strikingly, the reflective padding works even better. This is because the convolution that distorted the image implicitly used reflective padding as well. If you change the settings of the convolution blurring the image, then the results will not be as good. Liu-Jia’s method probably works the best out-of-the-box on images blurred by natural means.
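The reflective-padding approach can be sketched as a generic wrapper (the `deconvolve` argument is a hypothetical callable, e.g. a Wiener filter; the pad width is an arbitrary choice here):

```python
import numpy as np

def deconvolve_padded(y, k, deconvolve, pad=32):
    """Suppress boundary ringing: pad the blurred image by reflection,
    deconvolve with any method, then crop back to the original size."""
    y_pad = np.pad(y, pad, mode='reflect')   # smooth periodic-ish extension
    x_pad = deconvolve(y_pad, k)
    return x_pad[pad:-pad, pad:-pad]
```

The pad width should be at least on the order of the kernel size, so that the artificial border lies outside the cropped result.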
It is interesting to note that Richardson-Lucy deconvolution suffers heavily in quality regardless of padding method. Interestingly, if we look at motion blur instead of Gaussian blur, the roles are a bit reversed. For the Wiener filter we have to use fairly aggressive regularization to not get too many artifacts, whereas RL deconvolution works without problems.
We have reiterated the fact that even non-blind deconvolution can be a difficult problem. The relatively simple Wiener filter in general does a good job, and changing it to use a Laplacian for regularization doesn’t seem to help much. The Richardson-Lucy algorithm often performs comparably to the Wiener filter, although it seems to perform relatively better for sparse kernels like the motion blur kernel we used.
Before, we had completely ignored boundary problems, which is not something we can do with real images. Fortunately, we can deal with these issues by appropriately padding the image. Simply using reflections of the image for padding works quite well, though how well depends on how the image was blurred in the first place. Extending the image to be periodic while minimizing the Laplacian is more complicated, but also works well, and probably performs better on natural images.
In the next part (and hopefully final part) we will dive into some simple approaches for blind deconvolution. Starting off with a modification of the Richardson-Lucy algorithm, and then trying to use what we learned about image priors in part 2.
The next step is then to try to do deconvolution if we have partial information about how the image was distorted. For example, we know that a lens is out of focus, but we don’t know exactly by how much. In that case we have only one variable to control, a scalar amount of blur (or perhaps two if the amount of blur is different in different directions). In this case we can simply try deconvolution for a few values, and see which image seems most natural.
Below we have the image of the St. Vitus church in my hometown distorted with Gaussian blur with \(\sigma=2\), and then deblurred with several different values of \(\sigma\). Looking at these images we can see that \(\sigma=2.05\) and \(\sigma=2.29\) look best, and \(\sigma=2.53\) is over-sharpened. The real challenge lies in finding some concrete metric to automatically decide which of these looks most natural. This is especially hard since even to the human eye this is not clear. The fact that \(\sigma=2.29\) looks very good probably means that the original image wasn’t completely sharp itself, and we don’t have a good ground truth of what it means for an image to be perfectly sharp.
Measures of naturality of an image are often called image priors. They can be used to define a prior distribution on the space of all images, giving higher probability to images that are natural over those that are unnatural. Often image priors are based on heuristics, and different applications need different priors.
Many simple but effective image priors rely on the observation that most images have a sparse gradient distribution. An edge in an image is a sharp transition. The gradient of an image measures how fast the image is changing at every point, so an edge is a region in the image where the gradient is large. The gradient of an image can be computed by convolution with different kernels. One such kernel is the Sobel kernel:
\[S_x = \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix}, \quad S_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}\]Here convolution with \(S_x\) gives the gradient in the horizontal direction, and it is large when encountering a vertical edge, since the image is then making a fast transition in the horizontal direction. Similarly \(S_y\) gives the gradient in the vertical direction. If \(X\) is our image of interest, we can then define the gradient transformation of \(X\) by
\[|\nabla X| = \sqrt{(S_x * X)^2+(S_y * X)^2}\]Below we can see this gradient transformation in action on the six images shown above:
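This transformation is straightforward to compute, for example with scipy.ndimage (note that \(S_y\) is just the transpose of \(S_x\)):

```python
import numpy as np
from scipy.ndimage import convolve

def gradient_magnitude(x):
    """|∇X| via the Sobel kernels S_x and S_y from the text."""
    sx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], float)
    gx = convolve(x, sx)        # horizontal gradient (large at vertical edges)
    gy = convolve(x, sx.T)      # vertical gradient (S_y = S_x transposed)
    return np.sqrt(gx**2 + gy**2)
```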
Here we can see that the gradients become larger in magnitude as \(\sigma\) increases. For \(\sigma = 2.47\) we see that a large part of the image is detected as gradient – edges stopped being sparse at this point. For the first four images we see that the edges are sparse, with most of the image consisting of slow transitions.
Below we look at the distribution of the gradients after deconvolution with different values of \(\sigma\). We see that the distribution stays mostly constant, slowly increasing in overall magnitude. But near \(\sigma=2\), the overall magnitude of the gradients suddenly increases sharply.
This suggests that to find the optimal value of \(\sigma\) we can look at these curves and pick the value of \(\sigma\) where the gradient magnitude starts to increase quickly. This is however not very precise, and ideally we have some function which has a minimum near the optimal value of \(\sigma\). Furthermore this curve will look slightly different for different images. This is a good starting point for an image prior, but is not useful yet.
Instead of using the gradient to obtain the edges in the image, we can use the Laplacian. The gradient \(|\nabla X|\) is the first derivative of the image, whereas the Laplacian \(\Delta X\) is given by the sum of second partial derivatives of the image. Near an edge we don’t just expect the gradient to be big, but we also expect the gradient to change fast. This is because edges are usually transient, and not extended throughout space.
We can compute the Laplacian by convolving with the following kernel:
\[\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}\]Note that the Laplacian can take on both negative and positive values, unlike the absolute gradient transform we used before. Below we show the absolute value of the Laplacian transformed images. This looks similar to the absolute gradient, except that the increase in intensity with increasing \(\sigma\) is more pronounced.
Above we can see that there is an overall increase in the magnitude of the gradients and Laplacian as \(\sigma\) increases. We want to measure how sparse these gradient distributions are, and this has more to do with the shape of the distribution rather than the overall magnitude. To better see how the shape changes it therefore makes sense to normalize so that the total magnitude stays the same. We therefore don’t consider the distribution of the gradient \(|\nabla X|\), but rather of the normalized gradient \(|\nabla X| / \|\nabla X\|_2\). Since the mean absolute value is essentially the \(\ell_1\)-norm, this is also referred to as the \(\ell_1/\ell_2\)-norm of the gradients \(\nabla X\).
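This measure reduces to a one-liner on a precomputed gradient image (the exact normalization used for the plots is not given, so this plain \(\ell_1/\ell_2\) ratio is my own reading of it):

```python
import numpy as np

def l1_l2(grad):
    """ℓ1/ℓ2 sparsity measure of a gradient image |∇X|: the ℓ1-norm of the
    normalized gradient |∇X| / ||∇X||_2. Smaller values mean sparser edges."""
    g = np.abs(np.asarray(grad, float)).ravel()
    return g.sum() / np.linalg.norm(g)
```

A single-spike signal gives the minimum value 1, while a uniformly dense signal of \(N\) entries gives the maximum \(\sqrt N\), matching the intuition that sparse gradients score lower.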
The normalized gradient distribution is plotted below as function of \(\sigma\), the distributions of the Laplacian look similar. This distribution already looks a lot more promising since the median has a minimum near the optimal value for \(\sigma\). This minimum is a passable estimate of the optimal value of \(\sigma\) for this particular image. For other images it is however not as good. Moreover the function only changes slowly around the minimum value, so it is hard to find in an optimization routine. We therefore need to come up with something better.
The \(\ell_1/\ell_2\) prior is a good starting point, but we can do better with a more complex prior based on non-local self-similarity. The idea is to divide the image up in many small patches of \(n\times n\) pixels with for example \(n=5\). Then for each patch we can check how many other patches in the image look similar to it. This concept is called non-local self-similarity, since it’s non-local (we compare a patch with patches throughout the entire image, not just in a neighborhood) and uses self-similarity (we look at how similar some parts of the image are to other parts of the same image; we never use an external database of images for example).
The full idea is a bit more complicated. Let’s denote each \(n\times n\) patch by
\[P(i,j) = X[ni:n(i+1),\, nj:n(j+1)].\]We consider this patch as a length-\(n^2\) vector. Moreover since we’re mostly interested in the patterns represented by the patch, and not by the overall brightness, we normalize all the patch vectors to have norm 1. We then find the closest matching \(k\) patches, minimizing the Euclidean distance:
\[\operatorname{argmin}_{i',j'} \|P(i,j) - P(i',j')\|\]Below we show an 8x8 patch in the St. Vitus image (top left) together with its 11 closest neighbors.
Note that we look at patches that are closest in Euclidean distance; this does not necessarily mean the patches are visually similar. Visually very similar patches can have large Euclidean distance; for example, the two patches below are orthogonal (and hence have maximal Euclidean distance), despite being visually similar. One could come up with better measures of visual similarity than Euclidean distance, probably something that is invariant under small shifts, rotations and mirroring, but this would come at an obvious cost of increased (computational) complexity.
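A brute-force sketch of the patch extraction and neighbor search (non-overlapping patches and plain Euclidean distance; efficient index-based search is discussed at the end of this post):

```python
import numpy as np

def patch_vectors(x, n=8):
    """Cut the image into non-overlapping n×n patches, flattened to
    length-n² vectors and normalized to unit norm."""
    h, w = x.shape
    patches = [x[i*n:(i+1)*n, j*n:(j+1)*n].ravel()
               for i in range(h // n) for j in range(w // n)]
    P = np.array(patches, float)
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    return P / np.maximum(norms, 1e-12)

def nearest_patches(P, idx, k):
    """Indices of the k patches closest in Euclidean distance to patch `idx`.
    Brute force for clarity only; this scales quadratically in the patch count."""
    d = np.linalg.norm(P - P[idx], axis=1)
    d[idx] = np.inf                       # exclude the patch itself
    return np.argsort(d)[:k]
```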
The \(k\) closest patches together with the original patch \(P(i,j)\) are put into a \(n^2\times (k+1)\) matrix, called the non-local self-similar (NLSS) matrix \(N(i,j)\). We are interested in some linear-algebraic properties of this matrix. One observation is that the NLSS matrices tend to be of low rank for most patches. This essentially means that most patches tend to have other patches that look very similar to it. If all patches in \(N(i,j)\) are the same then its rank is 1, whereas if all the patches are different then \(N(i,j)\) is of maximal rank.
However, taking the rank itself is not necessarily a good measure, since it is not numerically stable. Any slight perturbation will always make the matrix of full rank. We rather work with a differentiable approximation of the rank. This approximation is based on the spectrum (singular values) of the matrix. In this case, we can consider the nuclear norm \(\|N(i,j)\|_*\) of \(N(i,j)\). It is defined as the sum of the singular values:
\[\|A\|_* = \sum_{i=1}^n \sigma_i(A),\]where \(\sigma_i(A)\) is the \(i\)th singular value. Below we show how the average singular values change with scale \(\sigma\) of the deconvolution kernel for the NLSS matrices for 8x8 patches with 63 neighbors (so that the NLSS matrix is square). We see that in all cases most of the energy is in the first singular value, followed by a fairly slow decay. As \(\sigma\) increases, the decay of singular values slows down. This means that the more blurry the image, the lower the effective rank of the NLSS matrices. As such, the nuclear norm of the NLSS matrix gives a measure of the amount of information in the picture.
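A self-contained sketch of computing \(\|N(i,j)\|_*\) for one patch, with its own brute-force neighbor search (helper and parameter names are my own):

```python
import numpy as np

def nlss_nuclear_norm(P, idx, k=11):
    """Nuclear norm of the NLSS matrix of patch `idx`: the patch and its k
    nearest patches (rows of P, unit-normalized) stacked as columns.
    A low nuclear norm means the neighborhood is close to low rank."""
    d = np.linalg.norm(P - P[idx], axis=1)
    d[idx] = np.inf                                 # exclude the patch itself
    neighbors = np.argsort(d)[:k]
    N = np.column_stack([P[idx], *P[neighbors]])    # n² × (k+1) matrix
    return np.linalg.svd(N, compute_uv=False).sum()
```

When all patches in the neighborhood are identical the matrix has rank 1 and the nuclear norm is \(\sqrt{k+1}\); mutually orthogonal patches instead give the maximal value \(k+1\).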
We see that the spectrum of the NLSS matrices seems to give a measure of ‘amount of information’ or sparsity. Since we know that sparsity of the edges in an image gives a useful image prior, let’s compute the nuclear norm \(\|N(i,j)\|_*\) of each NLSS matrix of the gradients of the image. We can actually plot these nuclear norms as an image. Below we show this plot of nuclear norms of the NLSS matrices. We can see that the mean nuclear norm is biggest around the ground truth value of \(\sigma\).
It is not immediately clear how to interpret the darker and lighter regions of these plots. Long straight edges seem to have smaller norms since there are many patches that look similar. Since the patches are normalized before being compared, the background tends to look a lot like random noise and hence has relatively high nuclear norm. However, we can’t skip this normalization step either, since then we mostly observe a strict increase in nuclear norms with \(\sigma\).
Repeating the same for the Laplacian gives a similar result:
Now finally to turn this into a useful image prior, we can plot how the mean nuclear norm changes with varying \(\sigma\). Both for the gradients and Laplacian of the image we see a clear maximum near \(\sigma=2\), so this looks like a useful image prior.
There are a few hyperparameters to tinker with for this image prior. There is the size of the patches taken; in practice something like 4x4 to 8x8 seems to work well for the size of images we’re dealing with. We can also lower or increase the number of neighbors computed. Finally, we don’t need to divide the images into patches exactly. We can oversample, and put a spacing of less than \(n\) pixels between consecutive \(n\times n\) patches. This results in a less noisy curve of NLSS nuclear norms, at extra computational cost. On the other hand, we can also undersample and use only a quarter of the patches, which can greatly improve speed.
The image above was made for \(6\times 6\) patches with 36 neighbors. Below we make the same plot with \(6\times 6\) patches, but taking only 1/16th of the patches and only 5 neighbors. This results in a much noisier image, but it runs over 10x faster and still gives a useful approximation.
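Patch extraction with a configurable stride, plus the normalization step, can be sketched as follows (function and parameter names are my own, not from any library; a loop-based sketch rather than an optimized implementation):

```python
import numpy as np

def extract_patches(img, n=6, stride=3):
    """Extract n x n patches with the given stride (stride < n oversamples,
    stride > n undersamples), flattened and normalized to zero mean and
    unit norm before comparison, as done for the NLSS matrices."""
    H, W = img.shape
    patches = []
    for i in range(0, H - n + 1, stride):
        for j in range(0, W - n + 1, stride):
            p = img[i:i + n, j:j + n].ravel().astype(float)
            p -= p.mean()
            norm = np.linalg.norm(p)
            if norm > 0:
                p /= norm
            patches.append(p)
    return np.array(patches)

img = np.arange(100.0).reshape(10, 10)   # tiny stand-in image
dense = extract_patches(img, n=6, stride=1)   # oversampled: smoother curve, slower
sparse = extract_patches(img, n=6, stride=4)  # undersampled: noisier, much faster
```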
One final thing of note is how the NLSS matrices \(N(i,j)\) are computed. Finding the closest \(k\) patches by brute force, computing the distance between each pair of patches, is extremely inefficient. Fortunately there are more efficient ways of solving this similarity search problem. These methods usually first build an index or tree structure storing some information about all the data points. This can be used to quickly find a set of points that are close to the point of interest, and searching only within this set significantly reduces the amount of work. This is especially true if we only care about finding the \(k\) closest points approximately, since this means we can reduce our search space even further.
We used Faiss to solve the similarity search problem, since it is fast and runs on the GPU. There are many packages that do the same, some faster than others depending on the problem. There is also an implementation in sklearn, but for this particular situation it is over two orders of magnitude slower than Faiss running on the GPU.
At the end of the day, the bottleneck is computing the nuclear norm, which in turn requires computing the singular values of tens of thousands of small matrices. Unfortunately CUDA only supports batched SVD computation for matrices of at most 32x32 in size, so only for 5x5 patches or smaller can we do the computation on the GPU; doing so makes this step up to 4x faster on my machine.
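The batched SVD itself is a one-liner: numpy (and likewise `torch.linalg.svdvals` on GPU, subject to the 32x32 batched limit just mentioned) vectorizes over the leading dimensions of a stack of matrices. A sketch with random stand-ins for the NLSS matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
# A stack of small NLSS matrices, e.g. one 26x25 matrix per patch
# (5x5 patches with 25 neighbors) for 10,000 patches.
nlss_stack = rng.standard_normal((10000, 26, 25))

# np.linalg.svd batches over the leading dimension, returning
# all singular values at once.
singular_values = np.linalg.svd(nlss_stack, compute_uv=False)
nuclear_norms = singular_values.sum(axis=1)  # one nuclear norm per patch
```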
The nuclear norms of NLSS matrices seem to give a useful image prior, but to know for sure we need to test it for different images, and also for different types of kernels.
To estimate the best deconvolved image, we take the average of the optimal values according to the NLSS nuclear norms of the gradient and of the Laplacian, since the Laplacian usually underestimates the ground-truth value whereas the gradient usually overestimates it. Furthermore, instead of taking the global maximum as the optimal value, we take the first local maximum: when we oversharpen the image a lot, the resulting artifacts can themselves produce a large NLSS nuclear norm. Detecting a local maximum can be a bit tricky, and if the initial blur is too strong the prior does not seem to work very well.
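Picking the first rather than the global maximum can be sketched like this; the curve below is synthetic (a peak near \(\sigma=2\) plus a spurious rise mimicking oversharpening artifacts), and the fallback behavior is my own choice, not from the post:

```python
import numpy as np

def first_local_maximum(sigmas, scores):
    """Return the sigma at the first interior local maximum of scores,
    falling back to the global argmax if none is found."""
    for i in range(1, len(scores) - 1):
        if scores[i] >= scores[i - 1] and scores[i] > scores[i + 1]:
            return sigmas[i]
    return sigmas[int(np.argmax(scores))]

sigmas = np.linspace(0.5, 6.0, 12)
# Synthetic mean-nuclear-norm curve: peak near sigma=2, plus a slow rise
# at large sigma standing in for oversharpening artifacts.
scores = np.exp(-(sigmas - 2.0) ** 2) + 0.3 * (sigmas / 6.0) ** 4

best = first_local_maximum(sigmas, scores)
```

The estimate for the image would then average `best` over the gradient and Laplacian curves.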
First let’s try semi-blind deconvolution with Gaussian kernels. That is, we know the image was blurred with a Gaussian kernel, but we don’t know with which parameters. We do this for a smaller and a larger value of the standard deviation \(\sigma\), and notice that for smaller \(\sigma\) the recovery is excellent, but once \(\sigma\) becomes too large the recovery fails.
All the images we use are from the COCO 2017 dataset.
First up is an image of a bear, blurred with a \(\sigma=2\) Gaussian kernel. Deblurring this is easy, and not very sensitive to the hyperparameters used.
Here is the same image of the bear, but now blurred with \(\sigma=4\); it becomes much harder to recover the image. I found that the only way to do it is to reduce the patch size all the way down to \(2\times 2\): for larger patch sizes the image can’t be accurately recovered, and the method always overestimates the value of \(\sigma\).
Below is a picture of some food. For \(\sigma=3\) recovery is excellent, and again not strongly dependent on the hyperparameters. For \(\sigma=4\) the problem becomes significantly harder, and it again takes a small patch size to get reasonable results.
Now let’s change the blur kernel to an idealized motion blur kernel. Here the point spread function is a line segment of some specified length and thickness, as shown below:
The way I construct these point spread functions is by rasterizing an image of a line segment. I’m sure there’s a better way to do this, but it seems to work fine. The parameters of the kernel are the angle, the length of the line segment and the size of the kernel.
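One possible rasterization, not necessarily the author's exact construction, is to sample points densely along the segment, accumulate them into a grid, and normalize (this sketch ignores the thickness parameter for simplicity):

```python
import numpy as np

def motion_blur_psf(size, length, angle, samples=1000):
    """Rasterize a centered line segment of the given length and angle
    (in radians) into a size x size kernel that sums to 1."""
    psf = np.zeros((size, size))
    c = (size - 1) / 2.0  # center of the kernel
    t = np.linspace(-length / 2.0, length / 2.0, samples)
    xs = np.clip(np.round(c + t * np.cos(angle)).astype(int), 0, size - 1)
    ys = np.clip(np.round(c + t * np.sin(angle)).astype(int), 0, size - 1)
    np.add.at(psf, (ys, xs), 1.0)  # accumulate samples into pixels
    return psf / psf.sum()

psf = motion_blur_psf(size=9, length=5, angle=np.pi / 4)  # diagonal motion blur
```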
Let’s try to apply the method on a picture of some cows below:
Unfortunately our current method doesn’t work well with this kind of point spread function: the nuclear norm of the NLSS matrices is very noisy. I first thought this could be because the PSF doesn’t change continuously with the length of the line segment, but I ruled this out by hard-coding a diagonal line segment in such a way that it does change continuously, and the result looks just as bad.
Instead it seems that the (non-blind) deconvolution method itself doesn’t work well for this kernel. Below we see the image blurred with a length-5 diagonal motion blur, and then deconvolved with different values of the length parameter. With the Gaussian blur we only saw significant deconvolution artifacts when trying to oversharpen an image; here we see very significant artifacts even when the length parameter is less than 5. I think this is because the point spread function is very discontinuous, and hence its Fourier transform is very irregular.
Additionally, the effect of motion blur on edges differs from that of Gaussian blur. If an edge is parallel to the motion blur, it is unaffected or even enhanced; if it is orthogonal to the direction of the blur, it is destroyed quickly. This may mean that the sparse gradient prior is not as effective as it is for Gaussian blur. We have no good way to check this, however, before improving the deconvolution method.
Having a good image prior is vital for blind deconvolution. Making a good image prior is however quite difficult. Most image priors are based on the idea that natural images have sparsely distributed gradients. We observed that the simple and easy-to-compute \(\ell_1/\ell_2\) prior does a decent job, but isn’t quite good enough. The more complex NLSS nuclear norm prior does a much better job. Using this prior we can do partially blind deconvolution, sharpening an image blurred with Gaussian blur.
However, another vital ingredient for blind deconvolution is good non-blind deconvolution. The current non-blind deconvolution method, introduced in the last part, doesn’t work well for non-continuous or sparse point spread functions. There are also problems with artifacts at the boundaries of the image (which I have hidden for now by essentially cheating). This means that if we want to do good blind deconvolution, we first need to revisit non-blind deconvolution and improve our methods.