Solving large linear systems in real-time applications
In many practical examples solving large linear systems in real-time is the key issue. This is the case, e.g., for our ROMSOC project “Real Time Computing Methods for Adaptive Optics”, where we are dealing with a rapidly changing atmosphere of the earth. Real-time usually refers to a time frame within milliseconds. For our specific test configuration, it is 2 ms.
Before talking about efficient solvers for linear systems we need to define what is a linear system and what is large. A linear system is a system of linear equations Ax = b, where A is a matrix of size n x n, b is the right hand side vector of size n and x is the desired solution. In this context, large refers to an n up to several hundred thousand. The larger the size of the matrix, the harder the system is to solve in real-time. Note, that in our setting we are dealing with a dense square matrix A.
Now that we have stated the mathematical problem formulation, we need to fix some performance criteria to compare the different ways to solve a linear system. What does it mean to have a good or efficient solver? There are plenty of ways to solve such systems, either directly or iteratively. However, there are some quantities that can be checked in advance no matter what specific solver you want to analyze in order to be able to choose a suitable one. In the end, efficient always refers somehow to the run-time of the algorithm. But the run-time heavily depends on the hardware and the implementation, which is usually not done in a preliminary state. Hence, it would be nice to have some criteria that can be checked in advance before implementing the whole method.
In mathematics, solvers are often compared using the computational complexity, which simply counts the number of operations needed to solve the problem. Usually, this quantity is given asymptotically in the order of n. This becomes clearer if we look at some examples: Adding two vectors of length n is of linear complexity O(n), because we need to perform n additions to obtain the result. A matrix vector multiplication is of quadratic complexity O(n2). However, this criterion does not take into account if a matrix is sparse or dense. Note that we say a matrix is sparse, if it has a considerable number of zero entries. A more precise measure in this case is the number of floating point operations (FLOPs). When performing a dense matrix vector multiplication, we require 2n2-n FLOPs to get the solution, however, a sparse matrix vector multiplication needs only 2nnz2-nnz FLOPS. Here, nnz refers to the non-zero entries of A. The above two quantities are suitable for a first, theoretical performance analysis. A next step is to look at parallelization possibilities of the method. We call a method parallelizable, if the whole algorithm or at least the main parts can be executed in parallel, i.e., without a dependency on intermediate results. The big advantage here is that the run-time is considerably reduced while the number of FLOPs stays the same. Most of the real-time applications require massively parallel algorithms in order to be able to solve the problem in the desired time frame. The efficiency of parallelization depends on the hardware in use. GPUs have an enormous computational throughput compared to CPUs and perform extremely well for highly parallelizable algorithms. Some applications can also benefit a lot from a so called matrix-free representation, i.e., the matrix A is formulated as a linear function rather than a sparse or dense matrix. Let us illustrate this on the example of the matrix A=vvT. Saving the matrix requires n2 units of memory and performing a matrix vector multiplication with a vector x needs 2n2-n FLOPs. However, if we save only the vector v instead of the whole matrix this requires just n units of storage and multiplying x first by vT and then by v needs only 3n FLOPs. In this example we used the term memory usage, which denotes the units of memory required to store a matrix or vector. This is another important criterion, since the memory of hardware is limited and storing matrices can become quite memory intensive.
Altogether, we listed here 5 criterions based on whom you can decide for a solver how well it is suited for your application. Nevertheless, in the end you have to implement and optimize the algorithm for your specific hardware, which can become a very crucial and time demanding task.
About the author
My name is Bernadett Stadler and I studied Industrial Mathematics at the Johannes Kepler University in Linz. Since May 2018 I am part of the ROMSOC program as a PhD student. I am involved in a cooperation of the university in Linz and the company Microgate in Bolzano. My research addresses the improvement of the image quality of extremely large telescopes. The goal is to enable a sharper view of more distant celestial objects.