
Sparse Signal Restoration

Introduction

These notes describe an approach for the restoration of degraded signals using sparsity. This approach, which has become quite popular, is useful for numerous problems in signal processing: denoising, deconvolution, interpolation, super-resolution, declipping, etc. [6].
We assume that the observed signal y can be written as
y = Hx + n     (1)
where x is the signal of interest which we want to estimate, n is additive noise, and H is a matrix representing the observation process. For example, if we observe a blurred version of x, then H will be a convolution matrix.
The estimation of x from y can be viewed as a linear inverse problem. A standard approach to solve linear inverse problems is to define a suitable objective function J(x) and to find the signal x minimizing J(x). Generally, the chosen objective function is the sum of two terms:
J(x) = D(y, Hx) + λ R(x)     (2)
where D(y, Hx) measures the discrepancy between y and Hx and R(x) is a regularization term (or penalty function). The parameter λ is called the regularization parameter and is used to adjust the trade-off between the two terms; λ should be a positive value. On one hand, we want to find a signal x so that Hx is very similar to y; that is, we want to find a signal x which is consistent with the observed data y. For D(y, Hx), we will use the mean square error, namely
D(y, Hx) = ||y - Hx||_2^2.     (3)
The notation ||v||_2^2 represents the sum of squares of the elements of the vector v,
||v||_2^2 := v_1^2 + v_2^2 + ... + v_N^2.     (4)
Minimizing this D(y, Hx) will give a signal x which is as consistent with y as possible according to the square error criterion. We could try to minimize D(y, Hx) by setting x = H^{-1} y; however, H may not be invertible. Even if H were invertible, it may be very ill-conditioned, in which case this solution amplifies the noise, sometimes so much that the solution is useless. The role of the regularization term R(x) is exactly to address this problem. The regularizer R(x) should be chosen so as to penalize undesirable/unwanted behaviour in x. For example, if it is expected that x is a smooth signal, then R(x) should be chosen to penalize non-smooth signals; for example, one could set R(x) = Σ_n |x(n) - x(n-1)|. (This R(x) is called the `total variation' of x and is zero only when x is a constant-valued signal. The more the signal x varies, the greater is its total variation.)
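For instance, the total variation penalty above can be computed in MATLAB in a single line (a minimal sketch, assuming x is a vector):
   R = sum(abs(diff(x)));    % total variation: sum of |x(n) - x(n-1)|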
In these notes, it is assumed that the signal of interest, x, is known to be sparse. That is, x has relatively few non-zero values. For example, x may consist of a few impulses and is otherwise zero. In this case, we should define R(x) to be the number of non-zero values of x. Unfortunately, with R(x) defined as such, the objective function J(x) is very difficult to minimize. First, this R(x) is not differentiable. Second, and more importantly, this R(x) is not a convex function of x, and therefore J(x) will have many local minima. In order to develop robust numerical algorithms for the minimization of J(x), it is best that J(x) be a convex function of x. We would like a function that measures sparsity, but which is also convex. For this reason, when x is known to be sparse, the regularization function R(x) is often chosen to be the ℓ1-norm, which is defined as
||x||_1 := |x_1| + |x_2| + ... + |x_N|.     (5)
Hence, the approach is to estimate x from y by minimizing the objective function
J(x) = ||y - Hx||_2^2 + λ ||x||_1.     (6)
This is called an ℓ1-norm regularized linear inverse problem.
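For reference, the objective function in Equation 6 can be evaluated directly in MATLAB (a minimal sketch, assuming H is a matrix and x, y are column vectors; this is the same expression computed inside the programs later in these notes):
   J = sum(abs(y - H*x).^2) + lambda*sum(abs(x));    % ||y - Hx||_2^2 + lambda ||x||_1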
The development of fast algorithms to minimize Equation 6, and related functions, is an active research topic. An early and important algorithm is the iterated soft-thresholding algorithm (ISTA), also called the thresholded-Landweber (TL) algorithm. This algorithm was developed in [9], [4] for the purpose of wavelet-based signal restoration, but the algorithm in a more general form appeared earlier in the optimization literature; see [2]. Since the development of ISTA, other faster algorithms have been introduced for the minimization of Equation 6, for example [8], [13], [2], [7], [1], [10], [5], [14], [11], [12], [3], and others. Some of these algorithms can be used for the minimization of more general or related objective functions as well, not just Equation 6.
These notes describe the derivation of the iterated soft-thresholding algorithm (ISTA). ISTA is a combination of the Landweber algorithm and soft-thresholding (so it is also called the thresholded-Landweber algorithm). The derivation below is based on majorization-minimization (a concept in optimization theory) and on ℓ1-norm regularized denoising (which leads to soft-thresholding). The derivation uses linear algebra (positive definite matrices) and vector derivatives.

Majorization-Minimization

Majorization-minimization (MM) replaces a difficult minimization problem by a sequence of easier minimization problems. The MM approach generates a sequence of vectors x_k, k = 0, 1, 2, ..., which converges to the desired solution. The MM approach is useful for the minimization of Equation 6 because J(x) cannot be easily minimized: there is no formula for the vector x that minimizes J(x).
The majorization-minimization idea can be described as follows. Suppose we have a vector x_k, a `guess' for the minimizer of J(x). Based on x_k, we would like to find a new vector, x_{k+1}, which further decreases J(x); that is, we want to find x_{k+1} such that
J(x_{k+1}) < J(x_k).     (7)
The MM approach asks us first to choose a new function which majorizes J(x) and, second, to minimize this new function to obtain x_{k+1}. MM puts some requirements on this new function, call it G(x). We should choose G(x) such that G(x) ≥ J(x) for all x (that is what it means for G(x) to majorize J(x)). In addition, G(x) should equal J(x) at x_k. We find x_{k+1} by minimizing G(x). For this method to be useful, we should choose G(x) to be a function we can minimize easily. The function G(x) will be different at each iteration, so we denote it G_k(x).
Figure 1: Illustration of the majorization-minimization (MM) procedure. Plot (a) illustrates the function to be minimized, J(x), and a `guess' of the minimizer, x_k. Plot (b) illustrates the majorizing function G_k(x), which coincides with J(x) at x = x_k but is otherwise greater than J(x). The minimizer of G_k(x), shown as a green dot in (b), is taken as x_{k+1}. Plot (c) illustrates the subsequent iteration. In (c) the majorizer coincides with J(x) at x = x_{k+1}. The iterates x_k converge to the minimum of J(x). The MM procedure is useful when J(x) is difficult to minimize but each majorizer G_k(x) is easy to minimize.
[Figure 1(a): MM_fig1.jpg; Figure 1(b): MM_fig2.jpg; Figure 1(c): MM_fig3.jpg]
Figure 1 illustrates the MM procedure with a simple example. For clarity, the figure illustrates the minimization of a function of one variable. However, the MM procedure works in the same way for the minimization of multivariate functions, and it is in the multivariate case that the MM procedure is especially useful. (In the following sections, MM will be used for the minimization of multivariate functions.)
To summarize, the majorization-minimization algorithm for the minimization of a function J(x) is given by the following iteration:
  1. Set k = 0. Initialize x_0.
  2. Choose G_k(x) such that
    1. G_k(x) ≥ J(x) for all x
    2. G_k(x_k) = J(x_k)
  3. Set x_{k+1} as the minimizer of G_k(x).
  4. Set k = k + 1 and go to step 2.
More details about the MM procedure are provided in [8] and the references therein.

The Landweber Iteration

Although we are interested in minimizing J(x) in Equation 6, let us consider first the minimization of the simpler objective function
J(x) = ||y - Hx||_2^2
     = (y - Hx)^T (y - Hx)
     = y^T y - 2 y^T H x + x^T H^T H x.     (8)
Because J(x) in Equation 8 is differentiable and convex, we can obtain its minimizer by setting the derivative with respect to x to zero. The derivative of J(x) is given by
∂/∂x J(x) = -2 H^T y + 2 H^T H x.     (9)
Setting the derivative to zero gives a system of linear equations:
∂/∂x J(x) = 0   ⟹   (H^T H) x = H^T y.     (10)
So the minimizer of J(x) in Equation 8 is given by
x = (H^T H)^{-1} H^T y.     (11)
Therefore, J(x) in Equation 8 can be minimized by solving a linear system of equations. However, we may not be able to solve these equations easily. For example, if x is a very long signal, then H will be a very large matrix and solving the system of equations may require too much memory and computation time. Additionally, the matrix H^T H might not be invertible, or it may be very ill-conditioned.
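For small problems, the solution in Equation 11 can be computed directly in MATLAB (a sketch, assuming H^T H is invertible and small enough to form explicitly):
   x = (H'*H) \ (H'*y);    % solve (H'*H) x = H'*y  (Equation 10)
   % Equivalently, the backslash operator returns the least-squares solution:
   % x = H \ y;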
By using the majorization-minimization (MM) approach to minimize J(x) in Equation 8, we can avoid solving a system of linear equations. As described in "Majorization-Minimization", at each iteration k of the MM approach we should find a function G_k(x) that coincides with J(x) at x_k but otherwise upper-bounds J(x). We should choose a majorizer G_k(x) which can be minimized more easily (without having to solve a system of equations).
We will find a function G_k(x) that majorizes J(x) by adding a non-negative function to J(x),
G_k(x) = J(x) + non-negative function of x.     (12)
In order that G_k(x) coincides with J(x) at x = x_k, the non-negative function we add to J(x) should be equal to zero at x_k. With this in mind, we choose G_k(x) to be:
G_k(x) = ||y - Hx||_2^2 + (x - x_k)^T (α I - H^T H) (x - x_k).     (13)
The function we have added to J(x) is clearly zero at x = x_k, so we have G_k(x_k) = J(x_k) as desired.
To ensure that the function we have added to J(x) is non-negative for all x, the scalar parameter α must be chosen to be greater than or equal to the maximum eigenvalue of H^T H,
α ≥ maxeig(H^T H).     (14)
Then the matrix α I - H^T H is a positive semi-definite matrix, meaning that
v^T (α I - H^T H) v ≥ 0   for all v.     (15)
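In MATLAB, a suitable value of α can be computed or estimated as follows (a sketch; for a very large H the exact eigenvalue computation may be impractical):
   alpha = max(eig(H'*H));    % a valid choice of alpha (Equation 14)
   % For large H, the largest singular value of H can be estimated instead;
   % its square approximates maxeig(H'*H):
   % alpha = normest(H)^2;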
Now, following the MM procedure, we need to minimize G_k(x) to obtain x_{k+1}. Expanding G_k(x) in Equation 13 gives
G_k(x) = y^T y - 2 y^T H x + x^T H^T H x + (x - x_k)^T (α I - H^T H) (x - x_k)
       = y^T y + x_k^T (α I - H^T H) x_k - 2 (y^T H + x_k^T (α I - H^T H)) x + α x^T x.     (16)
Note that the quadratic term in Equation 16 is simply α x^T x instead of x^T H^T H x as it was in Equation 8. Therefore we can minimize G_k(x) more easily:
∂/∂x G_k(x) = -2 H^T y - 2 (α I - H^T H) x_k + 2 α x     (17)

∂/∂x G_k(x) = 0   ⟹   x = x_k + (1/α) H^T (y - H x_k)     (18)
Hence, through the MM procedure, we obtain the update equation
x_{k+1} = x_k + (1/α) H^T (y - H x_k)     (19)
which is known as the `Landweber' iteration. It is guaranteed that J(x) in Equation 8 is decreased with each iteration, due to the properties of the MM procedure. Note that Equation 19 does not require the solution of a linear system of equations; it only requires multiplying by H and by H^T.
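A minimal MATLAB sketch of the Landweber iteration (Equation 19), assuming H, y, and alpha are already defined:
   Nit = 50;                             % number of iterations (example value)
   x = zeros(size(H,2), 1);              % initialize x to zero
   for k = 1:Nit
       x = x + (H'*(y - H*x))/alpha;     % Landweber update (Equation 19)
   end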
To emphasize the functional dependence of G_k(x) on x, we can write G_k(x) in Equation 16 as
G_k(x) = α (-2 b^T x + x^T x) + c     (20)
where
b = (1/α) (H^T y + (α I - H^T H) x_k) = x_k + (1/α) H^T (y - H x_k)     (21)
and where c consists of the first two terms of Equation 16, which do not depend on x. Note that for any vectors b and x,
b^T b - 2 b^T x + x^T x = ||b - x||_2^2,     (22)
hence from Equation 20 we can write G_k(x) as
G_k(x) = α ||b - x||_2^2 - α b^T b + c.     (23)
That is, using Equation 21, the majorizer G_k(x) can be written as
G_k(x) = α || x_k + (1/α) H^T (y - H x_k) - x ||_2^2 + K     (24)
where K does not depend on x. This shows that G_k(x) is shaped like a bowl with circular level sets.

Soft-Thresholding

As in "The Landweber Iteration", let us consider the minimization of an objective function simpler than Equation 6. Namely, let us consider the minimization of the function
J(x) = ||y - x||_2^2 + λ ||x||_1.     (25)
This is the simple special case of Equation 6 with H = I.
The function J(x) in Equation 25 is convex, but not differentiable (because ||x||_1 is not differentiable). Nevertheless, the minimizer of this J(x) is given by a simple formula.
Expanding J(x) in Equation 25 gives
J(x) = (y_1 - x_1)^2 + λ |x_1| + (y_2 - x_2)^2 + λ |x_2| + ... + (y_N - x_N)^2 + λ |x_N|.     (26)
Note that the variables x_i are uncoupled. That is, the function J(x) can be minimized by minimizing each term (y_i - x_i)^2 + λ |x_i| individually to get x_i, for 1 ≤ i ≤ N. Therefore we need only consider the scalar minimization of the function
f(x) = (y - x)^2 + λ |x|.     (27)
Taking the derivative,
f'(x) = -2 (y - x) + λ sign(x),     (28)
and setting f'(x) = 0 gives
y = x + (λ/2) sign(x).     (29)
Solving for x gives the graph shown in Figure 2 with threshold λ/2. That is, the minimizer of f(x) is obtained by applying the soft-threshold rule to y with threshold λ/2.
The soft-threshold rule is the non-linear function defined as
soft(x, T) := { x + T,   x ≤ -T
              { 0,       |x| ≤ T
              { x - T,   x ≥ T
     (30)
or more compactly, as
soft(x, T) := sign(x) max(0, |x| - T).     (31)
The soft-threshold rule is illustrated in Figure 2. In terms of the soft-threshold rule, the minimizer of f(x) in Equation 27 is given by
x = soft(y, λ/2).     (32)
Figure 2: The soft-threshold rule Equation 31 with threshold T=2.
Figure 2 (SoftFun.jpg)
Because the variables in the function J(x) in Equation 25 are uncoupled and the solution is obtained by minimizing with respect to each x_i individually, the minimizer of J(x) is obtained by applying the soft-threshold rule to each element of y:
x = soft(y, λ/2).     (33)
The minimization of Equation 25 therefore does not require an iterative algorithm; it is accomplished simply by soft-thresholding each element of y with threshold λ/2.
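In MATLAB, the soft-threshold rule of Equation 31 can be written in a few lines. The following is a possible implementation of the soft function called by the programs in the sections below:
function y = soft(x, T)
% y = soft(x, T)
% Soft-threshold rule (Equation 31), applied element-wise:
%   y = sign(x) .* max(|x| - T, 0)
y = sign(x) .* max(abs(x) - T, 0);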

Iterated Soft-Thresholding Algorithm (ISTA)

In this section, we consider the minimization of the objective function J(x) in Equation 6, which we repeat here:
J(x) = ||y - Hx||_2^2 + λ ||x||_1.     (34)
This J(x) can be minimized by combining the results of Sections "The Landweber Iteration" and "Soft-Thresholding".
Applying the MM approach, we wish to find a majorizer of J(x) which coincides with J(x) at x_k and which is easily minimized. We can add to J(x) the same non-negative function as in Equation 13 in "The Landweber Iteration",
G_k(x) = J(x) + (x - x_k)^T (α I - H^T H) (x - x_k).     (35)
By design, G_k(x) coincides with J(x) at x_k. We need to minimize G_k(x) to get x_{k+1}. With Equation 34, we obtain
G_k(x) = ||y - Hx||_2^2 + (x - x_k)^T (α I - H^T H) (x - x_k) + λ ||x||_1     (36)
which is the same as Equation 13 except that here we have the additional term λ ||x||_1. Using Equation 24, we can write G_k(x) as
G_k(x) = α || x_k + (1/α) H^T (y - H x_k) - x ||_2^2 + λ ||x||_1 + K     (37)
where K is a constant with respect to x. Minimizing G_k(x) is equivalent to minimizing (1/α) G_k(x), so x_{k+1} is obtained by minimizing
|| x_k + (1/α) H^T (y - H x_k) - x ||_2^2 + (λ/α) ||x||_1.     (38)
We omit the additive constant term because the vector x minimizing G_k(x) is not influenced by it.
Note that Equation 38 has exactly the same form as Equation 25, which is minimized by the formula in Equation 33. Therefore, minimizing Equation 38 is achieved by the following soft-thresholding equation:
x_{k+1} = soft( x_k + (1/α) H^T (y - H x_k), λ/(2α) )     (39)
where α ≥ maxeig(H^T H). This gives the iterated soft-thresholding algorithm (ISTA).
The MATLAB program ista implements the iterated soft-thresholding algorithm.
function [x,J] = ista(y,H,lambda,alpha,Nit)
% [x, J] = ista(y, H, lambda, alpha, Nit)
% L1-regularized signal restoration using the iterated
% soft-thresholding algorithm (ISTA)
% Minimizes J(x) = norm2(y-H*x)^2 + lambda*norm1(x)
%  INPUT
%   y - observed signal
%   H - matrix or operator
%   lambda - regularization parameter
%   alpha - need alpha >= max(eig(H'*H))
%   Nit - number of iterations
% OUTPUT
%   x - result of deconvolution
%   J - objective function
 
J = zeros(1, Nit);       % Objective function
x = 0*H'*y;              % Initialize x
T = lambda/(2*alpha);
for k = 1:Nit
    Hx = H*x;
    J(k) = sum(abs(Hx(:)-y(:)).^2) + lambda*sum(abs(x(:)));
    x = soft(x + (H'*(y - Hx))/alpha, T);
end

Example

Figures 3 and 4 illustrate an example of the ℓ1-norm regularized signal restoration procedure.
A clean sparse signal, x(n), of 100 samples in duration is illustrated in Figure 3. The observed signal is obtained by convolving x(n) with the impulse response
h = [1, 2, 3, 4, 3, 2, 1]/16     (40)
and adding white Gaussian noise with standard deviation 0.05.
The result of 500 iterations of ISTA is illustrated in Figure 4. The decay of the objective function J(x) is also illustrated. The restored signal is not exactly the same as the original signal, but it is quite close.
To implement this example in MATLAB, we used the following commands:
   h = [1 2 3 4 3 2 1]/16;
   N = 100;
   H = convmtx(h',N);
   lambda = 0.1;
   alpha = 1;
   Nit = 500;
   [x, J] = ista(y, H, lambda, alpha, Nit);
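The observed signal y in these commands is assumed to have been generated as described above. One possible way to synthesize it is shown below (a sketch; the spike locations and amplitudes are hypothetical, not necessarily those used in the figures):
   N = 100;
   xtrue = zeros(N, 1);
   xtrue([20 45 70]) = [2 -1 1.5];            % hypothetical sparse spikes
   h = [1 2 3 4 3 2 1]/16;
   y = conv(h', xtrue) + 0.05*randn(N+6, 1);  % blur with h and add noise (std 0.05)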
Figure 3: Example 1: The sparse signal x and the observed signal y=h*x+noise.
Figure 3 (Example1_fig1.jpg)
Figure 4: Example 1: The estimated signal obtained by minimizing the objective function Equation 6. The objective function J(x) versus iteration number.
Figure 4 (Example1_fig2.jpg)

Convolution and its transpose

Convolution can be represented by matrix-vector multiplication:
Hx = [ h0                  ] [ x0 ]
     [ h1  h0              ] [ x1 ]
     [ h2  h1  h0          ] [ x2 ]
     [     h2  h1  h0      ] [ x3 ]
     [         h2  h1  h0  ] [ x4 ]
     [             h2  h1  ]
     [                 h2  ]
                                        (41)
In MATLAB, we can use conv(h,x) to implement Equation 41. We do not need to perform matrix-vector multiplication. Using conv instead of the matrix H is preferable in a computer program so as to avoid storing the matrix (the matrix H will be very large when the signal x is long).
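As a quick sanity check (a sketch using an arbitrary random test vector), conv can be compared with explicit multiplication by the convolution matrix:
   h = [1 2 3 4 3 2 1]/16;
   N = 100;
   H = convmtx(h', N);                   % (N+6) x N convolution matrix
   x = randn(N, 1);
   err = max(abs(conv(h', x) - H*x));    % should be at round-off level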
The matrix H^T is similar but not identical:

H^T g = [ h0  h1  h2                  ] [ g0 ]
        [     h0  h1  h2              ] [ g1 ]
        [         h0  h1  h2          ] [ g2 ]
        [             h0  h1  h2      ] [ g3 ]
        [                 h0  h1  h2  ] [ g4 ]
                                         [ g5 ]
                                         [ g6 ]
                                                  (42)
This is a truncation of convolution with a reversed version of h:
[ h2                          ] [ g0 ]
[ h1  h2                      ] [ g1 ]
[ h0  h1  h2                  ] [ g2 ]
[     h0  h1  h2              ] [ g3 ]
[         h0  h1  h2          ] [ g4 ]
[             h0  h1  h2      ] [ g5 ]
[                 h0  h1  h2  ] [ g6 ]
[                     h0  h1  ]
[                         h0  ]
                                        (43)
So we can implement f = H^T g in MATLAB using the program convt.
function f = convt(h, g)
% f = convt(h, g)
% Transpose convolution: f = H'*g, where H is the convolution matrix of h.
% Implemented as convolution with the time-reversed h, keeping only the
% samples Nh through Ng of the full convolution.
Nh = length(h);
Ng = length(g);
f = conv(h(Nh:-1:1), g);   % full convolution with reversed h
f = f(Nh:Ng);              % truncate to length Ng - Nh + 1
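A similar check (again a sketch with arbitrary random data) compares convt with explicit multiplication by H^T:
   h = [1 2 3 4 3 2 1]/16;
   N = 100;
   H = convmtx(h', N);
   g = randn(size(H,1), 1);
   f1 = convt(h, g);
   f2 = H'*g;
   err = max(abs(f1(:) - f2(:)));        % should be at round-off level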
The MATLAB program ista_fns uses function handles for H and Ht instead of matrices, so it is not necessary to create the matrix H in MATLAB. Instead, it is necessary to supply a function for each of H and Ht.
function [x,J] = ista_fns(y,H,Ht,lambda,alpha,Nit)
% [x, J] = ista_fns(y, H, Ht, lambda, alpha, Nit)
% L1-regularized signal restoration using the iterated
% soft-thresholding algorithm (ISTA)
% Minimizes J(x) = norm2(y-H*x)^2 + lambda*norm1(x)
%  INPUT
%   y - observed signal
%   H - function handle
%   Ht - function handle for H'
%   lambda - regularization parameter
%   alpha - need alpha >= max(eig(H'*H))
%   Nit - number of iterations
% OUTPUT
%   x - result of deconvolution
%   J - objective function
 
J = zeros(1, Nit);       % Objective function
x = 0*Ht(y);             % Initialize x
T = lambda/(2*alpha);
for k = 1:Nit
    Hx = H(x);
    J(k) = sum(abs(Hx(:)-y(:)).^2) + lambda*sum(abs(x(:)));
    x = soft(x + (Ht(y - Hx))/alpha, T);
end
For example, in the Example where H represents convolution, we can specify function handles as follows:
   h = [1 2 3 4 3 2 1]/16;
   H = @(x) conv(h,x);
   Ht = @(y) convt(h,y);
   lambda = 0.1;
   alpha = 1;
   Nit = 500;
   [x, J] = ista_fns(y, H, Ht, lambda, alpha, Nit);
The result is exactly the same as using the ista program.

Fast ISTA

To complete: compare with FISTA [2].

Conclusion

To be completed....

Vector Derivatives

Suppose x is an N-point vector,
x = [ x_1, x_2, ..., x_N ]^T,     (44)
then the derivative of a function f(x) with respect to x is the vector of derivatives,
∂/∂x f(x) = [ ∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_N ]^T.     (45)
By direct calculation, we have
∂/∂x (b^T x) = b.     (46)
Suppose that A is a symmetric real matrix, A^T = A. Then, by direct calculation, we also have
∂/∂x (x^T A x) = 2 A x     (47)
and
∂/∂x ((y - x)^T A (y - x)) = 2 A (x - y).     (48)
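As a quick numerical check of Equation 47 (a sketch with arbitrary random data), the analytic gradient 2Ax can be compared with a central-difference approximation:
   N = 5;
   A = randn(N);  A = (A + A')/2;        % a symmetric matrix
   x = randn(N, 1);
   g = 2*A*x;                            % analytic gradient of x'*A*x (Equation 47)
   e = 1e-6;
   g_fd = zeros(N, 1);
   for i = 1:N
       d = zeros(N, 1);  d(i) = e;
       g_fd(i) = ((x+d)'*A*(x+d) - (x-d)'*A*(x-d)) / (2*e);
   end
   err = max(abs(g - g_fd));             % should be near round-off level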

REFERENCES

  1. Bioucas-Dias, J. M. and Figueiredo, M. A. T. (2007, December). A New TwIST: Two-Step Iterative Shrinkage/Thresholding Algorithms for Image Restoration. IEEE Trans. on Image Processing, 16(12), 2992-3004.
  2. Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imag. Sci., 2(1), 183-202.
  3. Combettes, P. L. and Pesquet, J.-C. (2007). Proximal Thresholding Algorithm for Minimization over Orthonormal Bases. SIAM J. on Optimization, 18(4), 1351-1376.
  4. Daubechies, I. and Defrise, M. and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math., LVII, 1413-1457.
  5. Daubechies, I. and Fornasier, M. and Loris, I. (2008, December). Accelerated Projected Gradient Method for Linear Inverse Problems with Sparsity Constraints. Journal of Fourier Analysis and Applications, 14, 764-792.
  6. Daubechies, I. and Teschke, G. and Vese, L. (2008). On some iterative concepts for image restoration. In Advances in Imaging and Electron Physics (Vol. 150, pp. 2-51). Elsevier.
  7. Elad, M. and Matalon, B. and Zibulevsky, M. (2007). Coordinate and subspace optimization methods for linear least squares with non-quadratic regularization. J. of Appl. and Comp. Harm. Analysis, 23, 346-367.
  8. Figueiredo, M. and Bioucas-Dias, J. and Nowak, R. (2007, December). Majorization-minimization algorithms for wavelet-based image restoration. IEEE Trans. on Image Processing, 16(12), 2980-2991.
  9. Figueiredo, M. and Nowak, R. (2003, August). An EM algorithm for wavelet-based image restoration. IEEE Trans. on Image Processing, 12(8), 906-916.
  10. Figueiredo, M. A. T. and Nowak, R. D. and Wright, S. J. (2007, December). Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems. IEEE J. Sel. Top. Signal Process., 1(4), 586-598.
  11. Goldstein, T. and Osher, S. (2009). The Split Bregman Method for L1-Regularized Problems. SIAM J. Imag. Sci., 2(2), 323-343.
  12. Hale, E. T. and Yin, W. and Zhang, Y. (2008). Fixed-Point Continuation for ℓ1-Minimization: Methodology and Convergence. SIAM J. on Optimization, 19(3), 1107-1130.
  13. Wright, S. J. and Nowak, R. D. and Figueiredo, M. A. T. (2009, July). Sparse Reconstruction by Separable Approximation. IEEE Trans. on Signal Processing, 57(7), 2479-2493.
  14. Wang, Y. and Yang, J. and Yin, W. and Zhang, Y. (2008). A new alternating minimization algorithm for total variation image reconstruction. SIAM J. on Imaging Sciences, 1(3), 248-272.
