

Convergence of gradient descent for learning linear neural networks

Abstract

We study the convergence properties of gradient descent for training deep linear neural networks, i.e., deep matrix factorizations, by extending a previous analysis for the related gradient flow. We show that under suitable conditions on the stepsizes, gradient descent converges to a critical point of the loss function, i.e., the square loss in this article. Furthermore, we demonstrate that for almost all initializations, gradient descent converges to a global minimum in the case of two layers. In the case of three or more layers, we show that gradient descent converges to a global minimum on the manifold of matrices of some fixed rank, where the rank cannot be determined a priori.

1 Introduction

Deep learning is arguably the most widely used and successful machine learning method, which has led to spectacular breakthroughs in various domains such as image recognition, autonomous driving, machine translation, medical imaging and many more. Despite its widespread use, the understanding of the mathematical principles of deep learning is still in its early stages and has not yet been fully developed. In particular, widely open questions concern the convergence properties of the commonly used (stochastic) gradient descent (S)GD algorithms for learning a deep neural network from training data: Does (S)GD always converge to a critical point of the loss function? Does it converge to a global minimum? Does the network learned via (S)GD generalize well to unseen data? We contribute to the first two questions in the case of GD for linear neural networks.

To approach these questions, we study gradient descent for learning a deep linear network, i.e., a network with activation function being the identity, or in other words, learning a deep matrix factorization. While linear neural networks are not expressive enough for most practical applications, the theoretical study of gradient descent for linear neural networks is highly nontrivial and, therefore, expected to be very valuable. The difficulty in deriving mathematical convergence guarantees stems from the fact that the functional to be minimized is non-convex in the individual matrices of the factorization. We are convinced that the case of linear networks should be well-understood before passing to the more difficult (but more practically relevant) case of nonlinear networks. We expect that some principles (though not all) will carry over to the nonlinear case, and the mathematical analysis of the linear case will provide valuable insights.

This article is a continuation of the work started in [5], where the gradient flow related to learning a deep linear network via minimization of the square loss was analyzed theoretically. Extending earlier contributions [2, 3, 7], it was shown in [5] that gradient flow always converges to a critical point of the square loss. Moreover, for almost all initializations, it converges to a global minimizer in the case of two layers. It is conjectured that this result also holds for more than two layers, but currently, it is only shown in [5] that for more layers, gradient flow converges to the global minimum of the loss function restricted to the manifold of matrices of some fixed rank k for almost all initializations, where unfortunately the result does not allow one to determine k a priori.

We note here that the square loss in connection with linear networks has the nice property that all local minimizers are global, see [18], so that our analysis boils down to proving that (strict) saddle points are avoided almost surely. This remarkable property of the square loss is very specific and connected to the notion of Euclidean distance degree and properties of the manifold of fixed rank matrices, see [26, Appendix A.2] for more details.

As another interesting discovery, [5] considers the flow of the product matrix resulting from the gradient flow for the individual matrices in the factorization and identifies this flow of the product matrix as a Riemannian gradient flow. More precisely, the flow of the product matrix takes place on the manifold of matrices of a fixed rank k with respect to a nontrivial and explicitly given Riemannian metric on that manifold. This result requires that at initialization, the tuple of individual matrices is balanced, a term that the authors of [2] introduced. It is important to note that balancedness is preserved by the gradient flow, i.e., this property is related to the natural invariant set of the flow.

In this article, we extend the convergence analysis in [5] from gradient flow to gradient descent. Under certain conditions on the stepsizes, we show that the gradient descent iterations converge to a critical point of the square loss function. Moreover, for almost all initializations, convergence is towards a global minimum in the case of two layers, while for more than two layers, we obtain the analog of the main result in [5]: for almost all initializations, the product matrix converges to a global minimum of the square loss restricted to the manifold of rank k matrices for some k.

We believe that the extension of the analysis from the gradient flow case to gradient descent is an important step, which turned out to be much more involved than one might initially expect. In fact, there are many works related to the convergence analysis of (stochastic) gradient descent methods in both convex and non-convex situations, see, for instance, [17, 21, 22] and references therein. However, we are not aware of any results that are directly applicable to our setting of deep linear networks (and also not to most nontrivial setups for nonlinear networks). In fact, it is common to assume a loss function with Lipschitz gradient. However, due to the factorization into layers and the unbounded domain, such Lipschitz assumptions are not satisfied here. Note that our analysis shows the boundedness of all the iterates, so that we could, in principle, restrict to a bounded domain, but this boundedness needs to be shown first, which is a major part of this work. Hence, our analysis has to work without such Lipschitz gradient assumptions and, therefore, may be of independent interest. Moreover, the existing gradient flow analysis in [3, 5] does not provide any hint on conditions on the stepsizes that ensure convergence of its discrete counterpart, gradient descent, which is another reason why we believe that our work can be of value.

The difficulties in establishing the extension of the gradient flow analysis to gradient descent are due to the fact that the gradient descent iterations no longer exactly satisfy the invariance property related to balancedness. This property of the gradient flow, however, was heavily used in the convergence proof in [5]. In order to circumvent this problem, we develop an induction argument inspired by the article [11], which covers the significantly simpler special case of two layers. The induction proof tracks, in particular, how much the balancedness condition is perturbed during the iterations. In fact, such perturbations stay bounded under suitable assumptions on the stepsizes. In particular, this allows us to bound all the individual factors in the linear network.

The learning of linear networks is currently also studied in the context of the so-called implicit bias of gradient descent and gradient flows [2, 8, 14, 15, 19, 24, 28, 30]. We expect that the convergence analysis of gradient descent performed in our paper will also be a useful tool for a detailed analysis of the implicit bias of (stochastic) gradient descent in learning deep overparameterized neural networks.

1.1 Relation to previous work

For the scenario of learning deep linear networks, the works [2, 6, 13, 27–29, 31] study the convergence of gradient descent. The authors of [13] provided a guarantee of convergence to global minimizers for gradient descent with random balanced near-zero initialization. Their proof proceeds by transferring the convergence properties of gradient flow to gradient descent. In contrast, based on the Lojasiewicz theorem, we directly prove that gradient descent converges to a critical point of the square loss of deep linear networks. Then we extend the result in [5] by showing that, for almost all initializations, gradient descent converges to a global minimum for networks of depth 2. For three or more layers, we prove that gradient descent converges to a global minimum on a manifold of matrices of a fixed rank. The convergence result in [13] is restricted to a simple scalar regression problem with near-zero initialization and constant stepsize, whereas our result covers the general multivariate case, almost all initializations, and not necessarily constant stepsizes. Under certain conditions, convergence of the stochastic (sub)gradient method to a critical point has been established in [9]. This result requires the subgradient sequence to be bounded and the cost function to be strictly decreasing along any trajectory of the differential inclusion proceeding from a noncritical point. In addition, the authors of [9] comment that the boundedness of the iterates may be enforced by assuming that the constraint set on which the set-valued map is defined is bounded or by a proper choice of a regularizer. In contrast, we do not require these conditions. We rather prove the boundedness of the gradient descent sequence and establish a strong descent condition for this sequence. The authors of [10, 16] address a multivariate regression problem and prove that gradient descent with Gaussian resp. orthogonal random initialization and constant stepsize converges to a global minimum. The result in [16] requires the hidden layer dimension to be greater than the dimension of the input data with orthogonal initialization, and the one in [10] assumes that the hidden layer dimension is greater than the dimension of the output data. Compared to these results, our result is more general in the sense that it does not require these conditions, which exclude some important models such as auto-encoders, where the dimensions of the intermediate layers are commonly smaller than the input and output dimensions. Moreover, our result does not require the initialization to be close enough to a global minimum (as in [2]), and the maximum allowed stepsize in Theorem 2.4 does not decay exponentially with depth (Remark 2.5(b)). In this sense, our theorem is less restrictive.

Our article is structured as follows. Section 2 introduces deep linear networks and gradient descent, recalls the recent results from [5] on gradient flows, and presents our two main results on convergence to a critical point and convergence to a global minimizer for almost all initializations. Section 3 provides the proof of convergence to critical points (in the sense described above), while Sect. 4 is dedicated to the proof of convergence to global minimizers. Finally, Sect. 5 presents numerical experiments illustrating our results.

1.2 Notation

The standard \(\ell _{p}\)-norm on \(\mathbb{R}^{d}\) will be denoted by \(\|x\|_{p} = (\sum_{j=1}^{d} |x_{j}|^{p})^{1/p}\) for \(1 \leq p < \infty \). We write the spectral norm on \(\mathbb{R}^{d \times m}\) as \(\|A\| = \max_{\|x\|_{2}=1} \|A x\|_{2} = \sigma _{\max}(A)\), where \(\sigma _{\max}(A)\) is the largest singular value of A. Moreover, we let \(\sigma _{\min}(A) = \min_{\|x\|_{2}=1} \|Ax\|_{2}\) be the smallest singular value of A. The trace of a matrix A is denoted as \(\operatorname{tr}(A)\), and its Frobenius norm is defined as \(\|A\|_{F} = \sqrt{\operatorname{tr}(A^{T} A)} = \sqrt{ \sum_{j,k} |A_{j,k}|^{2}}\). We will often combine matrices \(W_{1},\ldots ,W_{N}\) into a tuple \(\overrightarrow{W}=(W_{1},\ldots ,W_{N})\). We define the Frobenius inner product of two such tuples \(\overrightarrow{W}\) and \(\overrightarrow{V}\) as \(\langle \overrightarrow{W}, \overrightarrow{V} \rangle _{F} = \sum_{j=1}^{N} \operatorname{tr}(W_{j}^{T} V_{j})\) and the corresponding Frobenius norm as \(\|\overrightarrow{W}\|_{F} = \sqrt{\langle \overrightarrow{W}, \overrightarrow{W}\rangle _{F}} = ( \sum_{j=1}^{N} \|W_{j}\|_{F}^{2} )^{1/2}\). The operator norm of a mapping \(\mathcal{A}\) acting between tuples of matrices will be denoted as \(\| \mathcal{A}\|_{F \to F} = \max_{\|\overrightarrow{W}\|_{F} = 1} \Vert \mathcal{A}(\overrightarrow{W}) \Vert _{F}\). We introduce \([d_{j}]=\{1,2,\dots,d_{j}\}\) with \(d_{j}\in \mathbb{N}\).

2 Linear neural networks and gradient descent analysis

A neural network is a function \(f: \mathbb{R}^{d_{x}} \to \mathbb{R}^{d_{y}}\) of the form

$$ f(x) = f_{W_{1},\ldots ,W_{N},b_{1},\ldots ,b_{N}}(x) = g_{N} \circ g_{N-1} \circ \cdots \circ g_{1}(x), $$

where the so-called layers \(g_{j}:\mathbb{R}^{d_{j-1}} \to \mathbb{R}^{d_{j}}\) are the composition of an affine function with a componentwise activation function, i.e.,

$$ g_{j}(z) = \sigma (W_{j} z + b_{j}),\quad \text{for } W_{j} \in \mathbb{R}^{d_{j} \times d_{j-1}}, b_{j} \in \mathbb{R}^{d_{j}}, $$

where \(\sigma: \mathbb{R}\to \mathbb{R}\) applied to a vector \(w \in \mathbb{R}^{d_{j}}\) acts as \((\sigma (w))_{k} = \sigma (w_{k})\), \(k \in [d_{j}]\). Here, \(d_{0} = d_{x}\) and \(d_{N} = d_{y}\), while \(d_{1},\ldots ,d_{N-1} \in \mathbb{N}\) are some numbers. Prominent examples for activation functions used in deep learning include \(\sigma (t) = \operatorname{ReLU}(t) = \max \{0, t\}\) and \(\sigma (t) = \tanh (t)\), but we will simply choose the identity \(\sigma (t) = t\) in this article.

Learning a neural network \(f=f_{W_{1},\ldots ,W_{N},b_{1},\ldots ,b_{N}}\) consists in adapting the parameters \(W_{j},b_{j}\) based on labeled training data, i.e., pairs \((x_{i},y_{i})\) of input data \(x_{1},\ldots ,x_{m} \in \mathbb{R}^{d_{x}}\) and output data \(y_{1},\ldots ,y_{m} \in \mathbb{R}^{d_{y}}\) in a way that \(f_{W_{1},\ldots ,W_{N},b_{1},\ldots ,b_{N}}(x_{i}) \approx y_{i}\) for \(i \in [m]\). Ideally, the learned neural network f should generalize well to unseen data, i.e., it should predict well the label y corresponding to new input data x. However, we will not discuss this point further in this article.

The learning process is usually performed via optimization. Given a loss function \(\ell: \mathbb{R}^{d_{y}} \times \mathbb{R}^{d_{y}} \to \mathbb{R}_{+}\) (usually satisfying \(\ell (y,y) = 0\)), one aims at minimizing the empirical risk function

$$ \mathcal{L}(W_{1},\ldots ,W_{N},b_{1}, \ldots ,b_{N}) = \sum_{i=1}^{m} \ell \bigl(f_{W_{1},\ldots ,W_{N},b_{1},\ldots ,b_{N}}(x_{i}),y_{i}\bigr) $$

with respect to the parameters \(W_{1},\ldots ,W_{N},b_{1},\ldots ,b_{N}\). Gradient descent and stochastic gradient descent algorithms are most commonly used for this task. Convergence analysis of these algorithms is challenging in general since, due to the compositional nature of neural networks, the function \(\mathcal{L}\) is not convex in general.

Due to this difficulty, we reduce to the special case of linear neural networks in this article, i.e., we assume that \(\sigma (t) = t\) is the identity and that \(b_{j} = 0\) for all j. Consequently, a linear neural network takes the form

$$ f(x) = f_{W_{1},\ldots ,W_{N}}(x)= W_{N} \cdots W_{1} x = W x,\quad \text{where } W = W_{N} \cdot W_{N-1} \cdots W_{1}. $$

While linear networks may not be expressive enough for many applications, convergence properties of gradient descent applied to learning linear neural networks are still nontrivial to understand. We will concentrate on the square-loss \(\ell (z,w) = \frac{1}{2} \|z-w\|_{2}^{2}\) here, so that our learning problem consists in minimizing

$$ L^{N}(W_{1},\dots, W_{N})= \frac{1}{2} \sum _{i=1}^{m} \Vert y_{i} - W_{N} \cdots W_{1} x_{i} \Vert _{2}^{2} = \frac{1}{2} \Vert Y-W_{N}\cdots W_{1} X \Vert ^{2}_{F}, $$

where the data matrix \(X \in \mathbb{R}^{d_{x} \times m}\) contains the data points \(x_{i} \in \mathbb{R}^{d_{x}}\), \(i=1,\ldots ,m\) as columns and likewise the matrix \(Y \in \mathbb{R}^{d_{y} \times m}\) contains the label points \(y_{i} \in \mathbb{R}^{d_{y}}\), \(i=1,\ldots ,m\). The function \(L^{1}\) is given by

$$ L^{1}(W) = \frac{1}{2} \Vert Y - W X \Vert _{F}^{2}. $$

Note that the rank of the matrix \(W = W_{N} \cdots W_{1}\) is at most \(r:= \min_{i=0,\ldots ,N} d_{i}\), which is strictly smaller than \(\min \{d_{x},d_{y}\}\) if one of the “hidden” dimensions \(d_{i}\) is smaller than this number. Hence, we can also view the learning problem as one of minimizing \(L^{1}(W)\) under the constraint \(\operatorname{rank}(W) \leq r\). Instead of directly minimizing over W, we choose an overparameterized representation \(W=W_{N} \cdots W_{1}\) and consider gradient descent with respect to each factor \(W_{i}\). While overparameterization seems to be a waste of resources at first sight, it also has certain advantages as it can even accelerate convergence [4] (at least for \(\ell _{p}\)-losses with \(p > 2\)) or lead to solutions with better generalization properties [30]. Moreover, we expect that understanding the theory of overparameterization in linear neural networks will also give insights into overparameterization in nonlinear networks, which is widely used in practice. While the speed of convergence and the implicit bias are certainly of interest in their own right, we will not delve into these topics but rather concentrate on mere convergence here.

We consider gradient descent for the loss function \(L^{N}\) with stepsizes \(\eta _{k}\), i.e.,

$$ W_{j}(k+1)=W_{j}(k)-\eta _{k} \nabla _{W_{j}}L^{N}\bigl(W_{1}(k), \dots,W_{N}(k)\bigr). $$
(2.1)

We further define the matrix W at each iteration k by

$$ W(k)=W_{N}(k)\cdots W_{1}(k). $$
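
As an illustration, the iteration (2.1) can be implemented in a few lines of NumPy. The following is a minimal sketch (the function names grad_LN and gradient_descent are ours), with the gradients computed via the explicit formulas (3.1)-(3.2) recalled in Sect. 3.

```python
import numpy as np

def grad_LN(Ws, X, Y):
    """Gradients of L^N with respect to each factor, cf. (3.1)-(3.2) in Sect. 3.
    Ws = [W_1, ..., W_N], where W_j has shape (d_j, d_{j-1})."""
    N = len(Ws)
    W = Ws[0]
    for M in Ws[1:]:
        W = M @ W                          # product matrix W = W_N ... W_1
    G = W @ X @ X.T - Y @ X.T              # nabla L^1(W), cf. (3.1)
    grads = []
    for i in range(N):
        left = np.eye(Ws[i].shape[0])      # product of the factors above the current one (empty product = Id)
        for M in Ws[i + 1:]:
            left = M @ left
        right = np.eye(Ws[0].shape[1])     # product of the factors below the current one (empty product = Id)
        for M in Ws[:i]:
            right = M @ right
        grads.append(left.T @ G @ right.T)  # cf. (3.2)
    return grads

def gradient_descent(Ws, X, Y, stepsizes):
    """Gradient descent iterations (2.1) for a given sequence of stepsizes eta_k."""
    Ws = [W.copy() for W in Ws]
    for eta in stepsizes:
        grads = grad_LN(Ws, X, Y)
        Ws = [W - eta * G for W, G in zip(Ws, grads)]
    return Ws
```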

Before discussing gradient descent itself, let us recall previous results for the related gradient flow, which will guide the intuition for the analysis in this paper.

2.1 Gradient flow analysis

The gradient flow \(\overrightarrow{W}(t) = (W_{1}(t),\ldots ,W_{N}(t))\), \(t \in \mathbb{R}_{+}\) for the function \(L^{N}\) is defined via the differential equation

$$ \frac{d}{dt} W_{j}(t) = - \nabla _{W_{j}} L^{N}\bigl(W_{1}(t),\ldots ,W_{N}(t)\bigr),\qquad W_{j}(0) = W_{j,0},\quad j = 1,\ldots ,N, $$
(2.2)

for some initial matrices \(W_{j,0} \in \mathbb{R}^{d_{j} \times d_{j-1}}\). This flow represents the continuous analog of the gradient descent algorithm and has been analyzed in [2, 3, 5, 7].

An important invariance property of the gradient flow (2.2) consists in the fact that the differences

$$ W_{j+1}^{T}(t)W_{j+1}(t) - W_{j}(t) W_{j}^{T}(t),\quad j=1,\ldots ,N $$
(2.3)

are constant in time, see [2, 3, 5, 7]. This motivates calling a tuple \(\overrightarrow{W} = (W_{1},\ldots ,W_{N})\) balanced if

$$ W_{j+1}^{T} W_{j+1} = W_{j} W_{j}^{T} \quad\text{for all } j =1, \ldots ,N. $$
(2.4)

If \(\overrightarrow{W}(0) = (W_{1,0},\ldots ,W_{N,0})\) is balanced, then \(\overrightarrow{W}(t)\) is balanced for all \(t \in \mathbb{R}_{+}\) as a consequence of the invariance property. Note that by taking the trace on both sides of (2.4), we see that balancedness implies \(\|W_{j}\|_{F} = \|W_{1}\|_{F}\) for all \(j=1,\ldots ,N\).
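
For the reader's convenience, this invariance can be verified directly from the gradient formula (3.2) recalled in Sect. 3: abbreviating \(G = \nabla L^{1}(W(t))\) and suppressing the time argument, (2.2) gives

$$\begin{aligned} \frac{d}{dt} \bigl(W_{j+1}^{T} W_{j+1}\bigr) &= -\bigl(\nabla _{W_{j+1}} L^{N}\bigr)^{T} W_{j+1} - W_{j+1}^{T} \nabla _{W_{j+1}} L^{N} \\ &= -(W_{j} \cdots W_{1})\, G^{T}\, (W_{N} \cdots W_{j+1}) - (W_{N} \cdots W_{j+1})^{T}\, G\, (W_{j} \cdots W_{1})^{T}, \end{aligned}$$

and the same expression is obtained for \(\frac{d}{dt}(W_{j} W_{j}^{T}) = -\nabla _{W_{j}} L^{N}\, W_{j}^{T} - W_{j}\, (\nabla _{W_{j}} L^{N})^{T}\), so the differences in (2.3) are indeed constant in time.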

It is useful to introduce the “end-to-end” matrix \(W(t) = W_{N}(t) \cdots W_{1}(t)\), which describes the action of the resulting network and is the object of main interest. It was shown in [3] that if the initial tuple \(\overrightarrow{W}(0)\) (and hence \(\overrightarrow{W}(t)\) for any \(t \geq 0\)) is balanced then the dynamics of \(W(t)\) can be described without making use of the individual matrices \(W_{j}(t)\). More precisely, it satisfies the differential equation

$$ \frac{d}{dt} W(t) = - \mathcal{A}_{W(t)} \bigl( \nabla L^{1}\bigl(W(t)\bigr) \bigr), $$
(2.5)

where \(\mathcal{A}_{W}: \mathbb{R}^{d_{y} \times d_{x}} \to \mathbb{R}^{d_{y} \times d_{x}}\) is the linear map

$$ \mathcal{A}_{W}(Z) = \sum_{j=1}^{N} \bigl(W W^{T}\bigr)^{\frac{N-j}{N}} \cdot Z \cdot \bigl(W^{T} W\bigr)^{\frac{j-1}{N}}. $$
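
For illustration, the map \(\mathcal{A}_{W}\) can be evaluated numerically as in the following minimal sketch (the helper names psd_power and A_W are ours); the fractional powers of the positive semidefinite matrices \(WW^{T}\) and \(W^{T}W\) are computed via an eigendecomposition.

```python
import numpy as np

def psd_power(S, p):
    """Fractional power S^p of a symmetric positive semidefinite matrix S."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)        # guard against tiny negative eigenvalues
    return (vecs * vals ** p) @ vecs.T

def A_W(W, Z, N):
    """Apply the linear map A_W above to Z (Z has the same shape as W = W_N ... W_1)."""
    left, right = W @ W.T, W.T @ W
    out = np.zeros_like(Z, dtype=float)
    for j in range(1, N + 1):
        out += psd_power(left, (N - j) / N) @ Z @ psd_power(right, (j - 1) / N)
    return out
```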

One feature of the flow in (2.5), see [5, Theorem 4.5], is that the rank of \(W(t)\) is constant in t, i.e., if \(W(0) = W_{N}(0) \cdots W_{1}(0)\) has rank r, then \(W(t)\) stays in the manifold of rank r matrices for all \(t \geq 0\) (but note that the rank may drop in the limit). This property may fail for non-balanced initializations [5, Remark 4.2]. Another interesting observation (which, however, will not be important in our article) is that (2.5) can be interpreted as a Riemannian gradient flow with respect to an appropriately defined Riemannian metric on the manifold of rank r matrices, see [5] for all the details.

The convergence properties of the gradient flow (2.2) (in both the unbalanced and balanced case) can be summarized in the following theorems. The first one from [5, Theorem 3.2] significantly generalizes the main result of [7].

Theorem 2.1

[5, Theorem 3.2] Assume that \(XX^{T}\) has full rank. Then, the flow \(\overrightarrow{W}(t)\) defined by (2.2) is defined and bounded for all \(t \geq 0\) and converges to a critical point of \(L^{N}\) as \(t \to \infty \).

This result is shown via the Lojasiewicz theorem [1], which, in turn, requires showing the boundedness of all components \(W_{i}(t)\) of \(\overrightarrow{W}(t)\). While the boundedness is straightforward to show for \(W(t)\), it is a nontrivial property of the \(W_{i}(t)\). In fact, the proof exploits the invariance of the differences in (2.3).

While convergence to a critical point is nice to have, we would like to obtain more information about the type of critical point, i.e., whether it is a global or local minimum or merely a saddle point. Note that the function \(L^{N}\) built from the square loss has the nice (but rare) property that a local minimum is automatically a global minimum [18, 26]. This means that we only need to single out saddle points. Also, observe that we cannot expect convergence to a global minimizer for every initialization: the flow does not move when initialized at a critical point, so initializing at a critical point that is not a global minimizer cannot lead to a global minimizer. The following result, valid for almost all initializations, was derived in [5, Theorem 6.12]. In order to state it, we need to introduce the matrix

$$ Q = Y X^{T} \bigl(X X^{T} \bigr)^{-1/2}, $$
(2.6)

assuming that \(XX^{T}\) has full rank.

Theorem 2.2

[5, Theorem 6.12] Assume that \(XX^{T}\) has full rank, let \(q = \operatorname{rank}(Q)\), \(r = \min_{j=0,\ldots ,N} d_{j}\) and \(\bar{r} = \min \{q,r\}\) where Q is the matrix defined in (2.6).

  1. (a)

    For almost all initializations \(\overrightarrow{W}(0)\), the flow (2.2) converges to a critical point \(\overrightarrow{W}^{*}=(W_{1}^{*},\ldots ,W_{N}^{*})\) of \(L^{N}\) such that \(W^{*}:= W_{N}^{*} \cdots W_{1}^{*}\) is a global minimizer of \(L^{1}\) on the manifold of matrices of fixed rank k for some \(0 \leq k \leq \bar{r}\).

  2. (b)

    If \(N=2\), then for almost all initial values \(W_{1}(0),\ldots ,W_{N}(0)\), the flow converges to a global minimizer of \(L^{N}\) on \(\mathbb{R}^{d_{0} \times d_{1}} \times \cdots \times \mathbb{R}^{d_{N-1} \times d_{N}}\).

We conjecture that the statement in part (b) also holds for \(N \geq 3\), or in other words, that we can always choose the maximal possible rank \(k = \bar{r}\) in (a), but unfortunately, the proof method employed in [5] is not able to deliver this extension without making significant adaptations. In fact, the proof relies on an abstract result, see [20] and [5, Theorem 6.3], which states that for almost all initializations, so-called strict saddle points are avoided as limits. Unfortunately, if \(N \geq 3\), then minimizers of \(L^{1}\) restricted to the manifold of matrices of rank \(k < \bar{r}\) may correspond to non-strict saddle points of \(L^{N}\), see [18] and [5, Proposition 6.10], so that the abstract result does not apply to these points.
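
For concreteness, the matrix Q from (2.6) and the rank quantities q, r, and \(\bar{r}\) appearing in Theorem 2.2 can be computed numerically as in the following minimal sketch (the function name Q_matrix is ours).

```python
import numpy as np

def Q_matrix(X, Y):
    """Q = Y X^T (X X^T)^(-1/2), cf. (2.6); requires X X^T to have full rank."""
    vals, vecs = np.linalg.eigh(X @ X.T)
    inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T     # (X X^T)^(-1/2)
    return Y @ X.T @ inv_sqrt

# rank quantities of Theorem 2.2, assuming data X, Y and dims = [d_0, ..., d_N] are given:
# q    = np.linalg.matrix_rank(Q_matrix(X, Y))
# rbar = min(q, min(dims))
```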

2.2 Gradient descent analysis

Our main goal is to extend Theorems 2.1 and 2.2 from gradient flow (2.2) to gradient descent (2.1). The balancedness, or more generally, the invariance property, see (2.3), does not appear explicitly in the statements of these theorems for gradient flow, although the invariance property is key in showing boundedness of the flow in the proof of Theorem 2.1. It turns out that balancedness does play an explicit role in the conditions for the stepsizes ensuring convergence. Unfortunately, the invariance of the differences in (2.3) does not carry over to the iterations of gradient descent, which prevents directly following the proof strategy of [5] for showing the boundedness of the iterates. Nevertheless, we will prove that under suitable conditions on the stepsizes, the differences in (2.3) will stay bounded in norm, which then allows us to show the boundedness of the components \(W_{j}(k)\) of \(\overrightarrow{W}(k)\) and to apply Lojasiewicz’ theorem to show convergence to a critical point.

In order to state our main results, we introduce the following definition.

Definition 2.3

We say that a tuple \(\overrightarrow{W} = (W_{1},\ldots ,W_{N})\) has balancedness constant \(\delta \geq 0\) if

$$\begin{aligned} \bigl\Vert W^{T}_{j+1} W_{j+1} -W_{j}W_{j}^{T} \bigr\Vert &\leq \delta \quad\text{for all } j=1,\dots, N-1. \end{aligned}$$
(2.7)

Obviously, (2.7) quantifies how much the tuple \(\overrightarrow{W}\) deviates from being balanced, measured in the spectral norm. Note that the authors of [2] introduced a very similar notion and called \(\overrightarrow{W} = (W_{1},\ldots ,W_{N})\) δ-balanced if (2.7) holds with the spectral norm replaced by the Frobenius norm.
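
The balancedness constant of a tuple, i.e., the smallest δ for which (2.7) holds, can be computed numerically as in the following minimal sketch (the function name is ours).

```python
import numpy as np

def balancedness_constant(Ws):
    """Smallest delta for which (2.7) holds, i.e., the largest spectral norm of
    W_{j+1}^T W_{j+1} - W_j W_j^T over j = 1, ..., N-1 (Ws = [W_1, ..., W_N])."""
    return max(np.linalg.norm(Wn.T @ Wn - Wp @ Wp.T, 2)
               for Wp, Wn in zip(Ws[:-1], Ws[1:]))
```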

The following Theorem 2.4 states that gradient descent with an approximately balanced initialization converges to a critical point of \(L^{N}\), provided the stepsizes satisfy suitable conditions.

Theorem 2.4

Let \(X \in \mathbb{R}^{d_{x} \times m}, Y \in \mathbb{R}^{d_{y} \times m}\) be data matrices such that \(X X^{T}\) is of full rank. Suppose that the initialization \(\overrightarrow{W}(0)\) of the gradient descent iterations (2.1) has balancedness constant αδ for some \(\delta > 0\) and \(\alpha \in [0,1)\). Assume that the stepsizes \(\eta _{k} > 0\) satisfy \(\sum_{k=0}^{\infty }\eta _{k} = \infty \) and

$$ \eta _{k} \leq \frac{2(1-\alpha ) \delta}{4 L^{N}(\overrightarrow{W}(0)) + (1-\alpha ) \delta B_{\delta}} \quad\textit{for all } k \in \mathbb{N}_{0}, $$
(2.8)

where

$$\begin{aligned} &B_{\delta}:= 2e N K_{\delta}^{N-1} \Vert X \Vert ^{2} + \sqrt{e} N K_{\delta}^{ \frac{N}{2}-1} \bigl\Vert XY^{T} \bigr\Vert , \end{aligned}$$
(2.9)
$$\begin{aligned} &K_{\delta }:= M^{\frac{2}{N}} + (N+1)^{2} \delta, \end{aligned}$$
(2.10)
$$\begin{aligned} &M:= \frac{\sqrt{2L^{N} (\overrightarrow{W}(0) )}+ \Vert Y \Vert }{\sigma _{\min}{ (X )}} = \frac{\sqrt{2} \Vert Y - W_{N}(0) \cdots W_{1}(0) X \Vert _{F}+ \Vert Y \Vert }{\sigma _{\min}{ (X )}}. \end{aligned}$$
(2.11)

Then, the sequence \(\overrightarrow{W}(k)\) converges to a critical point of \(L^{N}\).

The theorem regarding convergence to a critical point of \(L^{N}\) stated above will be proven in Sect. 3.

Remark 2.5

  1. (a)

    If \(\overrightarrow{W}(0)\) is balanced, i.e., has balancedness constant 0, we can choose \(\alpha = 0\) above. Then, for any \(\delta > 0\), choosing the stepsizes \(\eta _{k}\) such that (3.14) below is satisfied ensures convergence to a critical point and that all the iterates \(\overrightarrow{W}(k)\), \(k \in \mathbb{N}\), have balancedness constant δ, see Proposition 3.4. This latter property will be a crucial ingredient for the proof of the theorem.

  2. (b)

    Intuitively, the stepsizes \(\eta _{k}\) should be chosen as large as possible in order to have fast convergence in practice, while it does not seem to be crucial to have the balancedness constant δ as small as possible during the iterations. This suggests maximizing the right-hand side of (2.8) with respect to δ in order to make the condition on the stepsizes as weak as possible. While analytical maximization seems difficult, this may be done numerically in practice. A reasonably good choice for δ seems to be

    $$ \delta = \frac{1}{N(N+1)^{2}} M^{\frac{2}{N}}. $$

    Then, \(K_{\delta }= (1+ \frac{1}{N} ) M^{\frac{2}{N}}\) so that \(K_{\delta}^{N} \leq e M^{2}\), where e denotes Euler’s number. Since \(2L^{N}(\overrightarrow{W}(0)) \leq \sigma ^{2}_{\min}(X) M^{2}\), Condition (2.8) is then satisfied if

    $$ \eta _{k} \leq \frac{2}{2(1-\alpha )^{-1}N(N+1)^{2} M^{2-\frac{2}{N}} \sigma _{\min}^{2}(X) + 2e^{2-\frac{1}{N}}N M^{2-\frac{2}{N}} \Vert X \Vert ^{2} + e^{1-\frac{1}{N}} N M^{1-\frac{2}{N}} \Vert XY^{T} \Vert }. $$

    For a network of depth N, this means that δ is of the order \(\delta =\mathcal{O}(N^{-2})\), and the stepsizes are required to be of order \(\eta = \mathcal{O}(N^{-3})\). A numerical sketch evaluating the resulting stepsize bound (2.8) is given after this remark.

  3. (c)

    The stepsizes \(\eta _{k}\) in the theorem can be chosen a priori, for instance, \(\eta _{k}=\eta \) (constant stepsize), or \(\eta _{k} = c k^{-\alpha}\) for some \(\alpha \in [0,1)\), or adaptively, i.e., depending on the current iterate \(\overrightarrow{W}(k)\), as long as the stepsize condition (2.8) is satisfied. In practice, it seems that a large constant stepsize leads to the best performance in terms of convergence speed.
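
To make the interplay of the quantities in (2.8)-(2.11) concrete, the following minimal sketch (the function name stepsize_bound and the default arguments are ours) evaluates the right-hand side of the stepsize condition (2.8) for a given initialization, using the choice of δ from Remark 2.5(b) unless δ is specified.

```python
import numpy as np

def stepsize_bound(Ws0, X, Y, alpha=0.0, delta=None):
    """Right-hand side of the stepsize condition (2.8), assembled from (2.9)-(2.11).
    If delta is None, the choice from Remark 2.5(b) is used."""
    N = len(Ws0)
    W0 = Ws0[0]
    for V in Ws0[1:]:
        W0 = V @ W0                                        # W(0) = W_N(0) ... W_1(0)
    L0 = 0.5 * np.linalg.norm(Y - W0 @ X, 'fro') ** 2      # L^N(W(0))
    sigma_min = np.linalg.svd(X, compute_uv=False).min()   # smallest singular value of X
    M = (np.sqrt(2 * L0) + np.linalg.norm(Y, 2)) / sigma_min                   # (2.11)
    if delta is None:
        delta = M ** (2 / N) / (N * (N + 1) ** 2)
    K = M ** (2 / N) + (N + 1) ** 2 * delta                                    # K_delta, (2.10)
    B = (2 * np.e * N * K ** (N - 1) * np.linalg.norm(X, 2) ** 2
         + np.sqrt(np.e) * N * K ** (N / 2 - 1) * np.linalg.norm(X @ Y.T, 2))  # B_delta, (2.9)
    return 2 * (1 - alpha) * delta / (4 * L0 + (1 - alpha) * delta * B)        # (2.8)
```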

Of course, more information on the type of critical point to which \(\overrightarrow{W}(k)\) converges is desirable. Our next theorem states the analog of Theorem 2.2: essentially, convergence is towards a global minimizer for almost all initializations. Since Condition (3.14) on the stepsizes \(\eta _{k}\) ensuring mere convergence to a critical point depends on the initialization \(\overrightarrow{W}(0)\), we can only expect to state a result for almost all initializations for sets of tuples \(\overrightarrow{W}\) of matrices for which the balancedness constant δ and the constant M in (3.14) have a uniform upper bound. Consequently, we choose \(\mathcal {B}\subset \mathbb{R}^{d_{0}\times d_{1}}\times \cdots \times \mathbb{R}^{d_{N-1}\times d_{N}}\) to be bounded and let

$$\begin{aligned} & \delta _{\mathcal {B}}= \sup_{\overrightarrow{W}\in \mathcal {B}} \max _{j=1,\ldots ,N-1} \bigl\Vert W_{j+1}^{T} W_{j+1} - W_{j} W_{j}^{T} \bigr\Vert , \end{aligned}$$
(2.12)
$$\begin{aligned} &L_{\mathcal{B}} = \sup_{\overrightarrow{W}\in \mathcal {B}} L^{N}( \overrightarrow{W}),\qquad M_{\mathcal {B}} = \bigl(\sqrt{2 L_{\mathcal{B}}} + \Vert Y \Vert \bigr)\sigma ^{-1}_{\min}{ (X )}. \end{aligned}$$
(2.13)

Note that \(\delta _{\mathcal{B}}\) and \(M_{\mathcal {B}}\) are finite (assuming that \(XX^{T}\) has full rank) since \(L^{N}\) is continuous. Let us also recall the definition of the matrix \(Q = YX^{T} (XX^{T})^{-1/2}\) in (2.6).

Theorem 2.6

Let \(\mathcal {B}\subset \mathbb{R}^{d_{0}\times d_{1}}\times \cdots \times \mathbb{R}^{d_{N-1}\times d_{N}}\) be a bounded set with constants \(\delta _{\mathcal {B}} \leq \alpha \delta \) as in (2.12) for some \(\delta > 0\) and \(\alpha \in [0,1)\) and \(L_{\mathcal{B}}\), \(M_{\mathcal {B}}\) defined by (2.13). Let \(q=\operatorname{rank}(Q)\), \(r=\min \{d_{0},\dots,d_{N}\}\) and \(\bar{r}=\min \{q,r\}\), and let \((\eta _{k})_{k\in \mathbb{N}_{0}}\) be a sequence of positive stepsizes such that

$$ \eta _{k} \leq \frac{2(1-\alpha ) \delta}{4 L_{\mathcal{B}} + (1-\alpha ) \delta B_{\delta}} \quad\textit{for all } k \in \mathbb{N}_{0}, $$
(2.14)

where

$$\begin{aligned} K_{\delta }:= M_{\mathcal {B}}^{\frac{2}{N}} + (N+1)^{2} \delta,\qquad B_{\delta}:= 2e N K_{\delta}^{N-1} \Vert X \Vert ^{2} + \sqrt{e} N K_{ \delta}^{\frac{N}{2}-1} \bigl\Vert XY^{T} \bigr\Vert . \end{aligned}$$

Assume that additionally one of the following conditions is satisfied.

  1. (1)

    The sequence \((\eta _{k})\) is constant, i.e., \(\eta _{k} = \eta \) for some \(\eta > 0\) for all \(k \in \mathbb{N}\).

  2. (2)

    It holds

    $$ \eta _{k} \geq C \frac{1}{k} \quad\textit{for some } C > 0 \quad\textit{and}\quad \lim_{k \to \infty} \eta _{k} = 0. $$

Then, the following statements hold.

  1. (a)

    For almost all initializations \(\overrightarrow{W}(0) =(W_{1}(0),\dots, W_{N}(0))\in \mathcal {B}\), gradient descent (2.1) with stepsizes \(\eta _{k}\) converges to a critical point \(\overrightarrow{W}\) of \(L^{N}\) such that \(W=W_{N}\cdots W_{1}\) is a global minimum of \(L^{1}\) on the manifold \(\mathcal{M}_{k}\) of matrices of rank \(k=\operatorname{rank}(W)\in \{0,1,\dots,\bar{r}\}\) on \(\mathbb{R}^{d_{N}\times d_{0}}\).

  2. (b)

    For \(N=2\), gradient descent (2.1) converges to a global minimum of \(L^{N}\) on \(\mathbb{R}^{d_{0}\times d_{1}}\times \mathbb{R}^{d_{1}\times d_{2}}\) for almost all \(\overrightarrow{W}(0)=(W_{1}(0), W_{2}(0))\in \mathcal {B}\).

The proof of the global convergence theorem stated above can be found in Sect. 4.

Similar to Theorem 2.2, we conjecture that part (b) extends to \(N \geq 3\) or equivalently that part (a) holds with \(k = \bar{r}\). As for Theorem 2.2, the current proof method based on a strict saddle point analysis cannot be extended to show this conjecture.

It is currently not clear whether the theorem holds under more general assumptions on the stepsizes \(\eta _{k}\), i.e., whether it is necessary that one of the two additional conditions on \(\eta _{k}\) holds. The current proof can only handle these two cases, for which corresponding abstract results are available, see [20, 23]. It seems crucial for these general results that the stepsizes are chosen a priori and independently of the choice of \(\overrightarrow{W}(0)\) (or the further iterates). In particular, adaptive stepsize choices are not covered by our theorem. We note that the bounds on the stepsizes are reasonable for practical purposes. In particular, the stepsize choices in our numerical experiments meet these bounds.

3 Convergence to critical points

We will prove Theorem 2.4 in this section. For a tuple \(\overrightarrow{W} = (W_{1},\ldots ,W_{N})\), we will always denote the corresponding product matrix by

$$ W = W_{N} \cdots W_{1}, $$

and similarly, we denote by \(W(k)=W_{N}(k) \cdots W_{1}(k)\) the sequence of product matrices associated to a sequence \(\overrightarrow{W}(k) = (W_{1}(k),\ldots ,W_{N}(k))\), \(k \in \mathbb{N}_{0}\). We recall from [2, 3, 5, 7] that

$$\begin{aligned} &\nabla L^{1} (W )=WXX^{T}-YX^{T}, \end{aligned}$$
(3.1)
$$\begin{aligned} &\nabla _{W_{j}} L^{N}(W_{1},\ldots , W_{N})=W^{T}_{j+1}\cdots W^{T}_{N} \nabla L^{1} (W ) W^{T}_{1}\cdots W^{T}_{j-1}. \end{aligned}$$
(3.2)
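
As a quick sanity check of (3.1) and (3.2), the following sketch (all names and the random test data are ours) compares the explicit gradient with central finite differences of \(L^{N}\) on small random matrices.

```python
import numpy as np

def LN(Ws, X, Y):
    """Square loss L^N(W_1, ..., W_N)."""
    W = Ws[0]
    for M in Ws[1:]:
        W = M @ W
    return 0.5 * np.linalg.norm(Y - W @ X, 'fro') ** 2

def grad_Wj(Ws, X, Y, j):
    """Gradient (3.2) with respect to the factor Ws[j] (0-based index)."""
    W = Ws[0]
    for M in Ws[1:]:
        W = M @ W
    G = W @ X @ X.T - Y @ X.T                     # nabla L^1(W), cf. (3.1)
    left = np.eye(Ws[j].shape[0])
    for M in Ws[j + 1:]:
        left = M @ left                           # product of the factors above Ws[j]
    right = np.eye(Ws[0].shape[1])
    for M in Ws[:j]:
        right = M @ right                         # product of the factors below Ws[j]
    return left.T @ G @ right.T

# finite-difference check on small random data (dimensions are arbitrary choices)
rng = np.random.default_rng(0)
dims = [4, 3, 5, 2]                               # d_0, ..., d_N
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
X, Y = rng.standard_normal((dims[0], 7)), rng.standard_normal((dims[-1], 7))
j, eps = 1, 1e-6
fd = np.zeros_like(Ws[j])
for p, q in np.ndindex(*Ws[j].shape):
    E = np.zeros_like(Ws[j]); E[p, q] = eps
    Wp = [W + (E if i == j else 0) for i, W in enumerate(Ws)]
    Wm = [W - (E if i == j else 0) for i, W in enumerate(Ws)]
    fd[p, q] = (LN(Wp, X, Y) - LN(Wm, X, Y)) / (2 * eps)
print(np.abs(grad_Wj(Ws, X, Y, j) - fd).max())    # should be close to zero (up to finite-difference error)
```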

3.1 Auxiliary bounds

We start with a useful bound for \(\|W\|\) in terms of \(L^{1}(W)\).

Lemma 3.1

Assume that \(XX^{T}\) has full rank. Then, \(W \in \mathbb{R}^{d_{y} \times d_{x}}\) satisfies

$$ \Vert W \Vert \leq \bigl( \Vert Y - W X \Vert + \Vert Y \Vert \bigr) \sigma _{\min}^{-1}(X) \leq \bigl( \sqrt{2 L^{1}(W)} + \Vert Y \Vert \bigr)\sigma _{\min}^{-1}(X). $$
(3.3)

Consequently, if \(L^{N}(\overrightarrow{W}(k)) \leq L^{N}(\overrightarrow{W}(0))\), then

$$ \bigl\Vert W(k) \bigr\Vert = \bigl\Vert W_{N}(k) \cdots W_{1}(k) \bigr\Vert \leq \bigl(\sqrt{2 L^{N}\bigl( \overrightarrow{W}(0)\bigr)} + \Vert Y \Vert \bigr)\sigma _{\min}^{-1}(X). $$

Furthermore,

$$ \bigl\Vert \nabla L^{1}(W) \bigr\Vert \leq \Vert WX - Y \Vert \Vert X \Vert \leq \sqrt{2 L^{1}(W)} \Vert X \Vert . $$
(3.4)

Proof

Arguing similarly to the proof of [5, Theorem 3.2] gives

$$\begin{aligned} \Vert W \Vert & = \bigl\Vert W XX^{T}\bigl(XX^{T} \bigr)^{-1} \bigr\Vert \leq \Vert WX \Vert \bigl\Vert X^{T}\bigl(XX^{T}\bigr)^{-1} \bigr\Vert \leq \bigl( \Vert Y-WX \Vert + \Vert Y \Vert \bigr) \sigma _{\min}^{-1}(X) \\ & \leq \bigl( \Vert Y-WX \Vert _{F} + \Vert Y \Vert \bigr) \sigma _{\min}^{-1}(X) = \bigl(\sqrt{2 L^{1}(W)} + \Vert Y \Vert \bigr) \sigma _{\min}^{-1}(X). \end{aligned}$$

The second claim follows then as an easy consequence recalling that \(L^{1}(W(k)) = L^{N}(\overrightarrow{W}(k))\).

For the third claim, we use the explicit formula (3.1) for the gradient of \(L^{1}\) to conclude that

$$ \bigl\Vert \nabla L^{1}(W) \bigr\Vert = \bigl\Vert WXX^{T} - YX^{T} \bigr\Vert \leq \Vert WX- Y \Vert \bigl\Vert X^{T} \bigr\Vert \leq \Vert WX - Y \Vert _{F} \Vert X \Vert = \sqrt{2 L^{1}(W)} \Vert X \Vert . $$

This completes the proof. □

A crucial ingredient in our proof is to show the boundedness of all matrices \(W_{j}(k)\), \(k \in \mathbb{N}_{0}\). While boundedness for the product \(W(k)=W_{N}(k) \cdots W_{1}(k)\) follows easily from the previous lemma, it does not immediately imply boundedness of all the factors \(W_{j}(k)\). For instance, multiplying one factor \(W_{j}(k)\) by a constant \(\alpha > 0\) and another factor \(W_{\ell}(k)\) by \(\alpha ^{-1}\) leaves the product \(W(k)\) invariant but changes the norm of \(W_{j}(k)\) and \(W_{\ell}(k)\). In particular, letting \(\alpha \to \infty \) shows that a bound for \(W(k)\) alone does not imply boundedness for \(W_{j}(k)\), \(k \in \mathbb{N}_{0}\). This is where the balancedness comes in. In particular, if a tuple \(\overrightarrow{W}=(W_{1},\ldots ,W_{N})\) has balancedness constant \(\delta \geq 0\), then we can bound \(\|W_{j}\|\), \(j=1,\ldots ,N\), by an expression (continuously) depending on \(\|W\|\). This is the essence of the next statement.

Proposition 3.2

Let \(\overrightarrow{W}=(W_{1},\ldots ,W_{N}) \in \mathbb{R}^{d_{0} \times d_{1}}\times \cdots \times \mathbb{R}^{d_{N-1}\times d_{N}}\) with balancedness constant \(\delta \geq 0\), and let \(W= W_{N} \cdots W_{1}\). Then,

$$ \Vert W_{j} \Vert ^{2} \leq \Vert W \Vert ^{\frac{2}{N}} + (N+1)^{2} \delta \quad\textit{for all } j = 1,\ldots , N. $$

Remark 3.3

With a significantly longer proof, one can improve this result to

$$ \Vert W_{j} \Vert ^{2} \leq \Vert W \Vert ^{\frac{2}{N}} + N^{2} \delta \quad\text{for all } j = 1,\ldots , N. $$

However, since this does not significantly improve our results, we decided to present the slightly weaker bound in order to keep the proof short.

Proof

We will first prove that

$$ \Vert W_{1} \Vert ^{2N} \leq \Vert W \Vert ^{2} + Q_{N,\delta}\bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr), $$
(3.5)

where \(Q_{N,\delta}\) is the polynomial of degree \(N-1\) defined as

$$ Q_{N,\delta}(x) = x(x+\delta ) (x+2\delta ) \cdots \bigl(x+(N-1)\delta \bigr) - x^{N}. $$

In order to prove this claim, we let \(D_{j}:= W_{j-1} W_{j-1}^{T} - W_{j}^{T} W_{j}\) for \(j=2,\ldots ,N\) and note that \(\|D_{j}\| \leq \delta \) by assumption. Moreover,

$$ \Vert W_{j} \Vert ^{2} = \bigl\Vert W_{j}^{T} W_{j} \bigr\Vert = \bigl\Vert W_{j-1} W_{j-1}^{T} - D_{j} \bigr\Vert \leq \Vert W_{j-1} \Vert ^{2} + \delta, \quad\text{for all } j=2,\ldots ,N, $$
(3.6)

and consequently

$$ \Vert W_{j} \Vert ^{2} \leq \Vert W_{1} \Vert ^{2} + (j-1)\delta \quad\text{for } j=1, \ldots ,N. $$
(3.7)

We observe that by basic properties of the spectral norm

$$\begin{aligned} \begin{aligned} \Vert W_{1} \Vert ^{2N} ={}& \bigl\Vert \bigl(W_{1}^{T} W_{1}\bigr)^{N} \bigr\Vert = \bigl\Vert W_{1}^{T} \bigl(W_{1} W_{1}^{T}\bigr)^{N-1} W_{1} \bigr\Vert \\ ={}& \bigl\Vert W_{1}^{T} \bigl(W_{2}^{T} W_{2} + D_{2}\bigr)^{N-1} W_{1} \bigr\Vert \\ \leq{}& \bigl\Vert W_{1}^{T} \bigl(W_{2}^{T} W_{2}\bigr)^{N-1} W_{1} \bigr\Vert \\ &{}+ \sum _{k=0}^{N-2} \binom{N-1}{k} \Vert W_{1} \Vert \bigl\Vert W_{2}^{T} W_{2} \bigr\Vert ^{k} \Vert D_{2} \Vert ^{N-k-1} \Vert W_{1} \Vert \\ \leq{}& \bigl\Vert W_{1}^{T} \bigl(W_{2}^{T} W_{2}\bigr)^{N-1} W_{1} \bigr\Vert \\ &{}+ \Vert W_{1} \Vert ^{2} \Biggl(\sum_{k=0}^{N-1} \binom{N-1}{k} \Vert W_{2} \Vert ^{2k} \delta ^{N-k-1} - \Vert W_{2} \Vert ^{2(N-1)} \Biggr) \\ ={}& \bigl\Vert W_{1}^{T} W_{2}^{T} \bigl(W_{2} W_{2}^{T}\bigr)^{N-2} W_{2} W_{1} \bigr\Vert \\ &{} + \Vert W_{1} \Vert ^{2} \bigl( \bigl( \Vert W_{2} \Vert ^{2} + \delta \bigr)^{N-1} - \Vert W_{2} \Vert ^{2(N-1)} \bigr). \end{aligned} \end{aligned}$$
(3.8)

In the first inequality, we expanded \((W_{2}^{T} W_{2} + D_{2})^{N-1}\) as a (matrix) polynomial in \(W_{2}^{T} W_{2}\) and \(D_{2}\), observing that the highest degree term is \((W_{2}^{T} W_{2})^{N-1}\). Applying the triangle inequality separates this term from the rest of the polynomial. Applying the submultiplicativity of the spectral norm to all the summands and collecting terms (which now consist of commuting scalars, i.e., the spectral norms \(\|W_{1}\|\), \(\|W_{2}^{T} W_{2}\|\) and \(\|D_{2}\|\)) gives the sum in (3.8), where the index \(k=N-1\) is left out as it was already taken care of in the first term in (3.8).

We continue in this way, replacing \((W_{2} W_{2}^{T})^{N-2}\) by \((W_{3}^{T} W_{3} + D_{3})^{N-2}\), and so on. Using also (3.7), we observe that similarly as above, for \(j = 2,\ldots ,N-1\),

$$\begin{aligned} & \bigl\Vert W_{1}^{T} \cdots W_{j}^{T} \bigl(W_{j} W_{j}^{T}\bigr)^{N-j} W_{j} \cdots W_{1} \bigr\Vert \\ &\quad\leq \bigl\Vert W_{1}^{T} \cdots W_{j+1}^{T} \bigl(W_{j+1} W_{j+1}^{T}\bigr)^{N-j-1} W_{j+1} \cdots W_{1} \bigr\Vert \\ &\qquad{}+ \Vert W_{j} \Vert ^{2} \cdots \Vert W_{1} \Vert ^{2} \bigl( \bigl( \Vert W_{j+1} \Vert ^{2} + \delta \bigr)^{N-j} - \Vert W_{j+1} \Vert ^{2(N-j)} \bigr) \\ &\quad\leq \bigl\Vert W_{1}^{T} \cdots W_{j+1}^{T} \bigl(W_{j+1} W_{j+1}^{T}\bigr)^{N-j-1} W_{j+1} \cdots W_{1} \bigr\Vert \\ &\qquad{}+ \Vert W_{1} \Vert ^{2} \bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr)\cdots \bigl( \Vert W_{1} \Vert ^{2} + (j-1)\delta \bigr) \\ &\qquad{}\times\bigl( \bigl( \Vert W_{1} \Vert ^{2} + (j+1 ) \delta \bigr)^{N-j} - \bigl( \Vert W_{1} \Vert ^{2} + j \delta \bigr)^{N-j} \bigr) \\ &\quad\leq \bigl\Vert W_{1}^{T} \cdots W_{j+1}^{T} \bigl(W_{j+1} W_{j+1}^{T}\bigr)^{N-j-1} W_{j+1} \cdots W_{1} \bigr\Vert \\ &\qquad{}+ \bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr) \bigl( \Vert W_{1} \Vert ^{2}+2\delta \bigr)\cdots \bigl( \Vert W_{1} \Vert ^{2} + j\delta \bigr) \\ &\qquad{}\times\bigl( \bigl( \Vert W_{1} \Vert ^{2} + (j+1 )\delta \bigr)^{N-j} - \bigl( \Vert W_{1} \Vert ^{2} + j \delta \bigr)^{N-j} \bigr). \end{aligned}$$

Hereby, we have also used that the function \(x \mapsto (x+\delta )^{N-j} - x^{N-j}\) is monotonically increasing in \(x \geq 0\). With this estimate, we obtain, noting below that the sum in the second line is telescoping, that

$$\begin{aligned} \Vert W_{1} \Vert ^{2N} \leq{}& \bigl\Vert W_{1}^{T} \cdots W_{N}^{T} W_{N} \cdots W_{1} \bigr\Vert \\ & {}+ \sum_{j=1}^{N-1} \bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr) \bigl( \Vert W_{1} \Vert ^{2}+2\delta \bigr)\cdots \bigl( \Vert W_{1} \Vert ^{2} + j\delta \bigr)\\ &{}\times \bigl( \bigl( \Vert W_{1} \Vert ^{2} + (j+1 )\delta \bigr)^{N-j} - \bigl( \Vert W_{1} \Vert ^{2} + j\delta \bigr)^{N-j} \bigr) \\ ={}& \Vert W_{N} \cdots W_{1} \Vert ^{2} + \bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr) \bigl( \Vert W_{1} \Vert ^{2}+2\delta \bigr)\cdots \bigl( \Vert W_{1} \Vert ^{2} + N \delta \bigr) \\ &{}- \bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr)^{N} \\ = {}&\Vert W \Vert ^{2} + Q_{N,\delta}\bigl( \Vert W_{1} \Vert ^{2}+\delta \bigr). \end{aligned}$$

This proves the claimed inequality (3.5).

The fact that for all \(z,\alpha \in \mathbb{R}\) it holds \(z(z+\alpha )\leq (z+\frac{\alpha}{2} )^{2}\) implies that

$$\begin{aligned} (x+\delta ) (x+2\delta )\cdots (x+N\delta ) & = \bigl((x+\delta ) (x+N \delta ) \bigr)\cdot \bigl((x+2\delta ) \bigl(x+(N-1)\delta \bigr) \bigr) \cdots \\ &\leq \biggl(x + \frac{N+1}{2}\delta \biggr)^{N}. \end{aligned}$$
(3.9)

Setting \(x = \|W_{1}\|^{2}\), \(a= \|W\|^{2}\) and \(b = \frac{N+1}{2}\delta \) and combining inequality (3.5) and the definition of \(Q_{N,\delta}\) with (3.9), leads to \(x^{N} \leq a + (x + b)^{N} - (x+\delta )^{N}\) and hence

$$ x^{N} \leq a + (x + b)^{N} - x^{N}. $$
(3.10)

The mean-value theorem applied to the map \(x \mapsto x^{N}\) gives

$$ (x+b)^{N} = x^{N} + N \xi ^{N-1} b\quad \text{for some } \xi \in [x, x+b]. $$

Hence,

$$ x^{N} \leq a + N\xi ^{N-1}b \leq a + N(x+b)^{N-1} b. $$

We assume now that \(a>0\) and will comment on the case \(a=0\) below. Then, the previous inequality implies

$$\begin{aligned} \frac{x^{N}}{a} \leq & 1+N\frac{ (x+b )^{N-1}b}{a}, \end{aligned}$$

which is equivalent to

$$ \biggl(\frac{x}{a^{\frac{1}{N}}} \biggr)^{N}\leq 1+N \biggl( \frac{x}{a^{\frac{1}{N}}}+\frac{b}{a^{\frac{1}{N}}} \biggr)^{N-1} \frac{b}{a^{\frac{1}{N}}}. $$
(3.11)

Setting \(z=a^{-\frac{1}{N}} x\) and \(c=a^{-\frac{1}{N}} b\), we obtain

$$ z^{N}\leq 1+Nc (z+c )^{N-1}. $$

We claim that \(z\leq 1+2Nc\). Assume on the contrary that \(z> 1+2Nc\). Then, (3.11) gives

$$\begin{aligned} z&\leq \frac{1}{z^{N-1}}+Nc \biggl(1+\frac{c}{z} \biggr)^{N-1} < 1+Nc \biggl(1+\frac{c}{1+2Nc} \biggr)^{N-1} \leq 1+Nc \biggl(1+ \frac{1/2}{N} \biggr)^{N} \\ &\leq 1+Nce^{\frac{1}{2}}. \end{aligned}$$

The last inequality implies \(z\leq 1+2Nc\), which is a contradiction. Thus, we showed the claim that \(z\leq 1+2Nc\), that is, \(x a^{-\frac{1}{N}} \leq 1 + 2N a^{-\frac{1}{N}} b\), which is equivalent to

$$ x \leq a^{\frac{1}{N}}+2Nb. $$
(3.12)

The last inequality also holds in the case \(a=0\), since for \(a=0\) inequality (3.10) remains true if we replace a by any positive number ε and then by our reasoning above \(x\leq \varepsilon ^{\frac{1}{N}}+2Nb\). Since this is true for any \(\varepsilon >0\), it follows that for \(a=0\) we have \(x\leq 2Nb= a^{\frac{1}{N}}+2Nb\), thus (3.12) also holds for \(a=0\).

Using the definitions of \(a,b\) and x, we obtain from (3.12) that

$$ \Vert W_{1} \Vert ^{2}\leq \Vert W \Vert ^{\frac{2}{N}}+N(N+1)\delta. $$

For any \(j = 1,\ldots ,N\), (3.7) implies then that

$$ \Vert W_{j} \Vert ^{2} \leq \Vert W_{1} \Vert ^{2} + (j-1)\delta \leq \Vert W \Vert ^{ \frac{2}{N}} + N(N+1)\delta + (j-1)\delta \leq \Vert W \Vert ^{\frac{2}{N}} + (N+1)^{2} \delta. $$

This completes the proof. □
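
The bound of Proposition 3.2 is easy to check numerically; the following minimal sketch (dimensions and random data are arbitrary choices of ours) verifies it for a random, generally unbalanced tuple.

```python
import numpy as np

# quick numerical check of Proposition 3.2
rng = np.random.default_rng(1)
dims = [6, 4, 5, 3]                        # d_0, ..., d_N for N = 3
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
N = len(Ws)

W = Ws[0]
for M in Ws[1:]:
    W = M @ W                              # product matrix W = W_N ... W_1

# balancedness constant of the tuple, cf. (2.7)
delta = max(np.linalg.norm(Wn.T @ Wn - Wp @ Wp.T, 2) for Wp, Wn in zip(Ws[:-1], Ws[1:]))
bound = np.linalg.norm(W, 2) ** (2 / N) + (N + 1) ** 2 * delta
print(all(np.linalg.norm(Wj, 2) ** 2 <= bound for Wj in Ws))   # expected: True
```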

3.2 Preservation of approximate balancedness

The key ingredient to the proof of Theorem 2.4 is the following proposition. It is a highly nontrivial extension of [11, Lemma 3.1] from \(N=2\) layers to an arbitrary number of layers.

Proposition 3.4

Assume that \(XX^{T}\) has full rank and \(\overrightarrow{W}(0)=(W_{1}(0),\ldots ,W_{N}(0))\) has balancedness constant αδ for some \(\delta > 0\) and \(\alpha \in [0,1)\). Assume that the positive stepsizes \(\eta _{k}\) satisfy (3.14). Then, the gradient descent iterates \(\overrightarrow{W}(k) = (W_{1}(k),\ldots ,W_{N}(k))\) defined by (2.1) satisfy, for all \(k \in \mathbb{N}_{0}\):

  1. (1)

    \(\overrightarrow{W}(k)\) has balancedness constant δ, i.e.,

    $$ \bigl\Vert W^{T}_{j+1}(k)W_{j+1}(k) -W_{j}(k)W_{j}^{T}(k) \bigr\Vert \leq \delta\quad \textit{for all } j=1,\ldots ,N-1; $$
    (3.13)
  2. (2)

    \(L^{N} (\overrightarrow{W}(k) )\leq L^{N} ( \overrightarrow{W}(0) )\);

  3. (3)

    \(\|W_{j}(k)\|^{2}\leq K_{\delta }= M^{\frac{2}{N}}+(N+1)^{2}\delta \) for \(j=1,\ldots ,N\);

  4. (4)

    \(L^{N} (\overrightarrow{W}(k) ) -L^{N} (\overrightarrow{W}(k+1) ) \geq \sigma \eta _{k} \Vert \nabla L^{N} ( \overrightarrow{W}(k) ) \Vert _{F}^{2}\).

Proof

We will show statements (1), (2), and (3) by induction under the condition that

$$ \eta _{k} \leq \min \biggl\{ \frac{2(1-\sigma )}{B_{\delta}}, \frac{\sigma (1-\alpha ) \delta}{2 L^{N}(\overrightarrow{W}(0))} \biggr\} \quad\text{for all } k \in \mathbb{N}, $$
(3.14)

holds for some \(\sigma \in (0,1)\). The choice

$$ \sigma = \frac{4 L^{N}(\overrightarrow{W}(0))}{4 L^{N}(\overrightarrow{W}(0)) + (1-\alpha )\delta B_{\delta}} $$

reduces (3.14) to (2.8). In the induction step for (2), we will show that if (3) holds for k, then (4) also holds for k. Below, we will always denote \(W(k) = W_{N}(k) \cdots W_{1}(k)\).

Since \(\overrightarrow{W}(0)\) has balancedness constant \(\alpha \delta <\delta \) by assumption, (3.13) is clearly satisfied for \(k=0\). Statement (2) is trivial for \(k=0\). The bound in (3) follows from a direct combination of Proposition 3.2 with Lemma 3.1, i.e., for \(j=1,\ldots ,N\),

$$\begin{aligned} \bigl\Vert W_{j}(0) \bigr\Vert ^{2} &\leq \bigl\Vert W(0) \bigr\Vert ^{\frac{2}{N}} + (N+1)^{2}\delta \leq \biggl( \frac{\sqrt{2 L^{N}(\overrightarrow{W}(0))} + \Vert Y \Vert }{\sigma _{\min}(X)} \biggr)^{\frac{2}{N}} + (N+1)^{2} \delta\\ & = M^{\frac{2}{N}} + (N+1)^{2} \delta, \end{aligned}$$

using also the definition of M in (2.11).

For the induction step, we assume that \((1)\), \((2)\), and \((3)\) hold for \(0,1,\dots, k\) and prove that these three properties also hold for \(k+1\).

Step 1: We first prove statement (2) for \(k+1\). To do so, we will show that if statement (3) holds for k, then statement (4) also holds for k. This also proves (4) once the induction for (1), (2), and (3) is completed.

We consider the Taylor expansion

$$\begin{aligned} L^{N} \bigl(\overrightarrow{W}(k+1) \bigr) ={} &L^{N} \bigl( \overrightarrow{W}(k) \bigr)+ \bigl\langle \nabla L^{N} \bigl( \overrightarrow{W}(k) \bigr),\overrightarrow{W}(k+1)-\overrightarrow{W}(k) \bigr\rangle \\ &{} +\frac{1}{2} \bigl\langle \bigl( \overrightarrow{W}(k+1)- \overrightarrow{W}(k) \bigr)^{T}\nabla ^{2} L^{N} ( \overrightarrow{A}_{\xi} ), \overrightarrow{W}(k+1)- \overrightarrow{W}(k) \bigr\rangle , \end{aligned}$$

where

$$\nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr)= \begin{pmatrix} \nabla _{W_{1}}L^{N} (\overrightarrow{W}(k) ) \\ \vdots \\ \nabla _{W_{N}}L^{N} (\overrightarrow{W}(k) ) \end{pmatrix} $$

and \(\overrightarrow{A}_{\xi}= (A^{1}_{\xi},\dots,A^{N}_{\xi} )\) with

$$ A^{i}_{\xi}=W_{i}(k)+\xi \bigl(W_{i}(k+1)-W_{i}(k) \bigr) \quad\text{for some } \xi \in [0,1],~i=1,\dots,N. $$

Since by definition \(W_{j}(k+1)=W_{j}(k)-\eta _{k} \nabla _{W_{j}}L^{N}( \overrightarrow{W}(k))\), this Taylor expansion can be written as

$$\begin{aligned} L^{N} \bigl(\overrightarrow{W}(k+1) \bigr)={}&L^{N} \bigl( \overrightarrow{W}(k) \bigr)-\eta _{k} \bigl\langle \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr),\nabla L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr\rangle _{F} \\ &{}+\frac{1}{2}\eta ^{2}_{k} \bigl\langle \nabla L^{N} \bigl( \overrightarrow{W}(k) \bigr), \nabla ^{2} L^{N} ( \overrightarrow{A}_{\xi} )\nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\rangle _{F}. \end{aligned}$$

By the Cauchy–Schwarz inequality, we obtain

$$\begin{aligned} & L^{N} \bigl(\overrightarrow{W}(k) \bigr)-L^{N} \bigl( \overrightarrow{W}(k+1) \bigr) \\ &\quad\geq \eta _{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2} -\frac{1}{2} \eta ^{2}_{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2} \bigl\Vert \nabla ^{2} L^{N} (\overrightarrow{A}_{ \xi} ) \bigr\Vert _{F\to F} \\ &\quad\geq \biggl(1-\frac{1}{2}\eta _{k} \bigl\Vert \nabla ^{2} L^{N} ( \overrightarrow{A}_{\xi} ) \bigr\Vert _{F\to F} \biggr)\eta _{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2}. \end{aligned}$$
(3.15)

The crucial point now is to show that \(\Vert \nabla ^{2} L^{N} (\overrightarrow{A}_{\xi} ) \Vert _{F\to F}\) is bounded by the constant \(B_{\delta}\) defined in (2.9). By setting \(\overrightarrow{\Delta}= (\Delta _{1},\dots, \Delta _{N} )\) with \(\Delta _{j}\in \mathbb{R}^{d_{j}\times d_{j-1}}\), \(j=1,\ldots ,N\), and writing \(\nabla ^{2} L^{N} (\overrightarrow{W} ) ( \overrightarrow{\Delta},\overrightarrow{\Delta} )\) for \(\langle \overrightarrow{\Delta}, \nabla ^{2} L^{N} ( \overrightarrow{W} )\overrightarrow{\Delta} \rangle \), the quadratic form \(\nabla ^{2} L^{N} (\overrightarrow{W} ) ( \overrightarrow{\Delta},\overrightarrow{\Delta} )\) defined by the Hessian can be written as

$$\begin{aligned} \nabla ^{2} L^{N} (\overrightarrow{W} ) ( \overrightarrow{ \Delta},\overrightarrow{\Delta} )&=\sum_{j=1}^{N} \sum_{i=1}^{N} \biggl\langle \Delta _{j}, \frac{\partial ^{2} L^{N}(\overrightarrow{W})}{\partial W_{i}\partial W_{j}} \Delta _{i} \biggr\rangle \\ &=\sum_{i=1}^{N} \biggl\langle \Delta _{i}, \frac{\partial ^{2} L^{N}(\overrightarrow{W})}{\partial W_{i}^{2}} \Delta _{i} \biggr\rangle +\sum _{j=1}^{N}\sum_{ \substack{i=1 \\ i\neq j}}^{N} \biggl\langle \Delta _{j}, \frac{\partial ^{2} L^{N}(\overrightarrow{W})}{\partial W_{i}\partial W_{j}} \Delta _{i} \biggr\rangle . \end{aligned}$$

In order to compute mixed second derivatives, we introduce the notation

$$\begin{aligned} &Q_{i}(\overrightarrow{W},\Delta _{i}) = W_{N} \cdots W_{i+1} \Delta _{i} W_{i-1} \cdots W_{1} X, \\ &P_{i,j}(\overrightarrow{W}, \Delta _{i}, \Delta _{j}) = \textstyle\begin{cases} W_{N} \cdots W_{j+1} \Delta _{j} W_{j-1} \cdots W_{i+1} \Delta _{i} W_{i-1} \cdots W_{1} & \text{if } j > i, \\ W_{N} \cdots W_{i+1} \Delta _{i} W_{i-1} \cdots W_{j+1} \Delta _{j} W_{j-1} \cdots W_{1} & \text{if } j < i, \end{cases}\displaystyle \end{aligned}$$

with the understanding that \(W_{i-1} \cdots W_{1} = \operatorname{Id}\) for \(i = 1\) and \(W_{N}\cdots W_{i+1} = \operatorname{Id}\) for \(i=N\). Using the first partial derivatives of \(L^{N}\), cf. (3.2), we obtain, for \(i=1,\ldots , N\),

$$\begin{aligned} \biggl\langle \Delta _{i}, \frac{\partial ^{2} L^{N}(\overrightarrow{W})}{\partial W_{i}^{2}} \Delta _{i} \biggr\rangle = \bigl\langle Q_{i}(\overrightarrow{W}, \Delta _{i}), Q_{i}(\overrightarrow{W}, \Delta _{i}) \bigr\rangle = \bigl\Vert Q_{i}( \overrightarrow{W}, \Delta _{i}) \bigr\Vert _{F}^{2}. \end{aligned}$$

The mixed second order derivatives are given, for \(i \neq j\), by

$$\begin{aligned} \biggl\langle \Delta _{i}, \frac{\partial ^{2} L^{N}(\overrightarrow{W})}{\partial W_{i}\partial W_{j}} \Delta _{j} \biggr\rangle & = \bigl\langle Q_{i}(\overrightarrow{W}, \Delta _{i}), Q_{j}(\overrightarrow{W},\Delta _{j}) \bigr\rangle + \bigl\langle \nabla L^{1}(W), P_{i,j}(\overrightarrow{W}, \Delta _{i},\Delta _{j}) \bigr\rangle . \end{aligned}$$

This implies that

$$\begin{aligned} \nabla ^{2} L^{N} (\overrightarrow{A_{\xi}} ) ( \overrightarrow{\Delta},\overrightarrow{\Delta} ) ={}& \sum _{i=1}^{N} \bigl\Vert Q_{i}( \overrightarrow{A}_{\xi},\Delta _{i}) \bigr\Vert _{F}^{2} + \sum_{ \substack{i,j = 1\\i \neq j}}^{N} \bigl\langle Q_{i}( \overrightarrow{A}_{\xi},\Delta _{i}), Q_{j}(\overrightarrow{A}_{\xi}, \Delta _{j}) \bigr\rangle \\ & {}+ \sum_{\substack{i,j = 1\\i \neq j}}^{N} \bigl\langle A_{\xi }XX^{T} - YX^{T}, P_{i,j}( \overrightarrow{A}_{\xi}, \Delta _{i},\Delta _{j}) \bigr\rangle , \end{aligned}$$

where \(A_{\xi }= A_{\xi}^{N} \cdots A_{\xi}^{1}\). The Cauchy–Schwarz inequality for the trace inner product together with \(\|A B \|_{F} \leq \|A\| \|B\|_{F}\) for any matrices \(A,B\) of matching dimensions gives, for \(i > j\),

$$\begin{aligned} & \bigl\vert \bigl\langle A_{\xi }XX^{T} - YX^{T}, P_{i,j}( \overrightarrow{A}_{\xi}, \Delta _{i},\Delta _{j}) \bigr\rangle \bigr\vert \\ &\quad = \bigl\vert \operatorname{tr}\bigl(\bigl(A_{\xi }XX^{T} - YX^{T}\bigr)^{T} A_{\xi}^{N} \cdots A_{ \xi}^{i+1} \Delta _{i} A_{\xi}^{i-1} \cdots A_{\xi}^{j+1} \Delta _{j} A_{ \xi}^{j-1} \cdots A_{\xi}^{1}\bigr) \bigr\vert \\ &\quad \leq \bigl\Vert \bigl(A_{\xi }XX^{T} - YX^{T} \bigr)^{T} A_{\xi}^{N} \cdots A_{\xi}^{i+1} \Delta _{i} \bigr\Vert _{F} \bigl\Vert A_{\xi}^{i-1} \cdots A_{\xi}^{j+1} \Delta _{j} A_{ \xi}^{j-1} \cdots A_{\xi}^{1} \bigr\Vert _{F} \\ &\quad \leq \bigl\Vert A_{\xi }XX^{T} - YX^{T} \bigr\Vert \bigl\Vert A_{\xi}^{N} \bigr\Vert \cdots \bigl\Vert A_{ \xi}^{i+1} \bigr\Vert \Vert \Delta _{i} \Vert _{F} \bigl\Vert A_{\xi}^{i-1} \bigr\Vert \cdots \bigl\Vert A_{ \xi}^{j+1} \bigr\Vert \Vert \Delta _{j} \Vert _{F} \bigl\Vert A_{\xi}^{j-1} \bigr\Vert \cdots \bigl\Vert A_{\xi}^{1} \bigr\Vert , \end{aligned}$$

and similarly, for \(i< j\). Another application of the Cauchy–Schwarz inequality gives

$$ \bigl\vert \bigl\langle Q_{i}(\overrightarrow{A}_{\xi}, \Delta _{i}), Q_{j}( \overrightarrow{A}_{\xi}, \Delta _{j})\bigr\rangle \bigr\vert \leq \bigl\Vert Q_{i}( \overrightarrow{A}_{\xi},\Delta _{i}) \bigr\Vert _{F} \bigl\Vert Q_{j}( \overrightarrow{A}_{\xi}, \Delta _{j}) \bigr\Vert _{F}. $$

Consequently,

$$\begin{aligned} \bigl\vert \nabla ^{2} L^{N} (\overrightarrow{A_{\xi}} ) ( \overrightarrow{\Delta},\overrightarrow{\Delta} ) \bigr\vert \leq{}& \sum_{i,j = 1}^{N} \bigl\Vert Q_{i}(\overrightarrow{A}_{\xi},\Delta _{i}) \bigr\Vert _{F} \bigl\Vert Q_{j}(\overrightarrow{A}_{\xi}, \Delta _{j}) \bigr\Vert _{F} \\ &{} + \sum_{\substack{i,j = 1\\i \neq j}}^{N} \bigl\Vert A_{\xi }XX^{T} - Y X^{T} \bigr\Vert \Vert \Delta _{i} \Vert _{F} \Vert \Delta _{j} \Vert _{F} \prod_{ \substack{k=1 \\ k \neq i,j}}^{N} \bigl\Vert A_{\xi}^{k} \bigr\Vert \\ \leq{}& \Vert X \Vert ^{2} \sum_{i,j=1}^{N} \Vert \Delta _{i} \Vert _{F} \Vert \Delta _{j} \Vert _{F} \Biggl(\prod _{\underset{k\neq i}{k=1}}^{N} \bigl\Vert A^{k}_{\xi} \bigr\Vert \Biggr) \Biggl(\prod_{\underset{k\neq j}{k=1}}^{N} \bigl\Vert A^{k}_{\xi} \bigr\Vert \Biggr) \\ &{} + \sum_{\substack{i,j = 1\\i \neq j}}^{N} \Vert \Delta _{i} \Vert _{F} \Vert \Delta _{j} \Vert _{F} \Biggl( \Vert X \Vert ^{2} \prod _{k=1}^{N} \bigl\Vert A_{\xi}^{k} \bigr\Vert + \bigl\Vert Y X^{T} \bigr\Vert \Biggr) \prod _{\substack{k=1 \\ k \neq i,j}}^{N} \bigl\Vert A_{ \xi}^{k} \bigr\Vert . \end{aligned}$$
(3.16)

Using the recursive definition of \(W_{i}(k+1)\) and that \(\xi \in [0,1]\), we further obtain, for \(i=1,\dots,N\),

$$\begin{aligned} \bigl\Vert A^{i}_{\xi} \bigr\Vert &= \bigl\Vert W_{i}(k)+\xi \bigl(W_{i}(k+1)-W_{i}(k) \bigr) \bigr\Vert \leq \bigl\Vert W_{i}(k) \bigr\Vert + \bigl\Vert W_{i}(k+1)-W_{i}(k) \bigr\Vert \\ & = \bigl\Vert W_{i}(k) \bigr\Vert + \bigl\Vert \eta _{k}\nabla _{W_{i}}L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr\Vert \\ & = \bigl\Vert W_{i}(k) \bigr\Vert + \eta _{k} \bigl\Vert W_{i+1}^{T}(k)\cdots W_{N}^{T}(k) \nabla L^{1}\bigl(W(k)\bigr)W_{1}^{T}(k)\cdots W_{i-1}^{T}(k) \bigr\Vert . \end{aligned}$$

It follows from (3.4) and induction hypothesis (2) for k that

$$ \bigl\Vert \nabla L^{1}\bigl(W(k)\bigr) \bigr\Vert \leq \sqrt{2 L^{N}\bigl(\overrightarrow{W}(k)\bigr)} \Vert X \Vert \leq \sqrt{2 L^{N}\bigl(\overrightarrow{W}(0)\bigr)} \Vert X \Vert . $$
(3.17)

Using induction hypothesis (3) for k this gives

$$\begin{aligned} \bigl\Vert A^{i}_{\xi} \bigr\Vert &\leq \bigl\Vert W_{i}(k) \bigr\Vert +\eta _{k} \sqrt{2 L^{N} \bigl( \overrightarrow{W}(0)\bigr)} \Vert X \Vert \Biggl(\prod _{\underset{j\neq i}{j=1}}^{N} \bigl\Vert W_{j}(k) \bigr\Vert \Biggr) \\ & \leq K_{\delta}^{1/2} + \eta _{k} \sqrt{2 L^{N}\bigl(\overrightarrow{W}(0)\bigr)} \Vert X \Vert K_{\delta}^{\frac{N-1}{2}}. \end{aligned}$$

By assumption (3.14) on the stepsize \(\eta _{k}\) and the definitions of \(K_{\delta}\) and \(B_{\delta}\), we have

$$\begin{aligned} &\eta _{k} \sqrt{2L^{N}\bigl(\overrightarrow{W}(0)\bigr)} \Vert X \Vert K_{\delta}^{ \frac{N-1}{2}}\\ & \quad\leq \frac{2(1-\sigma )}{B_{\delta}} \sqrt{2 L^{N}\bigl( \overrightarrow{W}(0)\bigr)} \Vert X \Vert K_{\delta}^{\frac{N-1}{2}} \leq \frac{2 \sqrt{2 L^{N}(\overrightarrow{W}(0))} \Vert X \Vert K_{\delta}^{\frac{N-1}{2}}}{2eN K_{\delta}^{N-1} \Vert X \Vert ^{2}} \\ & \quad\leq \frac{M \sigma _{\min}(X) \Vert X \Vert K_{\delta}^{\frac{N-1}{2}}}{eN K_{\delta}^{N-1} \Vert X \Vert ^{2}} \leq \frac{(M^{\frac{2}{N}} + N^{2} \delta )^{\frac{N}{2}} K_{\delta}^{\frac{N-1}{2}}}{eN K_{\delta}^{N-1}} \leq \frac{1}{2N} K_{\delta}^{\frac{1}{2}}. \end{aligned}$$

In the first inequality of the last line, we used that, by the definition of M,

$$ \sqrt{2 L^{N}\bigl(\overrightarrow{W}(0)\bigr)}=M \sigma _{\min}(X)- \Vert Y \Vert \leq M \sigma _{\min}(X), $$

and in the last inequality of the last line, we used \(M^{\frac {2}{N}}+N^{2}\delta \leq M^{\frac {2}{N}}+(N+1)^{2}\delta =K_{ \delta}\).

It follows that

$$ \bigl\Vert A^{i}_{\xi} \bigr\Vert \leq \biggl(1 + \frac{1}{2N} \biggr) K_{\delta}^{1/2}. $$

Substituting this bound into (3.16), we obtain

$$\begin{aligned} & \bigl\vert \nabla ^{2} L^{N} (\overrightarrow{A_{\xi}} ) (\overrightarrow{\Delta},\overrightarrow{\Delta} ) \bigr\vert \\ &\quad\leq \biggl(1+\frac{1}{2N} \biggr)^{2N-2} K_{\delta}^{N-1} \Vert X \Vert ^{2}\sum _{i,j=1}^{N} \Vert \Delta _{j} \Vert _{F} \Vert \Delta _{i} \Vert _{F} \\ &\qquad{}+ \biggl( \biggl(1+\frac{1}{2N} \biggr)^{2N-2} K_{\delta}^{N-1} \Vert X \Vert ^{2}\\ &\qquad{}+ \biggl(1+ \frac{1}{2N} \biggr)^{N-2} K_{\delta}^{N/2-1} \bigl\Vert XY^{T} \bigr\Vert \biggr) \sum_{j=1}^{N} \sum_{\underset{i\neq j}{i=1}}^{N} \Vert \Delta _{j} \Vert _{F} \Vert \Delta _{i} \Vert _{F} \\ &\quad\leq e K_{\delta}^{N-1} \Vert X \Vert ^{2} \Biggl( \sum_{j=1}^{N} \Vert \Delta _{j} \Vert \Biggr)^{2}+ \bigl(e K_{\delta}^{N-1} \Vert X \Vert ^{2}+e^{1/2} K_{\delta}^{N/2-1} \bigl\Vert XY^{T} \bigr\Vert \bigr) \Biggl( \sum _{j=1}^{N} \Vert \Delta _{j} \Vert \Biggr)^{2} \\ &\quad\leq \bigl[2e N K_{\delta}^{N-1} \Vert X \Vert ^{2} + \sqrt{e} N K_{\delta}^{N/2-1} \bigl\Vert XY^{T} \bigr\Vert \bigr] \Vert \overrightarrow{\Delta} \Vert _{F}^{2}, \end{aligned}$$

where we have used the fact that \((1+1/(2N))^{2N} \leq e\) and that \(\sum_{j=1}^{N}\|\Delta _{j}\|_{F} \leq \sqrt{N} \| \overrightarrow{\Delta}\|_{F}\). Hence, we derived that

$$ \bigl\Vert \nabla ^{2}L^{N} (\overrightarrow{A_{\xi}} ) \bigr\Vert _{F\to F} \leq 2e N K_{\delta}^{N-1} \Vert X \Vert ^{2} + \sqrt{e} N K_{ \delta}^{\frac{N}{2}-1} \bigl\Vert XY^{T} \bigr\Vert = B_{\delta}. $$

Substituting this estimate into (3.15) and using that the stepsizes satisfy (3.14) gives

$$\begin{aligned} L^{N} \bigl(\overrightarrow{W}(k) \bigr) -L^{N} \bigl( \overrightarrow{W}(k+1) \bigr)&\geq \biggl(1-\frac{1}{2}\eta _{k} B_{ \delta} \biggr)\eta _{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2} \\ & \geq \sigma \eta _{k} \bigl\Vert \nabla L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2} \geq 0. \end{aligned}$$
(3.18)

This shows statement (4) for k. Combined with the induction hypothesis (2) for k, this yields

$$ L^{N} \bigl(\overrightarrow{W}(0) \bigr) \geq L^{N} \bigl( \overrightarrow{W}(k) \bigr) \geq L^{N} \bigl(\overrightarrow{W}(k+1) \bigr). $$

This shows statement (2) for \(k+1\).

Step 2: Let us now show that statement (1) holds at iteration \(k+1\). For \(j=1,\ldots ,N-1\), we obtain

$$\begin{aligned} & \bigl\Vert W^{T}_{j+1}(k+1)W_{j+1}(k+1) -W_{j}(k+1)W_{j}^{T}(k+1) \bigr\Vert \\ &\quad= \bigl\Vert \bigl(W_{j+1}(k)-\eta _{k}\nabla _{W_{j+1}}L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr)^{T} \bigl(W_{j+1}(k)-\eta _{k}\nabla _{W_{j+1}}L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr) \\ &\qquad{}- \bigl(W_{j}(k)-\eta _{k} \nabla _{W_{j}}L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr) \bigl(W_{j}(k)-\eta _{k}\nabla _{W_{j}}L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr)^{T} \bigr\Vert \\ &\quad= \bigl\Vert W^{T}_{j+1}(k)W_{j+1}(k)-W_{j}(k)W_{j}^{T}(k) \\ &\qquad{}+\eta _{k} \bigl(- W_{j+1}^{T}(k)W_{j+2}^{T}(k) \cdots W_{N}^{T}(k) \nabla L^{1}\bigl(W(k) \bigr)W_{1}^{T}(k)\cdots W_{j}^{T}(k) \\ &\qquad{}- W_{j}(k)\cdots W_{1}(k)\nabla ^{T} L^{1}\bigl(W(k)\bigr)W_{N}(k) \cdots W_{j+2}(k)W_{j+1}(k) \\ &\qquad{}+ W_{j}(k)W_{j-1}(k)\cdots W_{1}(k)\nabla ^{T} L^{1}\bigl(W(k)\bigr)W_{N}(k) \cdots W_{j+2}(k)W_{j+1}(k) \\ &\qquad{}+ W_{j+1}^{T}(k)W_{j+2}^{T}(k)\cdots W_{N}^{T}(k) \nabla L^{1}\bigl(W(k) \bigr)W_{1}^{T}(k)\cdots W_{j-1}^{T}(k)W_{j}^{T}(k) \bigr) \\ &\qquad{}+ \eta _{k}^{2} \bigl(\nabla ^{T}_{W_{j+1}}L^{N} \bigl(\overrightarrow{W}(k) \bigr)\nabla _{W_{j+1}}L^{N} \bigl( \overrightarrow{W}(k) \bigr)- \nabla _{W_{j}}L^{N} \bigl( \overrightarrow{W}(k) \bigr)\nabla ^{T}_{W_{j}}L^{N} \bigl( \overrightarrow{W}(k) \bigr) \bigr) \bigr\Vert \\ &\quad\leq \bigl\Vert W^{T}_{j+1}(k)W_{j+1}(k)-W_{j}(k)W_{j}^{T}(k) \bigr\Vert + \eta _{k}^{2} \bigl( \bigl\Vert \nabla _{W_{j+1}} L^{N}\bigl(\overrightarrow{W}(k)\bigr) \bigr\Vert ^{2} + \bigl\Vert \nabla _{W_{j}} L^{N}\bigl( \overrightarrow{W}(k)\bigr) \bigr\Vert ^{2} \bigr), \end{aligned}$$

where in the last step we have used that the four terms multiplying \(\eta _{k}\) cancel in pairs, together with the triangle inequality and the submultiplicativity of the norm.

Applying this inequality repeatedly, we obtain

$$\begin{aligned} & \bigl\Vert W^{T}_{j+1}(k+1)W_{j+1}(k+1) -W_{j}(k+1)W_{j}^{T}(k+1) \bigr\Vert \\ & \quad\leq \bigl\Vert W^{T}_{j+1}(0)W_{j+1}(0) -W_{j}(0)W_{j}^{T}(0) \bigr\Vert \\ &\qquad{}+ \sum _{ \ell =0}^{k} \eta _{\ell}^{2} \bigl( \bigl\Vert \nabla _{W_{j+1}} L^{N}\bigl( \overrightarrow{W}(\ell )\bigr) \bigr\Vert ^{2} + \bigl\Vert \nabla _{W_{j}} L^{N}\bigl( \overrightarrow{W}(\ell )\bigr) \bigr\Vert ^{2} \bigr) \\ & \quad\leq \alpha \delta + 2 \Bigl(\max_{\ell =0,\ldots ,k} \eta _{\ell} \Bigr) \sum_{\ell =0}^{k} \eta _{\ell } \bigl\Vert \nabla L^{N}\bigl( \overrightarrow{W}(\ell )\bigr) \bigr\Vert _{F}^{2}, \end{aligned}$$
(3.19)

where we have used the fact that \(\overrightarrow{W}(0)\) has balancedness constant αδ by assumption and that

$$\begin{aligned} \bigl\Vert \nabla L^{N}\bigl(\overrightarrow{W}(k)\bigr) \bigr\Vert _{F}^{2} & \geq \max_{ \ell = 1,\ldots ,N} \bigl\Vert \nabla _{W_{\ell}} L^{N}\bigl(\overrightarrow{W}(k)\bigr) \bigr\Vert ^{2} \\ & \geq \frac{1}{2} \bigl( \bigl\Vert \nabla _{W_{j}} L^{N}\bigl(\overrightarrow{W}(k)\bigr) \bigr\Vert ^{2} + \bigl\Vert \nabla _{W_{j+1}} L^{N}\bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert ^{2} \bigr). \end{aligned}$$

Inequality (3.18) from the previous step gives

$$\begin{aligned} L^{N}\bigl(\overrightarrow{W}(0)\bigr) - L^{N}\bigl(\overrightarrow{W}(k+1)\bigr)& = \sum _{j=0}^{k} \bigl(L^{N}\bigl( \overrightarrow{W}(j)\bigr) - L^{N}\bigl(\overrightarrow{W}(j+1)\bigr) \bigr) \\ & \geq \sigma \sum_{j=0}^{k} \eta _{j} \bigl\Vert \nabla L^{N}\bigl( \overrightarrow{W}(j) \bigr) \bigr\Vert _{F}^{2}. \end{aligned}$$
(3.20)

Combining inequalities (3.19) and (3.20) yields

$$\begin{aligned} & \bigl\Vert W^{T}_{j+1}(k+1)W_{j+1}(k+1) -W_{j}(k+1)W_{j}^{T}(k+1) \bigr\Vert \\ &\quad \leq \alpha \delta + \frac{2}{\sigma} \Bigl(\max_{\ell =0,\ldots ,k} \eta _{\ell} \Bigr) \bigl(L^{N}\bigl(\overrightarrow{W}(0) \bigr) - L^{N}\bigl( \overrightarrow{W}(k+1)\bigr) \bigr) \\ &\quad \leq \alpha \delta + \frac{2}{\sigma} \Bigl(\max_{\ell =0,\ldots ,k} \eta _{\ell} \Bigr) L^{N}\bigl(\overrightarrow{W}(0)\bigr) \leq \alpha \delta + (1-\alpha ) \delta = \delta, \end{aligned}$$

where we have used Condition (3.14) on the stepsizes. This proves statement (1) for \(k+1\).

Step 3: For the proof of statement (3) for \(k+1\), we use the fact that we have already shown that (1) and (2) hold for \(k+1\). It follows from Proposition 3.2 and Lemma 3.1 that

$$ \bigl\Vert W_{j}(k+1) \bigr\Vert ^{2} \leq \bigl\Vert W(k+1) \bigr\Vert ^{\frac{2}{N}} + (N+1)^{2} \delta \leq \biggl( \frac{\sqrt{2L^{N}(\overrightarrow{W}(0))} + \Vert Y \Vert }{\sigma _{\min}(X)} \biggr)^{\frac{2}{N}} + (N+1)^{2} \delta = K_{\delta}. $$

This shows (3) for \(k+1\) and completes the proof of the proposition. □
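Before proceeding, let us note that the quantities controlled by Proposition 3.4 are easy to monitor numerically. The following sketch is included purely for illustration; it is not part of the proof and uses a small random instance with an ad hoc, conservatively small constant stepsize instead of the precise bound (3.14). It runs gradient descent on \(L^{N}(\overrightarrow{W}) = L^{1}(W_{N}\cdots W_{1})\) with \(L^{1}(W)=\frac{1}{2}\|WX-Y\|_{F}^{2}\), using the layer gradients \(\nabla _{W_{i}}L^{N}(\overrightarrow{W}) = W_{i+1}^{T}\cdots W_{N}^{T}\,\nabla L^{1}(W)\,W_{1}^{T}\cdots W_{i-1}^{T}\) appearing above, and tracks the loss, the balancedness deviation, and the squared layer norms from statements (1)–(4).

```python
import numpy as np

def loss(Ws, X, Y):
    """L^N(W) = 1/2 * ||W_N ... W_1 X - Y||_F^2."""
    P = X
    for Wj in Ws:                                  # Ws = [W_1, ..., W_N]
        P = Wj @ P
    return 0.5 * np.linalg.norm(P - Y, 'fro') ** 2

def layer_gradients(Ws, X, Y):
    """nabla_{W_i} L^N = (W_N ... W_{i+1})^T (W X - Y) X^T (W_{i-1} ... W_1)^T."""
    N = len(Ws)
    pre = [np.eye(Ws[0].shape[1])]                 # pre[i] = W_i ... W_1 (pre[0] = I)
    for Wj in Ws:
        pre.append(Wj @ pre[-1])
    suf = [np.eye(Ws[-1].shape[0])]                # will hold suf[i] = W_N ... W_{i+1}
    for Wj in reversed(Ws):
        suf.append(suf[-1] @ Wj)
    suf = suf[::-1]                                # suf[N] = I
    G = (pre[N] @ X - Y) @ X.T                     # = nabla L^1(W) with W = W_N ... W_1
    return [suf[i + 1].T @ G @ pre[i].T for i in range(N)]

def balancedness(Ws):
    return max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T, 2)
               for j in range(len(Ws) - 1))

# small random instance; dimensions and stepsize chosen only for illustration
rng = np.random.default_rng(0)
dims = [6, 3, 4, 5]                                # d_0, ..., d_N with N = 3
X = rng.standard_normal((dims[0], 20)) / np.sqrt(dims[0])
Y = rng.standard_normal((dims[-1], 20))
Ws = [0.2 * rng.standard_normal((dims[j + 1], dims[j])) for j in range(3)]

eta, losses = 1e-3, [loss(Ws, X, Y)]               # conservative constant stepsize
bal0 = balancedness(Ws)
for k in range(2000):
    grads = layer_gradients(Ws, X, Y)
    Ws = [Wj - eta * Gj for Wj, Gj in zip(Ws, grads)]
    losses.append(loss(Ws, X, Y))

print("monotone decrease:", all(a >= b for a, b in zip(losses, losses[1:])))
print("balancedness drift:", abs(balancedness(Ws) - bal0))
print("max ||W_j||^2:", max(np.linalg.norm(Wj, 2) ** 2 for Wj in Ws))
```

With such a conservative stepsize, one observes that the loss decreases monotonically, the balancedness deviation drifts only by an amount controlled by (3.19), and the squared layer norms remain bounded, in accordance with statements (1)–(4).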

3.3 Convergence of gradient descent to a critical point

To prove Theorem 2.4, we will use a result from [1], which is based on the following definition.

Definition 3.5

(Strong descent conditions [1])

We say that a sequence \(x_{k}\in \mathbb{R}^{n}\) satisfies the strong descent conditions (for a differentiable function \(f:\mathbb{R}^{n}\to \mathbb{R}\)) if

$$\begin{aligned} & f(x_{k})-f(x_{k+1}) \geq \sigma \bigl\Vert \nabla f(x_{k}) \bigr\Vert \Vert x_{k+1}-x_{k} \Vert \end{aligned}$$
(3.21)
$$\begin{aligned} &\quad\text{and } f(x_{k+1})=f(x_{k})\quad \implies\quad x_{k+1}=x_{k} \end{aligned}$$
(3.22)

hold for some \(\sigma >0\) and for all k larger than some K.
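For a finite run of iterates, these conditions can be checked directly. The following helper is only an illustrative sketch (it is not taken from [1]); it verifies (3.21) and (3.22) up to a numerical tolerance and the available horizon.

```python
import numpy as np

def satisfies_strong_descent(xs, f, grad_f, sigma, K=0, tol=1e-12):
    """Finite-horizon check of the strong descent conditions of Definition 3.5.

    xs: list of numpy arrays (the iterates), f and grad_f: callables, sigma > 0.
    """
    for k in range(K, len(xs) - 1):
        fk, fk1 = f(xs[k]), f(xs[k + 1])
        step = np.linalg.norm(xs[k + 1] - xs[k])
        # (3.21): decrease proportional to ||grad f(x_k)|| * ||x_{k+1} - x_k||
        if fk - fk1 < sigma * np.linalg.norm(grad_f(xs[k])) * step - tol:
            return False
        # (3.22): equal function values must force equal iterates
        if abs(fk1 - fk) <= tol and step > tol:
            return False
    return True
```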

The next theorem is essentially an extension of the Łojasiewicz theorem to discrete variants of gradient flows.

Theorem 3.6

[1, Theorem 3.2] Let \(f:\mathbb{R}^{n}\to \mathbb{R}\) be an analytic cost function. Let the sequence \(\{x_{k}\}_{k=1,2,\dots}\) satisfy the strong descent conditions (Definition 3.5). Then, either \(\lim_{k\to \infty}\|x_{k}\|=+\infty \), or there exists a single point \(x^{*}\in \mathbb{R}^{n}\) such that

$$ \lim_{k\to \infty}x_{k}=x^{*}. $$

Now, we are ready to prove Theorem 2.4.

Proof

By point (4) of Proposition 3.4 and since \(\overrightarrow{W}(k+1) - \overrightarrow{W}(k)=-\eta _{k}\nabla L^{N} (\overrightarrow{W}(k) )\) for all \(k\in \mathbb{N}_{0}\), we have

$$ L^{N} \bigl(\overrightarrow{W}(k) \bigr) -L^{N} \bigl( \overrightarrow{W}(k+1) \bigr)\geq \sigma \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F} \bigl\Vert \overrightarrow{W}(k+1) - \overrightarrow{W}(k) \bigr\Vert _{F}, $$
(3.23)

which means that the first part (3.21) of the strong descent conditions holds. This then implies that the second part (3.22) also holds: if \(L^{N} (\overrightarrow{W}(k+1) ) = L^{N} ( \overrightarrow{W}(k) )\), it follows that

$$ \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F} \bigl\Vert \overrightarrow{W}(k+1) - \overrightarrow{W}(k) \bigr\Vert _{F}=0, $$

hence \(\overrightarrow{W}(k+1) = \overrightarrow{W}(k)\) or \(\nabla L^{N} (\overrightarrow{W}(k) )=0\), but the latter again implies \(\overrightarrow{W}(k+1) = \overrightarrow{W}(k)\). Thus, indeed \(\overrightarrow{W}(k+1) = \overrightarrow{W}(k)\) if \(L^{N} (\overrightarrow{W}(k+1) ) = L^{N} ( \overrightarrow{W}(k) )\).

Since by Proposition 3.4, the sequence \((\overrightarrow{W}(k))_{k\in \mathbb{N}_{0}}\) is bounded and \(L^{N}\) is analytic, it follows from Theorem 3.6 that there exists \(\overrightarrow{W}^{*}\) such that

$$ \lim_{k\to \infty}\overrightarrow{W}(k)=\overrightarrow{W}^{*}. $$

It remains to show that \(\overrightarrow{W}^{*}\) is a critical point of \(L^{N}\). Since \(\nabla L^{N} (\overrightarrow{W} )\) is continuous in \(\overrightarrow{W}\), it follows that \(\nabla L^{N} (\overrightarrow{W}^{*} )=\lim_{k\to \infty} \nabla L^{N} (\overrightarrow{W}(k) )\) and that

$$ \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}^{*} \bigr) \bigr\Vert _{F}=\lim_{k \to \infty} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}=:c. $$

In order to show that \(\overrightarrow{W}^{*}\) is a critical point, it suffices to show that \(c=0\). A repeated application of point (4) of Proposition 3.4 gives

$$ L^{N} \bigl(\overrightarrow{W}(0) \bigr) -L^{N} \bigl( \overrightarrow{W}(k+1) \bigr)\geq \sigma \sum_{j=0}^{k} \eta _{j} \bigl\Vert \nabla L^{N} \bigl( \overrightarrow{W}(j) \bigr) \bigr\Vert _{F}^{2}\quad \text{for any } k \in \mathbb{N}, $$

hence, taking the limit,

$$ L^{N} \bigl(\overrightarrow{W}(0) \bigr)\geq \sigma \sum _{k=0}^{\infty} \eta _{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2}. $$

Assume now that \(c\neq 0\). Then, \(c>0\), and there exists \(k_{0}\in \mathbb{N}\) such that

$$ \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}\geq \frac{c}{2}\quad \forall k\geq k_{0}. $$

But then

$$ L^{N} \bigl(\overrightarrow{W}(0) \bigr)\geq \sigma \sum _{k=k_{0}}^{ \infty}\eta _{k} \bigl\Vert \nabla L^{N} \bigl(\overrightarrow{W}(k) \bigr) \bigr\Vert _{F}^{2} \geq \frac{c^{2}}{4}\sigma \sum _{k=k_{0}}^{\infty}\eta _{k}, $$

which, since \(\sigma > 0\), contradicts our assumption that \(\sum_{k=0}^{\infty}\eta _{k}=\infty \), as the right-hand side would be infinite. Thus, indeed \(c=0\), and \(\overrightarrow{W}^{*}\) is a critical point of \(L^{N}\). □

4 Convergence to a global minimum for almost all initializations

Let us now transfer [5, Theorem 6.12] to the setting of the gradient descent method by proving Theorem 2.6. Our proof is based on the following abstract theorem, which basically states that gradient descent schemes avoid strict saddle points for almost all initializations. The case of constant stepsizes (condition (1)) was shown in [20, Proposition 1], while the case of stepsizes converging to zero was proven in [23, Theorem 5.1]. We call a critical point \(z^{*}\) of a twice continuously differentiable function f a strict saddle point if the Hessian \(\nabla ^{2} f (z^{*})\) has at least one negative eigenvalue. Intuitively, this means that there is a direction (indicated by an eigenvector corresponding to a negative eigenvalue) in which the function decreases quadratically. Such decay is fast enough to push almost all trajectories away from the saddle point (and towards such directions of decrease). This intuition is made rigorous in the following theorem.

Theorem 4.1

Let \(f: \mathbb{R}^{p} \to \mathbb{R}\) be a twice continuously differentiable function and consider the gradient descent scheme

$$ z(k+1) = z(k) - \eta _{k} \nabla f\bigl(z(k)\bigr), $$

where \((\eta _{k})\) satisfies one of the following conditions.

  (1) The sequence \((\eta _{k})\) is constant, i.e., \(\eta _{k} = \eta \) for some \(\eta > 0\) and all \(k \in \mathbb{N}\).

  (2) It holds that

  $$ \eta _{k} \geq C \frac{1}{k} \quad\textit{for some } C > 0 \quad\textit{and}\quad \lim_{k \to \infty} \eta _{k} = 0. $$

Then, the set of initializations \(z(0) \in \mathbb{R}^{p}\), such that \((z(k))_{k}\) converges to a strict saddle point of f, has measure zero.
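As a toy illustration of this statement (our own example, not taken from [20] or [23]), consider \(f(x,y)=\frac{1}{4}x^{4}-\frac{1}{2}x^{2}+\frac{1}{2}y^{2}\), which has a strict saddle point at the origin, where the Hessian is \(\operatorname{diag}(-1,1)\), and global minima at \((\pm 1,0)\). Running the scheme with a constant stepsize from random initializations, the iterates essentially never end up at the saddle point:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f(z):
    """f(x, y) = x**4/4 - x**2/2 + y**2/2: strict saddle at (0, 0), minima at (+-1, 0)."""
    x, y = z
    return np.array([x ** 3 - x, y])

eta = 0.1                                   # constant stepsize, case (1) of the theorem
saddle_hits = 0
for _ in range(1000):
    z = rng.uniform(-2.0, 2.0, size=2)      # random initialization
    for _ in range(500):
        z = z - eta * grad_f(z)
    if np.linalg.norm(z) < 1e-3:            # did we end up near the strict saddle?
        saddle_hits += 1
print(saddle_hits)                           # 0 in practice: the saddle is avoided
```

Only initializations with \(x(0)=0\), a set of measure zero, can converge to the saddle point.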

Now, we are ready to prove Theorem 2.6 by exploiting the analysis of the strict saddle points of \(L^{N}\) that has been performed in [5], extending [18, 26].

Proof

Due to definitions (2.12), (2.13) of the constants \(\delta _{\mathcal{B}}\), \(L_{\mathcal{B}}\), and \(M_{\mathcal{B}}\) together with condition (2.14) on the stepsizes \(\eta _{k}\), the conditions of Theorem 2.4 are satisfied for each initialization \(\overrightarrow{W}(0) \in \mathcal{B}\). Hence, \(\overrightarrow{W}(k)\) converges to a critical point of \(L^{N}\) for all \(\overrightarrow{W}(0) \in \mathcal{B}\). By Theorem 4.1, the convergence of gradient descent with initial values in \(\mathcal {B}\) and with stepsizes \(\eta _{k}\) to a strict saddle point occurs only for a subset of \(\mathcal{B}\) that has measure zero.

The rest of the proof is the same as the corresponding reasoning in the proof of [5, Theorem 6.12]. Let us repeat only the main aspects from [5]. Recall that \(q = \operatorname{rank}(Q)\) (cf. (2.6)), \(r = \min_{j=0,\ldots ,N} d_{j}\) and denote by \(\overrightarrow{W}=(W_{1},\ldots ,W_{N})\) the limit of \(\overrightarrow{W}(l)\), \(W= W_{N}\cdots W_{1}\) and \(k = \operatorname{rank}(W)\). Then, \(k \leq r\), and W is a critical point of \(L^{1}\) restricted to the manifold \(\mathcal{M}_{k}\) of rank-\(k\) matrices [5, Proposition 6.8(a)]. Moreover, [5, Proposition 6.6(1)] implies that \(k \leq q\). If W is not a global minimizer of \(L^{1}\) restricted to \(\mathcal{M}_{k}\), then \(\overrightarrow{W}\) is a strict saddle point of \(L^{N}\) by [5, Proposition 6.9]. As argued above, the set of initializations converging to such a point has measure zero, showing part (a). (Note that for \(N \geq 3\) and \(k < \min \{r,q\}\) a global minimizer of \(L^{1}\) restricted to \(\mathcal{M}_{k}\) may correspond to a non-strict saddle point \(\overrightarrow{W}\) of \(L^{N}\), see [5, Proposition 6.10].) If \(N = 2\), then by [5, Proposition 6.11] any critical point \(\overrightarrow{W}=(W_{1},W_{2})\) of \(L^{2}\) such that \(W=W_{2} W_{1}\) is a global minimum of \(L^{1}\) restricted to \(\mathcal{M}_{k}\) for some \(k<\bar{r}\) is a strict saddle point of \(L^{2}\), which shows part (b) of the theorem. □

5 Numerical experiments

In this section, we illustrate our theoretical results with numerical experiments. In particular, we test convergence of gradient descent for various choices of constant and decreasing stepsizes and with \(N=2\), \(N=3\) and \(N=5\) layers.

The sample size is chosen as \(m=3\cdot d\) with \(d=70\). For our experiments, we generate the dataset \(X\in \mathbb{R}^{d_{x}\times m}\) randomly with entries drawn from a mean-zero Gaussian distribution with variance \(\sigma ^{2}=1/d\), where \(d_{x}=d\). The data matrix \(Y \in \mathbb{R}^{d \times m}\) is a random matrix of rank \(r=2\), generated as described below. We initialize the weight matrices \(W_{j} \in \mathbb{R}^{d_{j} \times d_{j-1}}\) in the following way such that \(\overrightarrow{W}(0)=(W_{1},\ldots ,W_{N})\) is balanced, i.e., has balancedness constant 0, so that \(\alpha = 0\) in Theorems 2.4 and 2.6. The rank parameter is chosen as \(r=2\) and the dimensions \(d_{j}\) as

$$ d_{0} = d,\qquad d_{1} = r,\qquad d_{j}= \operatorname{round} \biggl(r+(j-1) \frac{d-r}{N-1} \biggr),\quad j=2,\dots,N, $$

where \(\operatorname{round}(z)\) rounds a real number z to the nearest integer. We randomly generate orthogonal matrices \(U_{1} \in \mathbb{R}^{d \times d}\), \(V_{j} \in \mathbb{R}^{d_{j} \times d_{j}}\), \(j=1,\ldots ,N\), according to the uniform distribution on the corresponding orthogonal groups, and we let \(U_{j} \in \mathbb{R}^{d_{j-1} \times d_{1}}\), \(j=2,\ldots ,N\), be the matrix composed of the first \(d_{1}\) columns of \(V_{j-1}\). We then set

$$ W_{j} = V_{j} I_{d_{j},d_{1}} U_{j}^{T}, $$

where for any \(n_{1},n_{2}\in \mathbb{N}\) the matrix \(I_{n_{1},n_{2}} \in \mathbb{R}^{n_{1}\times n_{2}}\) is a rectangular diagonal matrix with ones on the diagonal. By orthogonality and construction of \(U_{j+1}\), it follows that for all \(j=1,\ldots ,N-1\), we have

$$\begin{aligned} W_{j+1}^{T} W_{j+1} &= U_{j+1} I_{d_{1},d_{j+1}} V_{j+1}^{T} V_{j+1} I_{d_{j+1},d_{1}} U_{j+1}^{T} = U_{j+1} U_{j+1}^{T} = V_{j} I_{d_{j},d_{1}} U_{j}^{T} U_{j} I_{d_{1},d_{j}} V_{j}^{T} \\ &= W_{j} W_{j}^{T} \end{aligned}$$

so that the tuple \((W_{1},\ldots ,W_{N})\) is balanced. The random matrix \(Y \in \mathbb{R}^{d \times m}\) of rank 2 is generated as \(Y = \widetilde{W_{N}} \cdots \widetilde{W_{1}}X\) with matrices \(\widetilde{W_{j}}\) generated in the same way as the matrices \(W_{j}\). We decided to choose a matrix Y of rank 2 so that the global minimizer of \(L^{1}\) is also of rank 2 and convergence to it means that \(L^{N}\) converges to zero, which is simple to check.
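The construction above can be transcribed into NumPy as follows. This sketch is included for illustration only and is not the exact code behind the figures; random orthogonal matrices are drawn via QR factorizations of Gaussian matrices, and for \(j=1\) we read \(U_{1}\) through its first \(d_{1}\) columns so that the matrix products have matching dimensions. For instance, with \(d=70\), \(r=2\) and \(N=5\), the dimension formula above yields \((d_{0},\dots ,d_{5})=(70,2,19,36,53,70)\).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # orthogonal factor from the QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def balanced_init(dims):
    """W_j = V_j I_{d_j, d_1} U_j^T, with U_j the first d_1 columns of V_{j-1} for j >= 2
    and U_1 read as the first d_1 columns of a random orthogonal d_0 x d_0 matrix."""
    d1 = dims[1]
    V = [random_orthogonal(dj) for dj in dims[1:]]           # V_1, ..., V_N
    U = [random_orthogonal(dims[0])[:, :d1]]                 # U_1 (restricted, our reading)
    U += [V[j][:, :d1] for j in range(len(V) - 1)]           # U_{j+1} = first d_1 cols of V_j
    return [V[j][:, :d1] @ U[j].T for j in range(len(V))]    # V_j I_{d_j, d_1} U_j^T

d, r, N, m = 70, 2, 3, 3 * 70
dims = [d, r] + [round(r + (j - 1) * (d - r) / (N - 1)) for j in range(2, N + 1)]
X = rng.standard_normal((d, m)) / np.sqrt(d)                 # entries ~ N(0, 1/d)
Ws = balanced_init(dims)                                     # initialization W(0)
Wt = balanced_init(dims)                                     # independent tuple for the target
Y = np.linalg.multi_dot(Wt[::-1] + [X])                      # Y = W~_N ... W~_1 X has rank 2

# the balancedness constant of W(0) is zero up to machine precision
print(max(np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T, 2)
          for j in range(N - 1)))
```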

In our first set of experiments, we use a constant stepsize, i.e., \(\eta _{k} = \eta \). Using \(\alpha = 0\), the sufficient condition in Theorem 2.4 reads

$$ \eta \leq \frac{2 \delta}{4 L^{N}(\overrightarrow{W}(0)) + \delta B_{\delta}}, $$
(5.1)

with \(B_{\delta}\) in (2.9). We choose

$$ \delta = \frac{M^{\frac{2}{N}}}{N^{3}}. $$

This slightly differs from the choice of δ suggested by Remark 2.5(b), but corresponds to the choice of δ that we would obtain at this point using the bound given by Remark 3.3 (instead of Proposition 3.2), which allows us to set \(K_{\delta}= M^{\frac{2}{N}}+N^{2}\delta \) (instead of \(K_{\delta}= M^{\frac{2}{N}}+(N+1)^{2}\delta \)) in our results.
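All quantities entering this bound are computable from the data and the initialization: M is determined by \(\sqrt{2L^{N}(\overrightarrow{W}(0))}=M\sigma _{\min}(X)-\|Y\|\), \(K_{\delta}=M^{\frac{2}{N}}+N^{2}\delta \) (using Remark 3.3), and \(B_{\delta}=2eNK_{\delta}^{N-1}\|X\|^{2}+\sqrt{e}NK_{\delta}^{\frac{N}{2}-1}\|XY^{T}\|\) as derived in the proof of Proposition 3.4. The following sketch transcribes these formulas into NumPy; it is meant only as an illustration and not as the exact code used for the experiments.

```python
import numpy as np

def constant_stepsize(Ws, X, Y):
    """eta <= 2*delta / (4*L^N(W(0)) + delta*B_delta), cf. (5.1), with
    delta = M**(2/N) / N**3 and K_delta = M**(2/N) + N**2 * delta (Remark 3.3)."""
    N = len(Ws)
    W = np.linalg.multi_dot(Ws[::-1]) if N > 1 else Ws[0]    # W = W_N ... W_1
    L0 = 0.5 * np.linalg.norm(W @ X - Y, 'fro') ** 2         # L^N(W(0))
    sigma_min = np.linalg.svd(X, compute_uv=False).min()
    M = (np.sqrt(2 * L0) + np.linalg.norm(Y, 2)) / sigma_min
    delta = M ** (2 / N) / N ** 3
    K = M ** (2 / N) + N ** 2 * delta
    normX = np.linalg.norm(X, 2)
    B = (2 * np.e * N * K ** (N - 1) * normX ** 2
         + np.sqrt(np.e) * N * K ** (N / 2 - 1) * np.linalg.norm(X @ Y.T, 2))
    return 2 * delta / (4 * L0 + delta * B)

# eta = constant_stepsize(Ws, X, Y)   # with Ws, X, Y as generated in the sketch above
```

The stepsize values reported below for plot 1a correspond to the upper bound in (5.1) evaluated for the particular random instance used there.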

In Fig. 1, \(L^{N}(\overrightarrow{W}(k))\) is plotted versus the iteration number. For plot 1a, the stepsize is chosen to exactly meet the upper bound in (5.1) (with \(\delta =M^{2/N}/N^{3}\)), resulting for this experiment in the values \(\eta =7.73\cdot 10^{-4}\), \(\eta = 1.29\cdot 10^{-4}\) and \(\eta = 3.91\cdot 10^{-5}\) for depths 2, 3 and 5, respectively. For plot 1b, the stepsize η is chosen somewhat smaller than the upper bound in (5.1), while for plots 1c and 1d the bound (5.1) is not satisfied. Since we observe convergence in plot 1c, this suggests that the bound of Theorem 2.4 may not be entirely sharp. However, increasing the stepsize beyond a certain value leads to divergence as suggested by plot 1d, so that some bound on the stepsize is necessary (see also [8, Lemma A.1] for a necessary condition in a special case).

Figure 1: Progress of loss function \(L^{N}\) for training linear networks via gradient descent for various values of the constant stepsize

Figure 2: Gradient descent with decreasing stepsizes \(\eta _{k}\) as in (5.2)

In our second set of experiments, we use a sequence of stepsizes \(\eta _{k}\) that converges to zero at various speeds. For some decay rate \(\gamma \geq 0\) and some constants \(a_{1}, a_{2}\), we set

$$\begin{aligned} \eta _{k}=\min \biggl\{ a_{1}, \frac{a_{2}}{(k+1)^{\gamma}} \biggr\} \quad \text{for all } k \in \mathbb{N}. \end{aligned}$$
(5.2)

The upper bound of Theorem 2.4 is satisfied for (see also the beginning of the proof of Proposition 3.4)

$$\begin{aligned} a_{1}= a_{2} = \frac{2(1-\sigma )}{B_{\delta}}, \qquad\sigma = \frac{4 L^{N}(\overrightarrow{W}(0))}{4 L^{N}(\overrightarrow{W}(0)) + \delta B_{\delta}}. \end{aligned}$$
(5.3)

Again, we choose \(\delta =\frac{1}{N^{3}} M^{\frac{2}{N}}\), which corresponds to the choice of δ obtained from the bound of Remark 3.3 when working with these values of \(a_{1}\) and \(a_{2}\).
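For reference, the schedule (5.2) with the constants (5.3) can be generated as follows. This is a small illustrative sketch; the numerical arguments in the example call are placeholders standing in for \(L^{N}(\overrightarrow{W}(0))\), δ and \(B_{\delta}\) as computed from the data and the initialization.

```python
def stepsize_schedule(L0, delta, B_delta, gamma, k_max):
    """eta_k = min(a_1, a_2 / (k + 1)**gamma) with a_1 = a_2 = 2(1 - sigma)/B_delta
    and sigma = 4*L0 / (4*L0 + delta*B_delta), cf. (5.2) and (5.3)."""
    sigma = 4 * L0 / (4 * L0 + delta * B_delta)
    a1 = a2 = 2 * (1 - sigma) / B_delta
    return [min(a1, a2 / (k + 1) ** gamma) for k in range(k_max)]

# example call with placeholder values for L0, delta and B_delta (gamma = 0.2 as in plot 2a)
etas = stepsize_schedule(L0=1.0, delta=0.05, B_delta=500.0, gamma=0.2, k_max=10_000)
```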

The plots in Fig. 2 illustrate the convergence behavior for various choices of the constants \(a_{1}\), \(a_{2}\) and the decay rate γ in (5.2), for \(N=2,3,5\). Plots 2a and 2b show convergence for the choices of \(a_{1}, a_{2}\) in (5.3) and for \(\gamma = 0.2\) and \(\gamma = 0.4\), respectively, leading to stepsizes satisfying the condition of Theorem 2.4. In these experiments, the resulting values of \(a_{1}=a_{2}\) are \(a_{1} = 7.73 \cdot 10^{-4}\) for \(N=2\), \(a_{1} = 1.29 \cdot 10^{-4}\) for \(N=3\) and \(a_{1} = 3.91 \cdot 10^{-5}\) for \(N=5\). Comparing the two plots with each other, as well as with the plots for constant stepsizes in Fig. 1, shows that a faster decay of the stepsize leads to slower convergence of gradient descent, as expected. Larger values of γ are also possible but further slow down convergence, so we omit the corresponding experiments here.

Plot 2c shows convergence for a decay rate of \(\gamma = 0.2\) even though the constants \(a_{1}\) and \(a_{2}\) are such that \(\eta _{k}\) does not satisfy the bound of Theorem 2.4 for all k, while further increasing the value of \(a_{2}\) leads to divergence as illustrated in plot 2d.

6 Conclusion

In this article, we analysed convergence properties of GD for learning linear neural networks. We established the boundedness of the GD iterates and proved their convergence to a critical point of the square loss under suitable conditions on the stepsizes. We then extended the results of [5] on convergence towards a global minimum from gradient flow to gradient descent. Our work provides precise conditions that ensure convergence for both constant and decreasing stepsizes. Moreover, the maximal stepsize allowed by our bounds does not vanish exponentially in the number of layers, and we showed numerically that violating the stepsize bound may result in divergence. We believe that our findings will contribute to the analysis of nonlinear neural networks. Extending the insights of this study from gradient descent to stochastic gradient descent is reserved for future work.

Data availability

Not applicable.

References

  1. Absil, P.-A., Mahony, R., Andrews, B.: Convergence of the iterates of descent methods for analytic cost functions. SIAM J. Optim. 16(2), 531–547 (2005)

  2. Arora, S., Cohen, N., Golowich, N., Hu, W.: A convergence analysis of gradient descent for deep linear neural networks. In: International Conference on Learning Representations (2019)

  3. Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization. In: International Conference on Machine Learning (2018)

  4. Arora, S., Cohen, N., Hu, W., Luo, Y.: Implicit regularization in deep matrix factorization. In: Advances in Neural Information Processing Systems, pp. 7413–7424 (2019)

  5. Bah, B., Rauhut, H., Terstiege, U., Westdickenberg, M.: Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers. Inf. Inference 11(1), 307–353 (2022)

  6. Bartlett, P., Helmbold, D., Long, P.: Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In: International Conference on Machine Learning, pp. 521–530. PMLR (2018)

  7. Chitour, Y., Liao, Z., Couillet, R.: A geometric approach of gradient descent algorithms in neural networks (2018). Preprint. arXiv:1811.03568

  8. Chou, H., Gieshoff, C., Maly, J., Rauhut, H.: Gradient descent for deep matrix factorization: dynamics and implicit bias towards low rank (2020). Preprint. arXiv:2011.13772

  9. Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. 20(1), 119–154 (2020)

  10. Du, S., Hu, W.: Width provably matters in optimization for deep linear neural networks. In: International Conference on Machine Learning, pp. 1655–1664. PMLR (2019)

  11. Du, S.S., Hu, W., Lee, J.D.: Algorithmic regularization in learning deep homogeneous models: layers are automatically balanced. In: ICML 2018 Workshop on Nonconvex Optimization (2018)

  12. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations (2019)

  13. Elkabetz, O., Cohen, N.: Continuous vs. discrete optimization of deep neural networks. In: Thirty-Fifth Conference on Neural Information Processing Systems (2021)

  14. Geyer, K., Kyrillidis, A., Kalev, A.: Low-rank regularization and solution uniqueness in over-parameterized matrix sensing. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 930–940 (2020)

  15. Gunasekar, S., Woodworth, B.E., Bhojanapalli, S., Neyshabur, B., Srebro, N.: Implicit regularization in matrix factorization. In: Advances in Neural Information Processing Systems, pp. 6151–6159 (2017)

  16. Hu, W., Xiao, L., Pennington, J.: Provable benefit of orthogonal initialization in optimizing deep linear networks (2020). Preprint. arXiv:2001.05992

  17. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, Berlin (2016)

  18. Kawaguchi, K.: Deep learning without poor local minima. Adv. Neural Inf. Process. Syst. 29, 586–594 (2016)

  19. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. In: International Conference on Learning Representations (2017)

  20. Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Math. Program. 176(1), 311–337 (2019)

  21. Mertikopoulos, P., Hallak, N., Kavis, A., Cevher, V.: On the almost sure convergence of stochastic gradient descent in non-convex problems. Adv. Neural Inf. Process. Syst. 33, 1117–1128 (2020)

  22. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2014)

  23. Panageas, I., Piliouras, G., Wang, X.: First-order methods almost always avoid saddle points: the case of vanishing step-sizes. In: Conference on Neural Information Processing Systems (2019)

  24. Razin, N., Cohen, N.: Implicit regularization in deep learning may not be explainable by norms. In: Conference on Neural Information Processing Systems (2020)

  25. Shamir, O.: Exponential convergence time of gradient descent for one-dimensional deep linear neural networks. In: Conference on Learning Theory, pp. 2691–2713. PMLR (2019)

  26. Trager, M., Kohn, K., Bruna, J.: Pure and spurious critical points: a geometric study of linear networks. In: International Conference on Learning Representations (2020)

  27. Wu, L., Wang, Q., Ma, C.: Global convergence of gradient descent for deep linear residual networks (2019). Preprint. arXiv:1911.00645

  28. Yun, C., Krishnan, S., Mobahi, H.: A unifying view on implicit bias in training linear neural networks. In: International Conference on Learning Representations (2021)

  29. Yun, C., Sra, S., Jadbabaie, A.: Global optimality conditions for deep neural networks. In: International Conference on Learning Representations (2018)

  30. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017)

  31. Zou, D., Long, P.M., Gu, Q.: On the global convergence of training deep linear resnets. In: International Conference on Learning Representations (2020)


Acknowledgements

All authors acknowledge funding by DAAD (German Foreign Exchange Service) through the project Understanding stochastic gradient descent in deep learning (grant no: 57417829).

Funding

This work was supported by DAAD (German Foreign Exchange Service) through the project Understanding stochastic gradient descent in deep learning (grant no: 57417829).

Author information


Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gabin Maxime Nguegnang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Nguegnang, G.M., Rauhut, H. & Terstiege, U. Convergence of gradient descent for learning linear neural networks. Adv Cont Discr Mod 2024, 23 (2024). https://doi.org/10.1186/s13662-023-03797-x

