Universal approximation property of a continuous neural network based on a nonlinear diffusion equation
Advances in Continuous and Discrete Models volume 2023, Article number: 43 (2023)
Abstract
Recently, differential equation-based neural networks have been actively studied. This paper discusses the universal approximation property of a neural network that is based on a nonlinear partial differential equation (PDE) of the parabolic type.
Based on the assumption that the activation function is non-polynomial and Lipschitz continuous, and applying the theory of the difference method, we show that an arbitrary continuous function on any compact set can be approximated using the output of the network with arbitrary precision. Additionally, we present an estimate of the order of accuracy with respect to \(\Delta t\) and \(\Delta x\).
1 Introduction
Recently, neural networks have been applied in numerous fields, both in social and natural sciences. However, their performance remains a topic of active research. Since Rosenblatt’s work [60], neural networks have been studied extensively. In fact, the set of functions realized by neural network models has been under discussion for some time.
Surprisingly, the transform mapping theorem, which is similar to the universal approximation property, was derived in early research by Kolmogorov [41], and its simplified proof was provided by Sprecher [70]. However, the neural networks they considered differed slightly from conventional implementations. Later, in the 1980s, several studies were conducted on the universal approximation property of neural networks. On the one hand, these results greatly encouraged and facilitated research on neural networks. On the other hand, they found the universal approximation property of neural network models to be closely related to (almost simultaneous) controllability in the theory of optimal control. However, there are some differences between the two. When discussing the universal approximation property of a neural network, these works typically include the effect of the output layer, whose activation function may differ from that of the hidden layer. Arguments concerning these areas are introduced and discussed in detail in the next section. Moreover, some recent studies have considered neural networks from the perspective of optimal transport [67, 68, 78]. These arguments have led to the application of dynamical system theory to neural networks.
For example, E [83] regarded a neural network as a method of estimating the parameters of a dynamical system. In particular, he formulated ResNet [29] as an Euler scheme for an ordinary differential equation (ODE) and discussed its stability in the forward direction. He deduced certain conditions under which forward propagation operates stably, in the sense that gradient explosion and vanishing problems do not occur when the eigenvalues of the system are considered. Additionally, he highlighted a close relationship (or even equivalence) between the adjoint equation and backpropagation and introduced a regularization method.
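The correspondence can be sketched in a few lines: a residual block \(x \mapsto x + \Delta t\, f(x,\theta )\) is exactly one explicit Euler step for \(dx/dt = f(x,\theta (t))\). The sketch below is purely illustrative (the layer map `f` and its tanh activation are our choices here, not a formulation taken from [83]):

```python
import numpy as np

def f(x, theta):
    """Layer map; tanh is chosen as an illustrative activation."""
    return np.tanh(theta @ x)

def resnet_forward(x0, thetas, dt):
    """Residual network viewed as the explicit Euler scheme
    x_{n+1} = x_n + dt * f(x_n, theta_n) for dx/dt = f(x, theta(t))."""
    x = x0
    for theta in thetas:
        x = x + dt * f(x, theta)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
thetas = [0.1 * rng.standard_normal((4, 4)) for _ in range(10)]
out = resnet_forward(x0, thetas, dt=0.1)
```

Letting the number of layers grow while \(\Delta t \rightarrow 0\) recovers the continuous-time dynamical system.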
This dynamical systems-based approach toward neural networks became more popular after a study by Chen et al., which provided a framework for representing a neural network with an ODE solver. This framework was referred to as the neural ODE [10].
Thereafter, neural ODEs began to be widely used and implemented [10].
Meanwhile, some methods have been proposed based on ODEs and partial differential equations (PDEs) [28, 31]. Han and Li [28] formulated a neural network using an ODE, considering a cost function optimized using the Hamilton–Jacobi–Bellman (HJB) equation. In our previous study [31], we proposed a framework for a neural network in which we considered the initial-boundary value problem for PDEs.
A maximum principle-based approach was also provided in [44]. Notably, some recent works have actively discussed the application of differential equations to graph neural networks (GNNs) (see, for instance, [9]), along with the “expressive power” and “stability” of GNNs. Oono and Suzuki [56] showed that the expressive power of a GNN decreases when it has an excessively large number of layers. They also proposed a concept called “oversmoothing,” in which the feature vectors of all nodes tend to reach an equivalent state. This has driven ongoing research on the diffusion process of GNN models [9], which is related to the topic of the present work. From this perspective, the authors of [9] worked on the application of a range of differential equations that are popular in classical physics; see, for instance, [62]. Of note, they also considered PDEs with a diffusion term, as used in works on image processing [40, 58]. The results of these studies motivate us to consider a PDE with a diffusion term here. This study is also motivated by ongoing research on optimal control theory, especially work on the ensemble controllability of stochastic processes in terms of the Fokker–Planck equation [2]. Although the drift term differs slightly from that used here, this highlights the necessity of the control of diffusion PDEs. Insights obtained in the machine learning literature might be helpful in this regard. Along these lines, we consider neural networks based on PDEs with a diffusion term in our study. Our motivations are twofold.

(i)
Although neural ODEs perform well, their essential difference from classical neural networks is that the width of each layer does not change. This limitation can be overcome by PDE-based neural networks, which also allow the infinite limit of the width of the network. Because we aim to approximate a neural network with a continuous dynamical system, PDE-based neural networks appear more natural for this purpose.

(ii)
Similar to the case of ODE-based control, some fruitful theories have also been provided on PDE-based control (or distributed control). A sophisticated theoretical framework has been developed in the considerable literature on diffusion equations. We can also understand the increasing freedom of such models by considering a range of forms and values of boundary conditions.
However, some uncertainties remain regarding the performance of these continuous neural networks. For example, the universal approximation property is an important aspect that all neural networks must exhibit.
Although various types of neural networks based on dynamical systems have been developed, some scope for further exploration remains in terms of their universal approximation property based on a PDE, particularly with a diffusion term.
In this paper, we first introduce the formulation of a PDE-based neural network and then show that it is well-defined under some natural setup conditions. Next, we prove the existence of a temporally global solution to the model. We also posit the existence of a vanishing diffusion limit. Finally, we show that our model possesses a universal approximation property with respect to the maximum norm.
The remainder of this paper is organized as follows. In Sect. 2, we define some notations that we use throughout this paper. In Sect. 3, we formulate the research problem and introduce some existence theorems. In Sect. 4, we present our main result, which is proven in Sect. 7. In Sect. 5, we compare our results to those of related works, referring to the history of arguments on the universal approximation property of neural networks. We also confirm the main contributions of the present work, and clarify our key theoretical and practical insights. Section 6 provides some preliminary statements to support the main results presented in Sect. 7. In Sect. 8, we discuss the learnability of our model as well as its performance based on some numerical experiments. Finally, our conclusions and some possible avenues for future research are presented in the final section.
2 Notations
In this section, we introduce some notations used for general analysis. First, let us define \(I=(0,1)\) and \(\partial I = \{0\} \cup \{1\}\). Let \({\mathcal {G}}\) denote an arbitrary region in \({\mathbb{R}}\). We denote the closure of \({\mathcal {G}}\) as \(\overline{\mathcal {G}}\).
Hereafter, \(C({\mathcal {G}})\) denotes a set of continuous functions on \({\mathcal {G}}\). For \(r \in {\mathbb{N}}\), a set of functions that are r-times continuously differentiable on \({\mathbb{R}}\) is denoted as \(C^{r}({\mathbb{R}})\). A set of infinitely differentiable functions with a compact support in \({\mathcal {G}}\) is denoted as \(C_{0}^{\infty}({\mathcal {G}})\). A set of Lipschitz continuous functions on \({\mathbb{R}}\) is denoted as \(C^{L}({\mathbb{R}})\). For \(d \in {\mathbb{N}}\), we often denote a vector \(\vec{u}=(u_{1},u_{2},\dots ,u_{d}) \in {\mathbb{R}}^{d}\) as \([u_{j}]_{j}\). For two vectors u⃗ and \(\vec{v} \in {\mathbb{R}}^{p}\) in general, we denote their inner product as \(\vec{u}\cdot \vec{v}\). For a vector space X and an element \(\vec{v} \in X\), we denote a set spanned by v⃗ as \(\operatorname{Span}\langle \vec{v} \rangle \).
Let \(\|\cdot \|_{L_{p}({\mathcal {G}})}\) denote the usual \(L_{p}\) norm with \(1 \leq p\leq +\infty \) on \({\mathcal {G}}\); i.e., for a function f in general, we define
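For \(1 \leq p < +\infty \), and with the essential supremum for \(p=+\infty \), this is the standard definition:

```latex
\|f\|_{L_{p}({\mathcal{G}})}
= \biggl( \int_{{\mathcal{G}}} \bigl|f(x)\bigr|^{p} \,dx \biggr)^{1/p},
\qquad
\|f\|_{L_{\infty}({\mathcal{G}})}
= \operatorname*{ess\,sup}_{x \in {\mathcal{G}}} \bigl|f(x)\bigr| .
```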
We use a notation \((\cdot ,\cdot )_{\mathcal {G}}\) to denote the inner product in \(L_{2}({\mathcal {G}})\) space:
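namely, the standard real \(L_{2}\) pairing:

```latex
(f,g)_{{\mathcal{G}}} = \int_{{\mathcal{G}}} f(x)\, g(x) \,dx .
```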
In particular, when the region is clear, we simply denote it as \((\cdot ,\cdot )\). The norm in \(L_{2}({\mathcal {G}})\) is often denoted as \(\|\cdot \|\). We also use this notation to denote the norm in the Euclidean space, where a step function is regarded as a simple function in the \(L_{2}\) space.
For \(r \in {\mathbb{N}}\), we define Sobolev spaces \(H^{r}({\mathcal {G}})\), which are the spaces of functions \(f(x)\), \(x\in {\mathcal {G}}\), equipped with the norm \(\|f\|_{H^{r}({\mathcal {G}})}^{2} \equiv \sum_{ |\alpha | \leq r} \|D^{\alpha}f \|_{L_{2}({\mathcal {G}})}^{2}\). \(H^{-r}({\mathcal {G}})\) (\(r>0\)) is defined as the dual space of \(H_{0}^{r}({\mathcal {G}})\), which is the closure of \(C_{0}^{\infty}({\mathcal {G}})\) with respect to the norm of \(H^{r}({\mathcal {G}})\) (see [49], §11.1 and §12.1).
For a Banach space \({\mathcal {B}}\) with the norm \(\|\cdot \|_{\mathcal {B}}\), we denote the space of \({\mathcal {B}}\)-valued measurable functions f on the interval \((a,b)\) by \(L_{p}((a,b);{\mathcal {B}})\), the norm of which is defined by
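that is, the usual Bochner norm (with the essential supremum for \(p=+\infty \)):

```latex
\|f\|_{L_{p}((a,b);{\mathcal{B}})}
= \biggl( \int_{a}^{b} \bigl\|f(t)\bigr\|_{{\mathcal{B}}}^{p} \,dt \biggr)^{1/p},
\quad 1 \le p < +\infty .
```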
Similarly, we often use notations like \(C([a,b];{\mathcal {B}})\) to denote sets of \({\mathcal {B}}\)-valued functions that are continuous with respect to time on the interval specified in the brackets. We also denote the adjoint space of \({\mathcal {B}}\) by \({\mathcal {B}}^{\prime}\). For a Hilbert space H and its linear subspace \(M \subset H\) in general, we denote the orthogonal complement of M as \(M^{\perp}\). When the inner product of two elements \(v_{1}\) and \(v_{2}\) in H vanishes, we use the notation \(v_{1} \perp v_{2}\). Hereafter, we shall use the notations below:
where \(Z(t,x,y)\) denotes the fundamental solution to the initial boundary value problem of the heat equation with the vanishing Dirichlet condition. Given \(T>0\), we use the notation \({\mathcal {H}}_{T} \equiv (0,T)\times I\times I\). The other notations used in this paper are summarized in Table 4 in Appendix A.
3 Formulation: differential equation-based neural networks
Here, we formulate the continuous limit of a multilayer neural network [32]. Because in supervised learning the input takes the form of a vector in a Euclidean space \({\mathbb{R}}^{J}\), we represent it as a simple function on the unit interval by partitioning the interval into J subintervals. Given \(T>0\), we formulate the continuous version of a neural network as follows:
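In view of the roles of ν, φ, and \(w_{1}\) described below, the evolution equation (3.1) takes a form along the following lines (a sketch only: we assume here that the middle-layer weight \(w_{1}\) acts through a nonlocal linear term fed into the activation):

```latex
\frac{\partial u}{\partial t}(t,x)
= \nu \frac{\partial ^{2} u}{\partial x^{2}}(t,x)
+ \phi \biggl( \int _{I} w_{1}(t,x,y)\, u(t,y) \,dy \biggr),
\quad (t,x) \in (0,T) \times I .
```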
where \(\nu >0\), \(\phi (\cdot )\) denotes the activation function, \(T>0\) corresponds to the depth of a classical neural network, and \(w_{1}\) and \(w_{0}\) are the weight parameters at the middle and output layers, respectively. Additionally, we impose the initial and boundary value conditions as follows:
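Consistent with the transformation \(v \equiv u-1\) used below, the data in (3.2) can be sketched as an initial condition together with Dirichlet boundary values equal to one:

```latex
u(0,x) = u_{0}(x) \quad (x \in I),
\qquad
u(t,0) = u(t,1) = 1 \quad (t \in (0,T)).
```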
We employ a non-vanishing Dirichlet condition in (3.2), with which we can easily assure the existence of a solution in (6.12) in the proof of Lemma 4. We shall comment on this issue again later. In (3.2), given the input data \(\vec{\xi}=(\xi _{1},\xi _{2},\ldots ,\xi _{J})^{\top }\in {\mathbb{R}}^{J}\), the initial data is a simple function of the form \(u_{0}(x) = \sum_{j=1}^{J} \xi _{j} \chi _{I_{j}}\), with \(\chi _{I_{j}}\) as the indicator functions of \(I_{j} \equiv ((j-1)/J, j/J] \) (\(j=1,2,\ldots ,J\)). Because we usually deal with a finite-dimensional input, we translate it into this finite-dimensional vector and the corresponding simple function on the unit interval I. This formulation differs from that of Liu and Markowich [52], in which they employed a region of the same dimension as the input feature. In that model, they computed the multiple integral of the input over the d-dimensional space in each layer. In the case of a two-dimensional CNN, their formulation coincides with the functionality of the convolution layer. In higher dimensions, however, it differs from how the multilayer neural network works in usual supervised learning.
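The embedding of the input vector into a simple function can be sketched as follows (illustrative code; `embed_input` is our naming, not from the paper):

```python
import numpy as np

def embed_input(xi):
    """Represent xi in R^J as the simple function
    u0(x) = sum_j xi_j * chi_{I_j}(x),  I_j = ((j-1)/J, j/J]."""
    xi = np.asarray(xi, dtype=float)
    J = len(xi)

    def u0(x):
        # Index j such that x lies in ((j-1)/J, j/J]; endpoints clipped
        # so that x = 0 falls into I_1 and x = 1 into I_J.
        j = np.clip(np.ceil(np.asarray(x) * J).astype(int) - 1, 0, J - 1)
        return xi[j]

    return u0

u0 = embed_input([2.0, -1.0, 3.0])
```

Evaluating `u0` on a grid then yields the initial data for the discretized problem.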
By taking \(v\equiv u1\), we can transform problem (3.1)–(3.2) as below.
The following result was obtained for problem (3.3).
Theorem 1
Let \(T>0\) be arbitrary, and the following be assumed:

(i)
\(u_{0} \in L_{2}(I)\),

(ii)
\(\phi \in C^{L}({\mathbb{R}})\),

(iii)
\(w_{1} \in L_{2}({\mathcal {H}}_{T})\).
Then, there exists a constant \(T_{u_{0}} \in (0,T]\) that depends on \(\|u_{0}\|_{L_{2}(I)}\) such that problem (3.3) has a unique solution \(v\in C([0,T_{u_{0}}];L_{2}(I))\) on the interval \([0,T_{u_{0}})\). In addition, this solution satisfies
where \(c(\|u_{0}\|_{L_{2}(I)})\) is a positive constant that depends monotonically increasingly on \(\|u_{0}\|_{L_{2}(I)}\).
We prove this theorem in Appendix B, in which we use the notation \(A = -\nu \frac{\partial ^{2}}{\partial x^{2}}\) and define a sesquilinear form \(\sigma (\cdot ,\cdot ):H_{0}^{1}(I) \times H_{0}^{1}(I) \rightarrow { \mathbb{R}}\) [21] by
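For the Dirichlet Laplacian, the associated form is the standard one:

```latex
\sigma (u,v) = \nu \int _{I}
\frac{\partial u}{\partial x}(x)\, \frac{\partial v}{\partial x}(x) \,dx ,
\qquad u, v \in H_{0}^{1}(I).
```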
Remark 1
The solution v mentioned in Theorem 1 also belongs to the space [64]
and satisfies the same estimates as the one in the theorem with the norm of these spaces. The proof of this fact is contained in the proof of Theorem 1 in Appendix B.
Remark 2
In our proof, we require neither \(\phi (0)=0\) nor a linear growth condition on \(\phi (\cdot )\), as required in [52].
Next, we show the existence of a temporally global solution.
Theorem 2
Let \(T>0\) be an arbitrary positive number and assume that in addition to the assumptions (i), (ii) of Theorem 1, \(w_{1} \in L_{2}({\mathcal {H}}_{\infty})\) is satisfied. Then, there exists a temporally global solution \(v \in C([0,T];L_{2}(I))\) to problem (3.3), which satisfies
where \(\chi (\cdot )\) is a monotonically increasing function.
Remark 3
As in Theorem 1, the solution v mentioned in Theorem 2 also belongs to the space
and satisfies the same estimates as the one in the theorem with the norm of these spaces.
The proof of Theorem 2 is given in Appendix B as well. Note that the estimate above does not depend on the diffusion coefficient \(\nu >0\). Thus, under the assumptions of Theorem 2, we can let ν tend to zero, to assert the corollary below [47].
Corollary 1
Under the assumptions of Theorem 2, if we denote the solution to (3.3) by \(v^{(\nu )}\), then we can take a sequence \(\{v^{(\nu _{m})}\}_{m=1}^{\infty }\subset L_{2}(I_{T})\) satisfying the following:
where \(v^{(0)} \in L_{2}(I_{T})\) is a solution to the hyperbolic equation
In our previous studies [31, 32], we set several cost functions corresponding to specific tasks, demonstrated the existence of optimal controls, and used the gradient descent algorithm to find the suboptimal control. For example, in [34], in which we discussed the multiclass classification problem, the cost function is given by
where \(\phi _{0}(\cdot )\) is an activation function of the output layer, \(P(\vec{X},\vec{t}_{(\vec{X})})\) is the probability distribution of \((\vec{X},\vec{t}_{(\vec{X})})\), and \(t_{(\vec{X}),k} \in \{0,1\}\) satisfies \(\sum_{k=1}^{K} t_{(\vec{X}),k}=1\) \(\forall \vec{X} \in {\mathbb{R}}^{J}\). However, because we consider the feedforward network hereafter, we do not consider \(\phi _{0}(\cdot )\) in the present paper. Instead, we discuss the universal approximation property of this neural network with the output layer of the linear unit.
Hereafter, we frequently represent the solution to (3.3) (or equivalently, (3.1)–(3.2)) as \(u(t,x;w_{1},\vec{\xi})\), to clarify its dependency on \(w_{1}\) and ξ⃗. Thus, we regard the solution \(u(T,x;w_{1},\vec{\xi})\) as a function on K by identifying \(u_{0}\) with \(\vec{\xi} = (\xi _{1},\ldots ,\xi _{J})^{\top}\).
4 Main result
In this section, we show the universal approximation property of the PDE-based neural network formulated in Sect. 3, which is based on a nonlinear partial differential equation [32]. As in previous works, we restrict ourselves to an arbitrary compact set \(K \subset {\mathbb{R}}^{J}\). Our main result is as follows.
Theorem 3
Let \(T>0\) be given and \(\phi \in C^{L}({\mathbb{R}})\) be a nonpolynomial function. Then, for an arbitrary compact set \(K \subset {\mathbb{R}}^{J}\), \(F \in C(K)\), and \(\varepsilon >0\), there exist \(w_{0} \in L_{2}(I)\), \(w_{1} \in L_{2}({\mathcal {H}}_{T})\) such that
where \(u(T,x;w_{1},\vec{\xi})\) is the value of a solution to (3.1)–(3.2) at time T that corresponds to the initial input value ξ⃗.
We will prove Theorem 3 in Sect. 7.
Remark 4
In this paper, we only consider the scalar-valued function \(F:K\rightarrow {\mathbb{R}}\), as Leshno [43] did. This does not sacrifice generality because, if we can approximate this function, then we can approximate an arbitrary continuous map \(F:K\rightarrow {\mathbb{R}}^{n}\) by concatenating the network in parallel, as is done in [14], as long as \(J \geq n\) holds. We also point out that our PDE-based neural network is defined only on a one-dimensional Euclidean space. This is because in many supervised learning tasks, the input data is a vector with independent attributes, which can be associated with a simple function on I as we did above. This is similar to the approach of [72]. When we consider GNNs, however, this assumption does not hold; addressing this is one of our future works.
Remark 5
The controls \(w_{0}\) and \(w_{1}\) depend on T and ν. Therefore, we cannot assure at this moment that the same conclusion holds with \(\nu =0\). The discussion concerning this vanishing diffusion limit is our future work.
5 Comparison with existing works
Before proceeding to the proofs of our results, we discuss the differences and novelty of our result in comparison to existing related works. Numerous contributions in the literature have studied the universal approximation property of neural networks. As an early work, Lippmann [51] postulated that a neural network with two hidden layers could form a range of decision surfaces to classify the points in a topological space.
This conjecture was rigorously proven by Funahashi [22], who stated that an arbitrary continuous function on a compact subset K in \({\mathbb{R}}^{n}\) could be approximated with a neural network with a single hidden layer that contained a sigmoid activation function.
Funahashi [23] also hypothesized that any \(L_{2}\) function could be approximated by a three-layer neural network with a finite number of units in the hidden layer. Four-layer networks have also been conjectured to outperform three-layer networks. These considerations are related to the study of the generalization performance of neural networks [6].
Irie and Miyake [36] derived the integral representation of three-layer neural networks based on the Fourier integral theorem under the continuity of the hidden layer.
Around the same time, Cybenko [11] first discussed the universal approximation property of sigmoidal functions. He showed that the set of functions of the form \(\sum_{j} w_{j} \sigma (\vec{y}_{j}^{\top }\vec{x} + b_{j})\), with constants \(w_{j}\) and \(b_{j}\) and vectors \(\vec{y}_{j}\), is dense in \(C(K)\) for a compact set K. His discussion did not assume the activation function to be monotonic.
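In code, the dense family considered by Cybenko is simply a one-hidden-layer network with a linear output. The sketch below evaluates one member of this family with arbitrary, untrained parameters (our illustration, not a construction from [11]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_net(x, W, Y, b):
    """Evaluate sum_j W_j * sigma(Y_j . x + b_j): a single hidden layer
    of sigmoidal units followed by a linear output unit."""
    return W @ sigmoid(Y @ x + b)

rng = np.random.default_rng(1)
Y = rng.standard_normal((8, 2))  # 8 hidden units, 2-dimensional input
b = rng.standard_normal(8)
W = rng.standard_normal(8)
val = shallow_net(np.array([0.3, -0.7]), W, Y, b)
```

Density means that, by enlarging the hidden layer and tuning `W`, `Y`, `b`, such sums can match any continuous function on a compact set to any tolerance.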
Hornik, Stinchcombe, and White [35] treated general measurable activation functions by making use of the Stone–Weierstrass theorem and the cosine squasher proposed by Gallant and White. Their results can be regarded as similar to those of Funahashi [22].
Leshno [43] obtained more general results by using the fact that the set of functions spanned by the so-called ridge functions, i.e., those of the form \(f(\vec{w}^{\top} \vec{x}+\theta )\), is dense both in \(C({\mathbb{R}}^{n})\) and \(L_{p}(\mu )\), where μ is an arbitrary finite measure on \({\mathbb{R}}^{n}\). Recently, Yun [85] proved the approximation property of a neural network constructed using a parametric sigmoidal function.
Some Bayesian perspectives on neural networks, even with an infinite number of nodes, have been discussed (see, for example, [54, 84]). Their key insight is that as the number of nodes tends to infinity, the output can be regarded as a set of Gaussian processes.
However, all these studies considered only general neural networks. Many works have also considered the universal approximation property of neural networks based on continuous dynamical systems.
Haber and Ruthotto [27] proposed a formulation of a neural network in a supervised learning framework as a dynamical system. They clarified the necessary condition for the stability of an equilibrium point as well as the stability of the Euler method as a discrete approximation of the continuous solution. They also pointed out the close relationship between backpropagation and the adjoint method in optimal control theory. Q. Li et al. [45] discussed the approximation property of an ODE-based neural network and gave a sufficient condition under which the set of realizations of an ODE-based neural network can approximate an arbitrary continuous map \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{m}\) (\(n \geq 2\)) on any compact set with respect to \(L_{p}\) (\(p \in [1,+\infty )\)) norms.
Along these lines, Aizawa and Kimura [1] recently presented the universal approximation property of neural ODEs [10] and ResNet using the result of Leshno [43]. However, their method is restricted to linear models.
Esteve [18] reviewed recent works concerning the approximation property of neural ODEs and, moreover, showed that the optimal value of the loss function is bounded above by a quantity of order \(T^{-1}\) under Tikhonov regularization (in what they called empirical risk minimization). Roughly contemporaneously, Teshima [76] also investigated the universal approximation property of neural ODEs. In their proof, they made use of their previous work [75] with a relatively slight modification. They also discussed the relationship between their result and a preceding work by Zhang et al. [88], which showed a counterexample that cannot be approximated by a neural ODE. Zhang et al. [88] also presented the universal approximation property of an augmented neural ODE [17].
Recently, a survey by DeVore et al. [14] thoroughly presented the existing results on the approximation property of neural networks. The power of Rectified Linear Unit (ReLU) networks was among the most important results introduced there, as such networks can represent all continuous piecewise-linear functions on an arbitrary compact set.
From the perspective of a practical application, Laakmann and Petersen [42] applied a neural network to the numerical computation of a transport equation.
Studies in the field of optimal control have also considered the universal approximation property of continuous neural networks.
Balet and Zuazua [61] proved the simultaneous controllability [48] of a flow map of an ODE. This means that given an arbitrary finite input in a Euclidean space, the flow map can lead to an arbitrary set of classification labels.
By making use of this property, they also showed that an arbitrary simple function, and consequently an \(L_{2}\) map \(f:{\mathbb{R}}^{d} \rightarrow {\mathbb{R}}^{d}\), can be approximated with arbitrary precision with respect to the \(L_{2}\) norm. They also discussed the relationship between the universal approximation property and simultaneous control [48, 53]. However, their method is not applicable here for two reasons. First, their method of rotating the coordinates does not suffice because we consider equations with a diffusion term. Second, because we aim at an approximation with respect to the maximum norm, their method is not applicable, as it divides the region into two sections, in one of which the function is allowed to be discontinuous. From the perspective of optimal control theory, the universal approximation property corresponds to approximate ensemble controllability [53]. Thus, our arguments here can also be regarded as describing this property of a specific type of control via a nonlinear diffusion equation. We prove this property of our model using some results from studies on machine learning.
The relationship between the optimal control of neural networks and optimal transport models has been pointed out as well (see, for instance, [50]). For example, Sontag and Sussmann [69] discussed the controllability of temporally continuous recurrent neural networks. Balet et al. [61], mentioned above, also argued this point and studied a nonlinear transport equation, which they called a neural transport equation (NTE), as given below.
They provided a method to approximate a target measure in the form of a finite combination of Dirac measures by the solution of an NTE at \(t=T\) with arbitrary precision in the sense of the 1-Wasserstein distance.
In [45], the authors theoretically considered the formulation of an ODE-based neural network and proved its universal approximation property. They first observed that the earlier discussion concerning the universal approximation property of neural ODEs [88] relied on a stronger assumption under which the right-hand side of an ODE already possesses the universal approximation property. They showed that any continuous function \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{n}\) on any compact set in a Euclidean space can be approximated in an \(L_{p}\) norm with arbitrary precision. They also pointed out that the set of realizations of an ODE is uniformly approximated by that of ResNet. They derived their main results based on another work by one of the authors [66]. However, in their formulation, they distinguished the one-dimensional and multidimensional input cases. In contrast, our proof in the present work need not distinguish these cases, because we start from Leshno’s result [43].
Regarding ResNet, Tabuada [73] gave some conditions on activation functions under which the universal approximation property of a map \(f:{\mathbb{R}}^{n} \rightarrow {\mathbb{R}}^{n}\) with the \(L_{\infty}\) norm is assured. They used the technique of ensemble controllability and deduced the quadratic differential equation that should be satisfied by each activation function. Research is also being actively conducted on the controllability of systems driven by linear or nonlinear partial differential equations [4].
For example, Fernández-Cara et al. [20] studied the null controllability of a heat equation with a spatially nonlocal term, which is roughly similar to the setup considered in the present work. Because their model was linear, they considered its adjoint equation and applied the Fourier-series representation of the equation and the compactness-uniqueness argument. They also stated that the approximate controllability, which is equivalent to the universal approximation property in the terminology of our PDE-based neural network, holds under the analyticity assumption on the kernel of the nonlocal term. However, our setup employs a nonlinear activation function, which is essentially different from this work; they listed the nonlinear case as an open problem.
As another example, [26] discussed the controllability of a nonlinear heat equation with a distributed control in an unbounded domain in \({\mathbb{R}}^{n}\). In this formulation, however, the control term is not included in the nonlinear term, which differs from our framework.
In fact, the application of diffusion equations in image processing has been discussed in prior works [82, 86]. Along these lines, Ruthotto and Haber [63] proposed parabolic and hyperbolic CNN models that respectively included spatial and temporal second-order derivative terms. They also considered the application of neural networks in this field as an extension of prior applications of PDEs.
Some other works have also addressed the control of parabolic PDEs [4], including linear and nonlinear heat equations. In this regard, the present work provides a link between the insights in the literature on neural networks and research on controllability in optimal control theory.
Regarding PDE-based neural networks, Liu and Markowich [52] proposed a hyperbolic nonlinear integro-differential form without a diffusion term. However, they considered only the mathematical well-posedness of the formulation and did not mention the universal approximation property. An earlier work [46] also proposed PDE-based neural networks for the transport and HJB equations, one of which used a diffusion term as in the present work. Neither of these works, however, mentioned the universal approximation properties of the models. Li and Shi [46] also proposed adding an extra constraint in cases with a diffusion term. In contrast, the present work shows that the universal approximation property is satisfied even without a trick of this nature.
At the end of this section, we list some recent notable results. Ivan et al. [59] proposed a framework to train neural ODEs using the Lyapunov function, which avoids traditional backpropagation and achieves faster computation. Moreover, a link between turnpike theory and optimal control has been considered in relation to neural ODEs [24]. Geshkovski and Zuazua [24] computed some examples of the turnpike property of a neural ODE using the MNIST dataset. They also mentioned that related results have been reported for specific setups, as in [18] and [19].
Based on the aforementioned arguments, the main contributions of this study are summarized as follows.

(i)
Motivated by the application of diffusion equations in image processing and by GNNs with a diffusion term, we formulate PDE-based neural networks with a diffusion term and rigorously clarify the conditions under which the existence of the solution is assured.

(ii)
We describe the universal approximation property of our model in the sense of the maximum norm.
Our key findings are summarized as follows.

(i)
We show that some insights from studies on machine learning can be applied to the theory of the optimal control of PDEs.

(ii)
Even though Leshno’s result (Lemma 1 below) is a useful tool to prove our result for continuous neural networks, some additional formulations are required to discuss the convergence as the temporal and spatial mesh sizes tend to 0.

(iii)
Because our model contains a diffusion term, our method differs from those presented in prior works, although it is based on that reported by Leshno [43]. More concretely, our proof uses estimates of the approximation of the discretized diffusion equation that were not considered in previous studies.
In subsequent sections, we prove the results presented above.
6 Preliminary results
Before proving Theorem 3, we prepare some auxiliary results in this section. We first cite the following lemma ([43], Theorem 1).
Lemma 1
Let f be a measurable function on \({\mathbb{R}}\) with a certain \(J \in {\mathbb{N}}\). Then, \(\operatorname{Span} \langle f_{{\mathbf{w}},\theta}({\mathbf{x}}) \rangle \) (\({\mathbf{w}} \in {\mathbb{R}}^{J}\), \(\theta \in {\mathbb{R}}\)) is fundamental in \(C({\mathbb{R}}^{J})\) if and only if f is not a polynomial, where \(f_{{\mathbf{w}},\theta}({\mathbf{x}})= f({\mathbf{w}}\cdot {\mathbf{x}} + \theta ) \).
Owing to this lemma, for an arbitrary \(\vec{\xi} \in K \subset {\mathbb{R}}^{J}\) and \(\varepsilon >0\), by taking a suitable \(M \in {\mathbb{N}}\), \(\{\sigma _{0(m)}\}_{m=1}^{M} \subset {\mathbb{R}}\), \(\{\vec{\sigma}_{1}^{(m)}\}_{m=1}^{M} \subset {\mathbb{R}}^{J}\), and \(\{\theta ^{(m)}\}_{m=1}^{M} \subset {\mathbb{R}}\), we can obtain
Based on Lemma 1, we construct an approximating solution of (3.3) that agrees with the approximation stated in (6.1). Next, by taking the temporal and spatial meshes sufficiently fine, we show that the solution of (3.3) itself can approximate the target continuous function. In applying these steps, we make use of the estimate on the approximation accuracy of the Galerkin approximation.
Let us consider the approximate problem of (3.3). Regarding the spatial variable, we employ the Galerkin approximation [77]. For this purpose, let \({\mathfrak {S}}_{h} \equiv \{S_{h}\}_{h}\) be a family of finitedimensional subspaces of \(H_{0}^{1}(I)\) with parameter \(h<1\) that tends to 0 [77]. In the sequel, we set an integer L and take \(h=\frac{1}{L}\) (i.e., we divide I into L equipartitions). It is also assumed that
holds (r is a positive value, for example, \(r=2\) [3]). We also define an approximation operator \(A_{h}:{\mathfrak {S}}_{h} \rightarrow {\mathfrak {S}}_{h}\) by using the sesquilinear form \(\sigma (\cdot ,\cdot )\) in (3.4) as follows:
Thus, \(A_{h}\) is the operator associated with the restriction of \(\sigma (\cdot ,\cdot )\) on \({\mathfrak {S}}_{h} \times {\mathfrak {S}}_{h}\) [21]. We further define an operator \(P_{h}\) indicating a projection of \(u \in L_{2}(I)\) onto \({\mathfrak {S}}_{h}\) with respect to the \(L_{2}\) inner product [21].
Then, we divide the time interval \((0,T]\) into N intervals \(\{[ (n-1)k, nk) \}_{n=1}^{N}\), with \(Nk=T\). By using a notation \(U_{h}^{(n)} = [ U_{h(l)}^{(n)} ]_{l}\), we consider the discretized scheme of (3.3) on \(t \in (0,T]\).
where \(r(kA_{h})\) denotes the Padé approximation [3] of the semigroup
Here, \(I_{L}\) is an L-dimensional identity matrix. We also use the notation \(\|\cdot \|\) hereafter to denote a Euclidean norm; note that this is equivalent to the \(L_{2}\) norm as long as we consider piecewise constant functions in \(L_{2}(I)\).
Moreover, we introduce notations \(\overline{P}_{h} f = [(\overline{P}_{h} f)_{l}]_{l}\), with
for \(f \in L_{1}(I)\) in general, where \(I_{l} \equiv [\frac{l-1}{L},\frac{l}{L})\) (\(l=1,2,\ldots ,L\)), and \(\overline{P}_{h}:L_{2}(I) \rightarrow {\mathfrak {S}}_{h}\) is the projection onto the finite-dimensional space \({\mathfrak {S}}_{h}\) for each \(h=\frac{1}{L}\). Because these are projection operators, note that the inequalities \(\|\overline{P}_{h} f\| \leq \|f\|\) and \(\|P_{h} f\| \leq \|f\|\) always hold. The value \(\|\overline{P}_{h}f\|\) is computed by regarding \(\overline{P}_{h}f\) as a simple function on I and then taking the usual norm of \(L_{2}(I)\).
Remark 6
The operator \(\overline{P}_{h}\) has often been used in the literature on the discrete approximation of operators ([79, 81]). It is known that this is equivalent to the operation
in the sense that the following equality holds [81].
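As a concrete illustration of the cell-averaging operator \(\overline{P}_{h}\) and the norm inequality \(\|\overline{P}_{h} f\| \leq \|f\|\), both can be checked numerically. This is a sketch under our own naming; `pbar` and `l2_norm_simple` are hypothetical helpers, not notation from the paper.

```python
import numpy as np

def pbar(f, L, pts_per_cell=200):
    """Cell averages (P̄_h f)_l = (1/h) ∫_{I_l} f dx over I_l = [(l-1)/L, l/L)."""
    h = 1.0 / L
    cells = []
    for l in range(1, L + 1):
        x = np.linspace((l - 1) * h, l * h, pts_per_cell, endpoint=False)
        cells.append(f(x).mean())
    return np.array(cells)

def l2_norm_simple(v, L):
    """L2(I) norm of the piecewise-constant function with cell values v."""
    return np.sqrt(np.sum(v**2) / L)

f = lambda x: np.sin(3 * np.pi * x) + x**2
L = 32
v = pbar(f, L)
xfine = np.linspace(0, 1, 20000, endpoint=False)
norm_f = np.sqrt(np.mean(f(xfine)**2))
# averaging is norm non-increasing (Jensen's inequality on each cell)
print(l2_norm_simple(v, L), norm_f)
```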
In (6.2), we utilized the vanishing Dirichlet condition of (3.3). By the standard theory, \(u((N-1)k,x)\) can be represented as follows [37].
where \(\lambda _{j}\) and \(\eta _{j}\) are the eigenvalues and eigenvectors of an operator A, respectively, and
Hereafter, we often use the notation \(u((N-1)k,\cdot )\) to denote \(u((N-1)k,\cdot ;0,\vec{\xi})\). Thus, we have
Hereafter, we use the notation \(\vec{\sigma}_{0} = [\sigma _{0(m)} ]_{m=1}^{M} \in { \mathbb{R}}^{M}\). We prepare a lemma.
Lemma 2
With \(L=aM\), where an integer a has a sufficiently large value, there exist \(\vec{w}_{0(h,k)}^{\prime }\in {\mathbb{R}}^{L}\), \(\{ \vec{\theta}_{0(h)}^{(p)}\}_{p=1}^{L} \subset {\mathbb{R}}^{J}\), and \(\{ \theta _{1(h)}^{(p)} \}_{p=1}^{L} \subset {\mathbb{R}}\) such that
Remark 7
The lefthand side of the equality in Lemma 2 is the inner product of the vectors in \({\mathbb{R}}^{L}\), whereas the righthand side is that of the vectors in \({\mathbb{R}}^{M}\).
Proof
First, we introduce disjoint subsets of \(\{1,2,\ldots ,L\}\):
It is obvious that \(\{1,2,\ldots ,L\} = \bigcup_{m=1}^{M} D_{(m)}\). Then, we take \(\vec{\theta}_{0(h)}^{(p)}\) and \(\theta _{1(h)}^{(p)}\) so that \(h\vec{\theta}_{0(h)}^{(p)} = \vec{\sigma}_{1}^{(m)}\) (\(p \in D_{(m)}\)) and \({\theta}_{1(h)}^{(p)} = {\theta}^{(m)} \) (\(p \in D_{(m)}\)), respectively. Let us take \(\vec{w}_{0(h,k)}^{\prime}\) so that the following conditions are satisfied:
where \({\boldsymbol {B}}_{h} =h [\vec{e}_{1},\vec{e}_{2},\ldots ,\vec{e}_{M}]\) is an \(L\times M\) matrix with \(\vec{e}_{j} = [H(l \in D_{(j)})]_{l} \in {\mathbb{R}}^{L}\), \(H(\cdot )\) being a function that returns unity if the statement in the bracket is true and returns 0 otherwise. Condition (6.4) means that \(\vec{w}_{0(h,k)}^{\prime}\) should belong to a subspace of \({\mathbb{R}}^{L-2}\), which is denoted as \({\mathcal {G}}_{h}\) hereafter. Therefore, we rewrite (6.5) as follows:
where \({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top}|_{{\mathcal {G}}_{h}}\) denotes the restriction of \({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top}\) onto the space \({\mathcal {G}}_{h} \subset {\mathbb{R}}^{L-2}\).
Based on Proposition 8.14 in [87], (6.6) can have a solution if and only if \(\vec{\sigma}_{0} \perp N(r(kA_{h}){\boldsymbol {B}}_{h}|_{{\mathcal {G}}_{h}})\), where \(N(\cdot )\) denotes the kernel of the operator in its argument.
Next, we can show that \(r(kA_{h})\) is of full rank. In fact, because \(A_{h}\) is positive definite, all the eigenvalues of \(A_{h}\) are positive. Moreover, because \(A_{h}\) is self-adjoint, we observe that it is diagonalizable [12], and so is \(r(kA_{h})\). Thus, \(r(kA_{h})\vec{v}=\vec{0}\) implies \(\vec{v}=\vec{0}\). Furthermore, it is apparent that \({\boldsymbol {B}}_{h}\) is suborthogonal in the sense that its column vectors are orthogonal to each other. Thus, \(N(r(kA_{h}){\boldsymbol {B}}_{h}|_{{\mathcal {G}}_{h}})=\{\vec{0}\}\), which yields the desired result. □
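The two linear-algebra facts used in this proof (a positive-definite \(A_{h}\) makes \(r(kA_{h})\) full rank, and the 0/1 block structure makes the columns of \({\boldsymbol {B}}_{h}\) mutually orthogonal) can be checked numerically. The matrices below are our own toy stand-ins, not the paper's discretization.

```python
import numpy as np

rng = np.random.default_rng(0)
L_dim, M, k = 12, 3, 0.1

# random symmetric positive-definite stand-in for A_h
S = rng.standard_normal((L_dim, L_dim))
A_h = S @ S.T + L_dim * np.eye(L_dim)

# r(kA_h) = (I + kA_h)^{-1}: all eigenvalues of A_h are positive,
# so I + kA_h is invertible and r(kA_h) has trivial kernel
r = np.linalg.inv(np.eye(L_dim) + k * A_h)

# B_h = h[e_1, ..., e_M], with e_j the indicator vector of the block D_(j)
a, h = L_dim // M, 1.0 / L_dim
B = np.zeros((L_dim, M))
for m in range(M):
    B[m * a:(m + 1) * a, m] = h   # columns are disjoint, hence orthogonal

# full rank of r, and r B keeps rank M, i.e. N(r B) = {0}
print(np.linalg.matrix_rank(r), np.linalg.matrix_rank(r @ B))
```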
Remark 8
By construction, the solution \(\vec{w}_{0(h,k)}^{\prime}\) to (6.6) depends on h and k. As shown above, (6.6) has at least one solution. If we denote this solution by \(\breve{w}_{0(h,k)}^{\prime}\), then the set of solutions of (6.6) can be written as \(\breve{w}_{0(h,k)}^{\prime }+ N({\boldsymbol {B}}_{h}^{\top }r(kA_{h})^{\top})\).
Now, we define:
where \(w_{0(h,k)(l)}^{\prime}\) is the lth component of the vector \(\vec{w}_{0(h,k)}^{\prime}\). We assert that, for a certain \(R>0\), we have at least one solution stated in Lemma 2, in a certain ball with radius R in \(L_{2}(I)\). In the sequel, we use the following notations:
Now, we state
Lemma 3
For certain \(R>0\), \(k>0\), and \(h_{1}>0\), we have a solution \(\vec{w}_{0(h,k)}^{\prime}\) to (6.4) and (6.5) that satisfies
for all \(h \in (0,h_{1}]\).
Proof
First, we take a small \(k>0\), \(h_{1}=\frac{1}{L_{1}}>0\), and \(\vec{w}_{0(h_{1},k)}^{\prime }\in {\mathbb{R}}^{L_{1}}\), which satisfy
Note that the existence of such \(\vec{w}_{0(h_{1},k)}^{\prime}\) is guaranteed by Lemma 2. Moreover, the solution \(\vec{w}_{0(h_{1},k)}^{\prime }\) to (6.7) belongs to the intersection of \({\mathcal {G}}_{h_{1}}^{(k)}\) and a set represented as \(\breve{w}_{0(h_{1},k)} + N({\boldsymbol {B}}_{h_{1}}^{\top})\), where \(\breve{w}_{0(h_{1},k)}^{\prime}\) is the solution to the problem in \({\mathfrak {S}}_{h_{1}}\):
with \(\frac{1}{h_{1}}= L_{1}=a_{1} M\). Here, \(\widetilde{\sigma}_{0}\) is a notation used when we regard \(\vec{\sigma}_{0}\) as an element in \(L_{2}(I)\). Hereafter, we often regard \({\mathcal {G}}_{h_{1}}^{(k)}\) as a subset of \(L_{2}(I)\). Note that we can easily obtain the solution of (6.8) if we recall the definition of \(r(kA_{h_{1}})\). We denote one such solution as \(\breve{w}_{0(h_{1},k)}^{\prime }\in {\mathcal {G}}_{h_{1}}^{(k)}\) again:
For \(h>0\), we define a map \(G_{h}^{(k)} : L_{2}(I) \rightarrow {\mathfrak {S}}_{h}\) as follows:
where \(\breve{P}_{{\mathcal {G}}_{h}^{(k)}} : L_{2}(I) \rightarrow { \mathcal {G}}_{h}^{(k)}\) is a projection onto \({\mathcal {G}}_{h}^{(k)}\) with respect to the \(L_{2}\) inner product.
We will show below that if the norm of \(\breve{w}_{0(h,k)}^{\prime}\) is large enough, then its norm remains large even after the projection above is applied.
Next, we define
Because the dimension of \(({\mathcal {G}}_{\infty}^{(k)} )^{\perp}\) is one, it holds that \(S_{R}^{(k)} \ne \emptyset \). We also take a small \(\varepsilon _{1}>0\) and a sufficiently small h̃ so that
This is possible if we note
and the relationship that holds with \(v \in L_{2}(I)\) [3]:
with r being the one stated right after (6.1), where \(\gamma (h)\) tends to zero as h does. Moreover, let \(R>0\) have a sufficiently large value so that the following holds (R should be redefined, if necessary):
In order to show that this is possible, we can demonstrate the continuity of the resolvent \(r(kA)=(I_{d}+kA)^{-1}\) with respect to k, where \(I_{d}\) is an identity operator. We can prove this by using the resolvent equation [39] and the boundedness of \(r(kA)\), as presented by Fujita and Mizutani [21]. Thus, for the \(\varepsilon _{1}>0\) above, if we take a sufficiently small k, we have
for \(v_{0} \in S_{R}^{(k)}\). This yields
with an arbitrary \(\varepsilon _{1}>0\). Therefore, if we take R sufficiently large, we arrive at (6.10) and consequently,
for this \(v_{0}\). Now, for an arbitrary \(h_{2}=\frac{1}{a_{2}M}\) with \(a_{2}>a_{1}\), we define a homotopy mapping \(H:L_{2}(I) \times [0,1] \rightarrow {\mathfrak {S}}_{h_{2}}\):
where \(s \in [0,1]\) and \(D_{a_{2},a_{1}}\) is an \(a_{2}M \times a_{1} M\) matrix whose components are either 0 or 1. That is, this matrix is used to translate the image of \(G_{h_{1}}^{(k)}\) into an element of \({\mathbb{R}}^{a_{2}M}\). By virtue of the arguments presented above, we have
Then, we have
and \(H(f,s)\) is a compact operator for each s because its range has a finite dimension. Owing to the result of degree theory [87], we can conclude that the equation
has a solution. Consequently,
has a solution for all \(h_{2} \in (0,h_{1}]\) that satisfies \(\|f\| \leq R\). If we take \(\vec{w}_{0(h_{2},k)}^{\prime }= \breve{P}_{{\mathcal {G}}_{h_{2}}^{(k)}}f \), this is the desired solution. □
By using this, we assert the following lemma.
Lemma 4
Let h and k be sufficiently small positive numbers. Then, for an arbitrary \(\vec{\xi} \in K\subset {\mathbb{R}}^{J}\) and \(\varepsilon >0\), there exists an array \({\boldsymbol {W}} = [w_{p,l}^{(N1)} ]_{p,l=1,2,\ldots ,L}\), with which \(\widetilde{U}_{h}^{(N)}\) defined in (6.2) satisfies
Proof
In fact, based on (6.4) and (6.5), we consider the following equations for \({\boldsymbol {W}} = [w_{p,l}^{(N1)} ]_{p,l=1,2,\ldots ,L}\).
For each fixed p (\(1\leq p \leq L\)), this can be written as an equation for \(\vec{w}_{p} \equiv [w_{p,l}^{(N1)}]_{l}\) as shown below:
where
By the same argument as in the proof of Lemma 2, we show that \(N({\boldsymbol {T}}_{p}^{\top}) = \{\vec{0}\}\). In fact, recall that \({\boldsymbol {T}}_{p}\) is a linear map from \({\mathbb{R}}^{2}\) to \({\mathbb{R}}^{L}\). If \({\boldsymbol {T}}_{p}^{\top }\vec{q}=\vec{0}\) holds with \(\vec{q}=(q_{1},q_{2})^{\top}\), then it is easily observed that \(q_{2}=0\) (actually, adding a nonvanishing Dirichlet boundary condition in (3.2) works here). Regarding \(q_{1}\), if \(q_{1} \ne 0\), all the following equalities should hold:
However, from (6.3), this means that
which does not hold if we take L sufficiently large. In fact, for an arbitrary \(\varepsilon ^{\prime}>0\), if we take \(h>0\) small enough, we obtain
and thus, we have
But (6.14) implies that if we take h sufficiently small, then we can attain
Thus, we have \(\|Z*u_{0}(T,\cdot ) \| <\varepsilon ^{\prime}\). Because \(\varepsilon ^{\prime}\) is arbitrary, we have \(Z*u_{0}(T,\cdot )=0\). If we recall (6.3), this implies
from which we obtain \(\lambda _{j} (\eta _{j},\sum_{q=1}^{J} \xi _{q} \chi _{I_{q}} ) e^{-\lambda _{j}(N-1)k} = 0\), and consequently, \((\eta _{j},\sum_{q=1}^{J} \xi _{q} \chi _{I_{q}} ) = 0\) for all \(j=1,2,\ldots \) . This means \(u_{0} \equiv 0\), a contradiction. Thus, we can conclude that \(\vec{q} = \vec{0}\), which means that \(N({\boldsymbol {T}}_{p}^{\top}) =\{\vec{0}\}\). Thus, (6.13), and consequently (6.12), has a solution. This means that
holds with \({\boldsymbol {W}} = [w_{p,l}^{(N1)} ]_{p,l=1,2,\ldots ,L}\) prescribed. Moreover, recalling (6.3), this is rewritten as
If we further recall (6.2), this implies that
where \(\phi _{1} = \phi ( h\sum_{l=2}^{L1} w_{p,l}^{(N1)} \widetilde{U}_{h(l)}^{(N1)} + h \sum_{l=1}^{L}w_{p,l}^{(N1)} ) \).
Note that \(\|\vec{w}_{0(h,k)}^{\prime}\|\) is bounded with respect to \((h,k)\) thanks to the proof of Lemma 3. Thus, if we take k small enough, we can make the second term of the left-hand side above as small as desired. This, together with Lemma 1, yields the desired statement. □
By using the solution of (6.13), we construct a function \(\bar{w}_{1}(t,x,y)\) as follows:
By noting that \(\bar{w}_{1} \in L_{2}({\mathcal {H}}_{T})\), we set \(\bar{u}(t,x) \equiv u(t,x;\bar{w}_{1},\vec{\xi})\), which solves (3.3) with \(w_{1}=\bar{w}_{1}\). This neural network with \(\bar{w}_{1}\) and ū can be regarded as a forward neural network with (6.2), which is a kind of RBF network with a Gaussian kernel [7, 84]; in our case, however, we use the fundamental solution with the Dirichlet condition. We have a similar result to Lemma 3 for \(\bar{w}_{1}\) as well.
Corollary 2
For certain \(R^{\prime}>0\), \(k>0\), and \(h_{2}>0\), we have a solution \(\vec{w}_{p}\) to (6.13) that satisfies
for all \(h \in (0,h_{2}]\). Thus, for \(\bar{w}_{1}\) defined in (6.15), we have
Remark 9
Regarding the mapping degree of a map between two spaces with different dimensions, one can refer to, for instance, the work of [55].
7 Proof of Theorem 3
Now, we present Lemma 5 below, which is crucial for the proof of the main theorem. It assures that we can make \(\widetilde{U}_{h}^{(N)}\) and \(\bar{u}(T)\) sufficiently close if we take h and k small enough while maintaining a certain relationship between them. In its proof, we insert another variable \(V_{h}^{(N)}\), with which we can prove the lemma by using the estimate of the fundamental solution of the heat equation.
After proving Lemma 5, we can easily prove Theorem 3 by using Lemmas 3 and 4 as well. First, we note that we have the estimate of \(\phi (\cdot )\) just as we did in the proof of Theorem 2, right above (B.18) in Appendix B. Combining it with the a priori estimate there, we can estimate the left-hand side from above in the form
where \(c(u_{0})\) is a positive constant that depends on \(u_{0}\).
Lemma 5
\(\{ \widetilde{U}_{h}^{(n)} \}_{n}\) defined in (6.2) satisfies:
where \(d( k,h)\) is a quantity that is independent of \(w_{1}\) and tends to 0 when k and h tend to 0 satisfying \(h^{2}=o (\log ( \frac{T}{k} )^{-1} )\).
Proof
The overview of the proof of this lemma is as follows. Because ū is a continuous variable while \(\widetilde{U}_{h}^{(N)}\) is a discretized one, we insert another variable \(V_{h}^{(N)}\) and estimate both \(\|V_{h}^{(N)}-\bar{u}(T)\|\) and \(\|V_{h}^{(N)}-\widetilde{U}_{h}^{(N)}\|\) to obtain the desired result. For the former, we estimate the accuracy with which the discretization approximates the continuous solution. For the latter, we make use of the properties of the Padé approximation.
On the one hand, based on Duhamel’s principle, ū satisfies the following:
On the other hand, \(\{ \widetilde{U}_{h}^{(n)} \}_{n}\) defined in (6.2) satisfies the following:
We also consider
Note that \(\bar{w}_{1}\) is a piecewise constant function by its construction; thus, the right-hand side above makes sense. Recalling \(T=Nk\), we first consider \(\|\bar{u}(T)-{V}_{h}^{(N)}\|\). We have
Based on the estimate presented by Fujita and Mizutani [21], we have
where \(c_{81}\) is a positive constant. On the other hand, regarding the second term of the right-hand side of (7.2), we have
Regarding the estimate of \(J_{1}\), let us recall (6.15). Then, following the approach of Hoff and Smoller [30], we have
On the other hand,
Note that for an integrable function \(f(s)\) of s, we have
Thus, we have
which yields \(J_{2} \leq c_{82} ( k+ko(k) )\) with some \(c_{82}>0\). As for the estimate of \(J_{3}\), applying (7.3) again together with (7.1) and Corollary 2, we have
Recalling (6.15), the rightmost side is estimated by
where \(P_{1}(t)=\{t\}-\frac{1}{2}\) with \(\{x\}\) being the fractional part of its argument, and we have used the Euler–Maclaurin formula [38]. Combining these, under the assumption of the lemma, we arrive at the following:
where \(c(h,k) \rightarrow 0\) as \(h,k \rightarrow 0\) satisfying \(h^{2} = o ( \log (T/k)^{-1} )\). Next, we estimate \(\|{V}_{h}^{(N)}-\widetilde{U}_{h}^{(N)}\|\). Recall the following equalities.
Thus, we have
Regarding the first term of the right-hand side of (7.6), we have the following inequality.
Because N is sufficiently large, we have \(e^{(N2)kA}u_{0} \in H^{2}(I)\). For an arbitrary \(\varepsilon _{2}>0\), if we take h sufficiently small, we can obtain
for a uniformly continuous function f in general. Moreover, we have
By combining this with (7.8) and applying them to (7.7), we obtain the estimate
Regarding the second term of the right-hand side of (7.6), we have
Thus, (7.6), (7.7), and (7.9) yield
Next, we proceed to the estimation of \(\| \widetilde{U}_{h}^{(N)} - V_{h}^{(N)} \|\). Note that \(V_{h}^{(N)}\) satisfies the following recurrence relation:
Then, by using (6.2), we observe that
Recalling (6.15), we have that \(h \sum_{l=1}^{L} w_{\cdot ,l}^{(N-1)} =\int _{I} \bar{w}_{1}(Nk, \cdot ,y) \,\mathrm{d}y\), and
Similarly, if we recall the definitions of \(\overline{P}_{h}\), we have
Thus, we have
which implies that the second term of (7.11) vanishes. Thus, owing to this and (7.10), we can estimate (7.11) as shown below.
Regarding the second term on the right-hand side of (7.11), let us recall (6.15) and the definition of \(\overline{P}_{h}\) to obtain:
Similarly, if we recall the definitions of \(\overline{P}_{h}\), \(\widetilde{U}_{h(l)}^{(N-1)}\) and
we observe
Thus, we arrive at the estimate:
Finally, by combining (7.2) and the estimates above, we arrive at the desired inequality of Lemma 5, because \(\varepsilon _{2}>0\) is arbitrary. □
Owing to Lemmas 3, 4, and 5, if the spatiotemporal mesh is sufficiently fine and the relationship \(h^{2}= o (\log ( \frac{T}{k} )^{-1} )\) is satisfied, we can approximate the solution u of (3.3) with the fully discretized one, for example, with \(( (\triangle x)_{i},(\triangle t)_{i} )\), regardless of \(w_{1}\). This leads us to the proof of Theorem 3 in Sect. 4. Actually, owing to Lemma 3, we can assume that \(\|\vec{w}_{0(h,k)}^{\prime}\| \leq R\) with some \(R>0\). Then, we have
Thus, if h and k are set to sufficiently small values maintaining the relationship \(h^{2}= o (\log ( \frac{T}{k} )^{-1} )\), the right-hand side of (7.12) can be made less than ε. This proves Theorem 3.
Remark 10
The estimate (7.12) above is observed for a fixed value of ν. Actually, the estimate (7.3) in Fujita and Mizutani [21] is obtained by assuming \(\nu =1\). In the general case, let us introduce a transform \(\bar{x} = x/\sqrt{\nu}\). Then, a problem
with an initial value \(u_{0}(x)\) for a function \(u(t,x)\) is transformed into the form:
for a function \(\bar{u}(t,\bar{x})\) with an initial value \(\bar{u}_{0}(\bar{x}) = u_{0}(x)\). We can easily find that
which means that the right-hand side of (7.12) diverges as \(\nu \rightarrow 0\). Therefore, we leave the convergence of the universal approximation property proved here as \(\nu \rightarrow 0\) as an open problem.
8 Capacity and learnability of the model
In the proof of the universal approximation property, we fixed the values of \(w_{1}\) up to the time right before the terminal moment. However, this does not mean that the temporal direction is unnecessary in our model. The universal approximation property is not the only property that a learner should possess; indeed, learnability and generalization performance are also important. In this section, we observe that our model possesses learnability in a certain sense. In doing so, we also observe that the estimates used to deduce learnability depend on time.
In this regard, we discuss other aspects of the proposed model in the sequel. First, we address its learnability, specifically focusing on classification performance metrics and classes of functions, such as the VC-dimension and the Glivenko–Cantelli class. Our discussion is limited to binary classification. Although we discuss the VC-dimension, we defer its definition to the monographs [8, 65]. Finally, we present the results of our numerical experiments.
8.1 Learnability
In this section, we discuss the learnability of the proposed model. Hereafter, we often denote the solution to (3.3) (or equivalently, (3.1)–(3.2)) as \(u(t,x;w_{1},\vec{\xi},\nu )\) to clearly indicate its dependence on \(w_{1}\), ξ⃗, and ν. Then, given the terminal moment T and diffusion coefficient ν, we define a hypothesis set. This set comprises the functions on \({\mathbb{R}}^{J}\) realized by our model:
Let us discuss the learnability of \({\mathscr{F}}_{T}^{(\nu )}\). In the following, the VC-dimension of a hypothesis set \({\mathscr{F}}\) is denoted as \(VC({\mathscr{F}})\). Our first result is
Theorem 4
Suppose that the assumptions of Theorem 3 are satisfied. Let T and ν be arbitrary positive numbers. Then, for our proposed PDE-based neural network,
Proof
Suppose that we are given an arbitrary \(N \in {\mathbb{N}}\) and a dataset \(\{\vec{\xi}_{i},y_{i}\}_{i=1}^{N} \subset {\mathbb{R}}^{J} \times \{\pm 1 \}\). Then, let us take \(\varepsilon >0\) so that \(B(\vec{\xi}_{i};\varepsilon ) \cap B(\vec{\xi}_{j};\varepsilon ) = \emptyset \) (\(i \ne j\)). By virtue of Theorem 3, by suitably taking \(w_{0}\) and \(w_{1}\), we can construct a continuous function \(f \in {\mathscr{F}}_{T}^{(\nu )}\) that associates each element of \(B(\vec{\xi}_{i};\varepsilon )\) with \(y_{i}\) for all \(i=1,2,\ldots ,N\). This means that the set \({\mathscr{F}}_{T}^{(\nu )}\) shatters the given dataset for an arbitrary \(N \in {\mathbb{N}}\). □
Theorem 4 also implies that we would require an infinite amount of training data, which is practically impossible, and that our model is not PAC-learnable in the classical sense [65]. However, using the concept of a structural risk minimization (SRM) scheme, we can still make it nonuniformly learnable [65]. A relaxation of the concept of learnability of this kind has also been applied to support vector machines [8].
To discuss this in more detail, we introduce certain notations. In general, the “risk” over a loss function \(l(\cdot )\) and a general hypothesis set \({\mathscr{F}}\) is defined by the following:
where \({\mathcal {D}}\) is an unknown data-generating distribution over \({\mathcal {Z}} \equiv {\mathcal {X}} \times \{\pm 1\}\), with \({\mathcal {X}}\) being a set of inputs. The notation \(z \sim {\mathcal {D}}\) means that a random variable z is drawn from \({\mathcal {D}}\). Similarly, we use the notation \(S \sim {\mathcal {D}}^{m}\) to denote that a dataset S of sample size m is drawn i.i.d. from \({\mathcal {D}}\). If some \(h \in {\mathscr{F}}\) attains (8.2), we call it a Bayesian hypothesis. However, we usually do not know the actual distribution \({\mathcal {D}}\). For this reason, we usually try to minimize a surrogate quantity, called the empirical risk:
where \(S = \{(\vec{\xi}_{i},y_{i})\}_{i=1}^{m} \subset {\mathcal {Z}}\) represents the training data drawn from the original unknown distribution \({\mathcal {D}}\). This framework is called empirical risk minimization (ERM). By the law of large numbers, \(L_{S}(h)\) converges to the true risk as \(m \rightarrow +\infty \) for each h. We also define
where \({\mathrm{ERM}}_{{\mathscr{F}}}(S)\) denotes a hypothesis returned as (one of) the minimizer(s) of the empirical risk under training dataset S.
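A minimal numerical sketch of the ERM scheme just described, over a toy hypothesis set of threshold classifiers of our own choosing (not the paper's \({\mathscr{F}}_{T}^{(\nu )}\)); all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.uniform(-1, 1, size=200)          # inputs drawn i.i.d. from D
y = np.where(xi > 0.25, 1, -1)             # labels generated by a true threshold

def empirical_risk(theta, xi, y):
    """L_S(h_theta): fraction of training points the threshold classifier
    h_theta(x) = sign(x - theta) misclassifies (0-1 loss)."""
    pred = np.where(xi > theta, 1, -1)
    return np.mean(pred != y)

# ERM over a grid of hypotheses: pick the minimizer of the empirical risk
thetas = np.linspace(-1, 1, 401)
risks = [empirical_risk(t, xi, y) for t in thetas]
best = thetas[int(np.argmin(risks))]
print(best, min(risks))                    # ERM recovers a near-true threshold
```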
To evaluate the “goodness” of the training data, we define the following concept.
Definition 1
A training set S is called ε-representative with respect to the domain \({\mathcal {Z}} \equiv {\mathcal {X}} \times \{\pm 1\}\), hypothesis set \({\mathscr{F}}\), loss function \(l(\cdot )\), and distribution \({\mathcal {D}}\) if the following holds.
To determine the conditions under which the ERM scheme works well, we need the following definition (please refer to [65], Definition 4.3).
Definition 2
We say that a hypothesis set \({\mathscr{F}}\) possesses the uniform convergence property with respect to the domain \({\mathcal {Z}}\) and loss function \(l(\cdot )\) if there exists a function \(m_{\mathscr{F}}^{UC}:(0,1)^{2} \rightarrow {\mathbb{N}}\), called the sample complexity, such that for each \(\varepsilon ,\delta \in (0,1)\) and for every probability distribution \({\mathcal {D}}\) over \({\mathcal {Z}}\), if S is a sample of \(m \geq m_{\mathscr{F}}^{UC}(\varepsilon ,\delta )\) elements drawn i.i.d. according to \({\mathcal {D}}\), then, with a probability of at least \(1-\delta \), S is ε-representative.
A well-known theorem states (see [65], Theorem 6.7) that the uniform convergence property is equivalent to the finiteness of the VC-dimension of the hypothesis set. Thus, together with Theorem 4 above, our hypothesis set \({\mathscr{F}}_{T}^{(\nu )}\) does not satisfy the uniform convergence property (consequently, it is neither PAC nor agnostic-PAC learnable, although we omit the definitions of these terms here). However, we can also consider a relaxed concept of learnability [65].
Definition 3
A hypothesis set \({\mathscr{F}}\) is said to be nonuniformly learnable if there exists a learning algorithm A that associates a dataset S with a hypothesis \(A(S) \in {\mathscr{F}}\) and a function \(m_{\mathscr{F}}:(0,1)^{2} \times {\mathscr{F}} \rightarrow {\mathbb{N}}\) such that for every \(\varepsilon , \delta \in (0,1)\) and for every \(h \in {\mathscr{F}}\), if \(m\geq m_{\mathscr{F}}(\varepsilon ,\delta ,h)\), then for every distribution \({\mathcal {D}}\) over \({\mathcal {X}} \times \{\pm 1\}\), with a probability of at least \(1-\delta \) over the choice of \(S \sim {\mathcal {D}}^{m}\), it is ensured that
The following theorem [65] describes an important characterization of nonuniform learnability.
Theorem 5
Let \({\mathscr{F}}\) be a hypothesis set that can be written as a countable union of individual hypothesis sets:
where each \({\mathscr{F}}_{n}\) exhibits a uniform convergence property. Then, \({\mathscr{F}}\) is nonuniformly learnable.
Returning to our specific case, we can show that our hypothesis set \({\mathscr{F}}_{T}^{(\nu )}\) defined in (8.1) is nonuniformly learnable. To demonstrate this, we will introduce a sequence of hypothesis sets.
Evidently, these sets form the following relationships.
Next, we demonstrate that each set \({\mathscr{F}}_{T}^{(\nu )}(n)\) in (8.4) satisfies the uniform convergence property. We also use the notations
To assess the uniform convergence property of \({\mathscr{F}}_{T}^{(\nu )}(n)\) with respect to the loss function \(\tilde{l}(\cdot )\), it is necessary and sufficient to check that the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class [65].
Hereafter, we denote a probability space by \((\Omega ,{\mathscr{A}},P)\), where Ω is the sample space and \({\mathscr{A}}\) is the σ-algebra associated with the probability measure P. We also denote the corresponding empirical measure by \(P_{m}(A) =\frac{1}{m} \sum_{j=1}^{m} \delta _{\vec{\xi}_{j}}(A)\) for a Borel set A, with \(\delta (\cdot )\) being the Dirac measure, and define
Definition 4
Given a probability space \((\Omega ,{\mathscr{A}},P)\) and a set of integrable realvalued functions \({\mathscr{F}}\), we say that \({\mathscr{F}}\) is a Glivenko–Cantelli class for P if and only if
holds almost uniformly.
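A classical special case of Definition 4 can be checked numerically (our own toy illustration): with P the uniform distribution on \([0,1]\) and \({\mathscr{F}}\) the indicators \(\{\chi _{(-\infty ,t]}\}\), the supremum in question is the Kolmogorov statistic \(\sup_{t} |P_{m}((-\infty ,t]) - P((-\infty ,t])|\), which shrinks as the sample size m grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def kolmogorov_stat(m):
    """sup_t |F_m(t) - F(t)| for m uniform samples, F(t) = t on [0,1]."""
    xs = np.sort(rng.uniform(0, 1, size=m))
    # the empirical CDF jumps at the sample points; compare both one-sided
    # values of F_m with the true CDF there
    ecdf_hi = np.arange(1, m + 1) / m
    ecdf_lo = np.arange(0, m) / m
    return max(np.max(np.abs(ecdf_hi - xs)), np.max(np.abs(xs - ecdf_lo)))

# the sup deviation decays roughly like 1/sqrt(m)
devs = [kolmogorov_stat(m) for m in (100, 10_000, 1_000_000)]
print(devs)
```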
In the case of binary classification, being a Glivenko–Cantelli class is equivalent to satisfying the uniform convergence property [65]. Moreover, the following theorem is known [16]. Here, \(I^{d} = [0,1]^{d}\) with \(d \in {\mathbb{N}}\).
Theorem 6
Let \(K>0\) and let \({\mathscr{F}}_{1,K}(I^{d})\) be the set of Lipschitz continuous functions on \(I^{d}\):
Then, \({\mathscr{F}}_{1,K}(I^{d})\) is a Glivenko–Cantelli class for any probability measure P on \(I^{d}\).
Thus, if we impose Lipschitz continuity on the loss function, we can guarantee that the set \({\mathscr{L}}(n)\) becomes a Glivenko–Cantelli class for each n.
Theorem 7
Suppose that the assumptions of Theorem 3 hold. Let \(T>0\) be arbitrary, \({\mathcal {Z}} = {\mathcal {X}} \times \{\pm 1\}\) with \({\mathcal {X}} \subset {\mathbb{R}}^{J}\) being compact, and let a loss function \(l(\cdot ):{\mathcal {Z}} \times L_{2}(I)\times L_{2}({\mathcal {H}}_{T}) \rightarrow {\mathbb{R}}\) be of the form
with a function \(\tilde{l}(a,y):{\mathbb{R}}\times {\mathbb{R}}\rightarrow {\mathbb{R}}\) being Lipschitz continuous with respect to \((a,y)\) with Lipschitz coefficient L. Then, the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class.
Proof
For brevity, we denote \(\|w_{0}\|_{L_{2}(I)}\) and \(\|w_{1}\|_{L_{2}({\mathcal {H}}_{T})}\) by \(w_{0}\) and \(w_{1}\), respectively. Without loss of generality, we can assume that \({\mathcal {X}} =I^{J}\). Under the assumptions of the theorem, we have
In order to verify the continuity of \(u(T,x;w_{1},\vec{\xi},\nu )\) with respect to ξ⃗, we appeal to a standard energy estimate. Let us denote \(u(t,x;w_{1},\vec{\xi}_{i},\nu )\) (\(i=1,2\)) by \(u_{i}(t,x)\) and \(\tilde{u}(t,x) \equiv u_{1}(t,x)u_{2}(t,x)\). Then, we have
By Gronwall’s inequality, we obtain [74]
By noting \(\|u(0;\vec{\xi}_{i})\|^{2} =\frac{1}{J} \|\vec{\xi}_{i}\|_{{\mathbb{R}}^{J}}^{2}\), and consequently \(\|\tilde{u}(0)\|^{2} = \frac{1}{J}\|\vec{\xi}_{1}-\vec{\xi}_{2} \|_{{\mathbb{R}}^{J}}^{2}\), and combining (8.5) and (8.6), we obtain
By Theorem 6, this implies that \({\mathscr{L}}(n)\) forms a Glivenko–Cantelli class. □
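For reference, the Gronwall step used in this proof has the following generic shape (our reconstruction of the standard argument; the constant \(C\) is a placeholder for whatever the energy estimate produces from ν, ϕ, and \(w_{1}\)):

```latex
% Generic Gronwall step (sketch; C is our placeholder constant)
\frac{d}{dt}\,\bigl\|\tilde{u}(t)\bigr\|_{L_{2}(I)}^{2}
  \;\le\; C\,\bigl\|\tilde{u}(t)\bigr\|_{L_{2}(I)}^{2}
\quad\Longrightarrow\quad
\bigl\|\tilde{u}(t)\bigr\|_{L_{2}(I)}^{2}
  \;\le\; e^{Ct}\,\bigl\|\tilde{u}(0)\bigr\|_{L_{2}(I)}^{2},
\qquad 0 \le t \le T.
```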
Theorem 7 implies that our model achieves the uniform convergence property of the hypothesis set under the boundedness of \(\|w_{0}\|_{L_{2}(I)}\) and \(\|w_{1}\|_{L_{2}({\mathcal {H}}_{T})}\) and the compactness of the input space \({\mathcal {X}}\) on which \({\mathcal {D}}\) is defined. Thus, for each \(n \in {\mathbb{N}}\), we establish that \({\mathscr{F}}_{T}^{(\nu )}(n)\) has the uniform convergence property with respect to this \(l(\cdot )\) and \({\mathcal {D}}\).
Before introducing another theorem, let us present a known lemma [80] concerning the covering number \(N(\cdot )\) and the bracketing number \(N_{[\,]}(\cdot )\). We defer the definitions of these quantities to other references (see, for instance, [16, 25, 80]).
Lemma 6
Let \({\mathcal {F}} = \{f_{t} \mid t \in {\mathcal {T}}\}\) be a class of functions defined on a set \({\mathcal {X}}\) satisfying Lipschitz continuity in the index parameter:
for some fixed function \(F(\cdot )\), where \(d(\cdot ,\cdot )\) is a metric on the index space \({\mathcal {T}}\). Then, for any norm \(\|\cdot \|\), \(N_{[\,]} ( 2\varepsilon \|F\|,{\mathcal {F}} ,\|\cdot \| ) \leq N( \varepsilon ,{\mathcal {T}},d)\).
We also introduce the following lemma concerning the metric entropy of a set of functions.
Lemma 7
Let \(B_{M} \equiv \{u \in H^{1}(I) \mid \|u\|_{H^{1}(I)} \leq M\}\). Then, \(B_{M}\) is relatively compact in \(L_{2}(I)\) and satisfies
where K is a constant.
Proof
This lemma is proved by taking \(p=q=2\) in Theorem 4.3.36 of [25] and noting the inclusion of function spaces \(H^{1}(I) \subset B_{2,\infty}^{1,W}(I)\), where \(B_{2,\infty}^{1,W}(I)\) is the Besov space defined in [25]. □
In the optimization procedure, \(w_{0}\) is often determined depending on \(w_{1}\), and consequently on \(u(T,x;w_{1})\). Based on Lemmas 6 and 7, we can assert the following theorem.
Theorem 8
Under the assumptions of Theorem 7, suppose that \(\tilde{l}(a,y)\) is Lipschitz continuous with respect to its first argument a, and that \(w_{0}\) can be determined as a functional of \(u(T,x;w_{1})\), \(w_{0} = w_{0}(u(T,x;w_{1}))\), satisfying
with some \(L_{w}>0\). Then, the set \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class.
Proof
Let us simply denote \(w_{0}=w_{0}(u(T,x;w_{1}))\) and \(w_{0}^{\prime }=w_{0}(u(T,x;w_{1}^{\prime}))\). We first show that
where \(c_{T}^{(\nu )}>0\) is some constant depending on T and ν. Here, we have used the assumption on \(w_{0}\) as well as the assumption \(\|w_{0}\|_{L_{2}(I)} \leq n\), and the boundedness of \(u(T,\cdot )\), which can be derived as follows.
Applying the standard energy estimate to (3.3) yields the following estimate:
By introducing the notation \(c_{1}=|\phi (0)|^{2}\), together with Gronwall’s inequality, we obtain [74]
By noting \(\|u(0;\vec{\xi})\|^{2} =\frac{1}{J} \|\vec{\xi}\|_{{\mathbb{R}}^{J}}^{2}\), together with (8.9), we obtain the following:
Moreover, we can estimate the right-hand side of (8.8) (we omit the procedure of this estimate, for it is quite similar to the deduction of (8.6)). This, combined with (8.8), implies that the assumption of Lemma 6 is satisfied if we regard \({\mathcal {L}}(n)\) as a set of functions indexed by a set of functions of the form \(u(T,\cdot ;w_{1}) \in H^{1}(I)\). Indeed, in this case, (8.7) holds, where \(d(\cdot ,\cdot )\) is the \(L_{2}(I)\)-norm and \(F(x)\) is a constant. Thus, Lemma 6 implies \(N_{[]} ( 2\varepsilon c_{T}^{(\nu )},{\mathcal {L}}(n) ,\|\cdot \| ) \leq N(\varepsilon ,{\mathcal {B}}_{H^{1}(I)}^{M},\|\cdot \|_{L_{2}(I)}) \leq \frac{KM}{\varepsilon}\), where \({\mathcal {B}}_{H^{1}(I)}^{M}\) denotes a ball in \(H^{1}(I)\) with radius M. Because a finite bracketing number implies that the function space is a Glivenko–Cantelli class, this completes the proof. □
Remark 11
Note that in the proof of Theorem 8, the estimate above depends on T and ν, which implies that the generalization performance may depend on them. As a special case, when \(\phi (\cdot )\) is bounded, we obtain the following:
which implies that the increase of T may lead to a smaller covering number.
We have seen that under some conditions, \({\mathscr{L}}(n)\) is a Glivenko–Cantelli class, and consequently, \({\mathcal {F}}_{T}^{(\nu )}(n)\) has a finite VC-dimension and sample complexity, say \(d_{n}\) and \(m^{UC}_{{\mathscr{F}}_{T}^{(\nu )}(n)}(\epsilon ,\delta )\), respectively. To examine the nonuniform learnability of \({\mathcal {F}}_{T}^{(\nu )}\), let us consider
Then, the following clearly holds for each \(n \in {\mathbb{N}}\).
In addition, if we consider a family of functions \(w(n):{\mathbb{N}} \rightarrow [0,1]\) that satisfies \(\sum_{n=1}^{\infty }w(n) \leq 1\), we have the following approach called structural risk minimization (SRM) (Algorithm 1) [65]:
Theorem 9
Let \({\mathscr{F}}\) be a hypothesis class such that \({\mathscr{F}} = \bigcup_{n} {\mathscr{F}}_{n}\), where each \({\mathscr{F}}_{n}\) has the uniform convergence property with sample complexity \(m_{{\mathscr{F}}_{n}}^{UC}\). Let \(w:{\mathbb{N}} \rightarrow [0,1]\) be defined as \(w(n)=6/(n^{2}\pi ^{2})\). Then, \({\mathscr{F}}\) is nonuniformly learnable using the SRM scheme at a rate
Theorem 9, with \({\mathscr{F}}_{n}\) replaced by \(\tilde{l} \circ {\mathscr{F}}_{T}^{(\nu )}(n)\), guarantees that our PDE-based neural network has nonuniform learnability.
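The SRM selection rule described above can be sketched as follows. The weights \(w(n)=6/(n^{2}\pi^{2})\) are those of Theorem 9; the square-root penalty is the standard SRM form for classes with the uniform convergence property and is our illustrative choice, not a formula quoted from the paper.

```python
import math

def srm_select(emp_risks, m, delta=0.05):
    """Structural risk minimization: choose the class index n minimizing
    empirical risk plus a complexity penalty, with weights
    w(n) = 6 / (n^2 * pi^2) as in Theorem 9."""
    def penalty(n):
        w = 6.0 / (n ** 2 * math.pi ** 2)
        return math.sqrt((math.log(1.0 / w) + math.log(2.0 / delta)) / (2 * m))

    # emp_risks[n-1] = empirical risk of the best hypothesis found in F_n
    scores = {n: r + penalty(n) for n, r in enumerate(emp_risks, start=1)}
    return min(scores, key=scores.get)
```

With many samples the data term dominates and a richer class may be selected; with few samples the penalty favors small n.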
8.2 Numerical computation
Finally, we conducted some numerical experiments to evaluate the performance of our model on practical datasets. Because the main focus of the present paper is the theoretical argument, these experiments serve as a first check of the effectiveness of our model. In the following, we first clarify the setting of the numerical experiments and then state the results.
8.2.1 Settings
In this experiment, we focused exclusively on binary classification. The proposed model was implemented using Python 3.7 on a Windows Server 2019 (64-bit) machine with a 12th Gen Intel(R) Core(TM) i7-12700 CPU at 2.11 GHz and 96.0 GB of RAM. We used the time step \(\triangle t = 5 \times 10^{-4}\) and a range of values for the number of temporal and spatial grid points, denoted by N and L, respectively. At the output layer, we employed a logistic regression scheme with \(L_{1}\) regularization using statsmodels [71]. The optimization of \(w_{1}\) in our model involved optimizing the values \(w_{1}(i_{1},i_{2},i_{3})\) (\(i_{1}=1,2,\ldots ,N\), \(i_{2}, i_{3}=1,2, \ldots ,L\)), each of which is a discretized version of \(w_{1}(t,x,y)\). Optimization was conducted using a genetic algorithm with the deap library [13] in Python.
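A dependency-free sketch of the genetic-algorithm step is given below. It is a stand-in for the deap-based optimizer; the loss function, population size, truncation selection, and Gaussian mutation are illustrative choices, not the paper's settings.

```python
import random

def genetic_optimize(loss, dim, pop_size=20, generations=30, mut_sigma=0.3, seed=0):
    """Minimize `loss` over real vectors of length `dim` (e.g. the flattened,
    discretized weights w1(i1, i2, i3)) with a simple elitist GA."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=loss)                    # rank individuals (lower loss = fitter)
        elite = pop[: pop_size // 2]          # truncation selection keeps the best half
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, dim) if dim > 1 else 0   # one-point crossover
            child = [g + rng.gauss(0.0, mut_sigma)          # Gaussian mutation
                     for g in p1[:cut] + p2[cut:]]
            children.append(child)
        pop = elite + children
    return min(pop, key=loss)

# Toy usage: recover the minimizer of a separable quadratic loss.
best = genetic_optimize(lambda v: sum((g - 0.5) ** 2 for g in v), dim=4)
```

Because the elite is carried over unchanged, the best loss found is non-increasing across generations.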
8.2.2 Datasets
Numerical simulations were conducted with the “adult income” [5] and “diabetes” [15] datasets, which are well-known benchmarks for binary classification.
The former dataset contains 121 attributes of adults together with their annual income. The task is to predict whether the income is larger than 50 thousand dollars (which corresponds to the label “1”) or not (“0”). The latter dataset contains eight attributes of human subjects and a binary label indicating whether each subject had symptoms of diabetes.
Table 1 presents an overview of the datasets. For both datasets, we employed 70% of the data for training, and the remaining part was used to check the test accuracy.
We applied a min–max scaler, which transforms the values of each attribute onto the interval \([0,1]\).
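The scaling step amounts to the following trivial sketch (in practice a library implementation, e.g. scikit-learn's `MinMaxScaler`, would typically be used; the paper does not specify one):

```python
def min_max_scale(column):
    """Map one attribute's values onto [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in column]

print(min_max_scale([2, 4, 6]))  # -> [0.0, 0.5, 1.0]
```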
8.2.3 Results of experiments
Tables 2 and 3 show the training and test accuracies and the area under the curve (AUC) (boldface indicates the largest value for each indicator) under a range of values of T and of the number of points in the discretization of the spatial and temporal directions. The performance of the proposed method was comparable to that of the existing methods (Random Forest Classifier (RFC), Support Vector Classifier (SVC) with RBF kernel, XGBoost, and LightGBM) in terms of test accuracy and AUC. Note that for the existing methods, we tuned the hyperparameters by using cross-validation and grid search.
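The AUC reported above can be computed directly from scores and binary labels via the Mann–Whitney U statistic; the following is a dependency-free sketch, not the evaluation code used in the experiments:

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic,
    with averaged ranks for tied scores. `labels` are 0/1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                                # extend over a block of tied scores
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1     # average 1-based rank for the block
        i = j + 1
    pos = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect ranking of positives above negatives.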
The number of generations and the population size in the genetic algorithm are 5 and 10, respectively, for the “adult income” dataset, and 10 and 200 for the “diabetes” dataset. This is due to the fact that the “adult income” dataset is larger and requires a much longer computation time. From Tables 2 and 3, we observe that the performance of our model varies depending on the value of T.
In summary, the considerations in this section lead to the following observations:

(i)
Although our model has an infinite VC-dimension, it is still nonuniformly learnable under some assumptions on the underlying distribution behind the dataset. This property is also observed in some well-known machine learning algorithms, such as the Support Vector Machine (SVM) with kernels.

(ii)
By adjusting the parameters, we can control the generalization performance of our model. On the one hand, optimal values of the parameters yield a model with a lower generalization error. On the other hand, this enlarges the search space during optimization, leading to the concern that we might not attain a (sub)optimal solution within a realistic computation time. Therefore, in future work, we will continue to search for an effective approach to optimizing our model.
9 Conclusion
This study has demonstrated the universal approximation property of our PDE-based neural network: any continuous function on a compact set in \({\mathbb{R}}^{J}\) can be approximated by the output of the network with arbitrary precision.
We have also discussed the learnability of our model. Moreover, we implemented the model on a computer and performed numerical experiments. It showed performance comparable to that of existing models, such as RFC, SVC, LightGBM, and XGBoost. It was also shown that the generalization performance can be adjusted through some parameters of the model. The exploration of more effective optimization procedures is left for future work.
Future work will consider the limit as ν tends to zero, in which case the proposed model can be regarded as the continuous limit of the usual neural network, or as one with an artificial diffusion term. Although we observed weak convergence of our solution, we should appeal to the theory of singular perturbation to account for the thin boundary layer arising from the boundary condition.
There is also room for improvement in the optimization procedure. We are planning to explore Bayesian optimization approaches, which we have already attempted for ODE-based neural networks [33]. In this connection, it is also important to discuss the PAC-Bayes perspective of the proposed model.
Additionally, we intend to extend our PDE-based neural network to multidimensional Euclidean spaces. As stated in Remark 5 at the end of Sect. 4, this is necessary when considering a GNN in which the elements are treated in matrix form.
Availability of data and materials
The datasets generated and/or analyzed during the current study are available in: (i) UCI Machine Learning Repository, [https://doi.org/10.24432/C5XW20], (ii) Kaggle repository, [https://www.kaggle.com/datasets/akshaydattatraykhare/diabetesdataset].
References
Aizawa, Y., Kimura, M.: Universal approximation properties for ODENet and ResNet. CoRR (2021). arXiv:2101.10229
Annunziato, M., Borzì, A.: A Fokker–Planck control framework for multidimensional, stochastic processes. J. Comput. Appl. Math. 237, 487–507 (2013). https://doi.org/10.1016/j.cam.2012.06.019
Baker, G.A., Bramble, J.H., Thomee, V.: Single step Galerkin approximations for parabolic problems. Math. Comput. 31, 818–847 (1977). https://doi.org/10.2307/2006116
Barbu, V.: Analysis and Control of Nonlinear Infinite Dimensional Systems. Academic Press, London (2012)
Becker, B., Kohavi, R.: Adult income dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20
Baum, E.B., Haussler, D.: What size net gives valid generalization? Neural Comput. 1, 151–160 (1989). https://doi.org/10.1162/neco.1989.1.1.151
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Singapore (2006)
Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. (1998). https://doi.org/10.1023/A:1009715923555
Chamberlain, B.P., et al.: GRAND: graph neural diffusion. In: Proc. ICML 2021 (2021)
Chen, R.T.Q., et al.: Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31, 6572–6583 (2018)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989). https://doi.org/10.1007/BF02551274
Dautray, R., Lions, L.J.: Mathematical Analysis and Numerical Methods for Science and Technology, vol. 5. Springer, Berlin (1991)
Deap (2023). https://deap.readthedocs.io/en/master/
DeVore, R., Hanin, B., Petrova, G.: Neural network approximation. Acta Numer. 30, 327–444 (2021). https://doi.org/10.1017/S0962492921000052
Diabetes dataset: Kaggle (2020). https://www.kaggle.com/datasets/akshaydattatraykhare/diabetesdataset
Dudley, R.M.: Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge (1999). https://doi.org/10.1017/CBO9780511665622
Dupont, E., Doucet, A., Teh, Y.W.: Augmented neural ODEs. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Red Hook (2019)
Esteve-Yagüe, C., et al.: Large-time asymptotics in deep learning (2021). https://hal.archivesouvertes.fr/hal02912516
Esteve-Yagüe, C., Geshkovski, B.: Sparse approximation in learning via neural ODEs (2021). arXiv:2102.13566
FernándezCara, E., et al.: Null controllability of linear heat and wave equations with nonlocal spatial terms. SIAM J. Control Optim. 54, 2009–2019 (2016). https://doi.org/10.1137/15M1044291
Fujita, H., Mizutani, A.: On the finite element method for parabolic equations, I; approximation of holomorphic semigroups. J. Math. Soc. Jpn. 28, 749–771 (1976). https://doi.org/10.2969/jmsj/02840749
Funahashi, K.: On the approximate realization of continuous mappings by neural networks. Neural Netw. 2, 183–192 (1989). https://doi.org/10.1016/08936080(89)900038
Funahashi, K., Nakamura, Y.: Neural networks, approximation theory, and dynamical systems. In: Structure and Bifurcation of Dynamical Systems, Suurikaiseki Kenkyuujo Kokyuroku, pp. 18–37 (1992). http://hdl.handle.net/2433/82914
Geshkovski, B., Zuazua, E.: Turnpike in optimal control of PDEs, ResNets, and beyond. Acta Numer. 31, 135–263 (2022). https://doi.org/10.1017/S0962492922000046
Giné, E., Nickl, R.: Mathematical Foundations of InfiniteDimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge (2015). https://doi.org/10.1017/CBO9781107337862
González-Burgos, M., de Teresa, L.: Some results on controllability for linear and nonlinear heat equations in unbounded domains. Adv. Differ. Equ. 12, 1201–1240 (2007). https://doi.org/10.57262/ade/1355867413
Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse Probl. 34, 014004 (2017). https://doi.org/10.1088/13616420/aa9a90
Han, E.W., Han, J., Li, Q.: A meanfield optimal control formulation of deep learning. Res. Math. Sci. 6, 10 (2019). https://doi.org/10.1007/s406870180172y
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE Comput. Soc., Los Alamitos (2016). https://doi.org/10.1109/CVPR.2016.90
Hoff, D., Smoller, J.: Error bounds for finitedifference approximations for a class of nonlinear parabolic systems. Math. Comput. 45, 35–49 (1985). https://doi.org/10.2307/2008048
Honda, H.: On continuous limit of neural network. In: Proc. of NOLTA 2020 (2020)
Honda, H.: On a partial differential equation based neural network. IEICE Commun. Express 10, 137–143 (2021). https://doi.org/10.1587/comex.2020XBL0174
Honda, H., et al.: An ODEbased neural network with bayesian optimization. JSIAM Lett. 15, 101–104 (2023). https://doi.org/10.1587/comex.2020XBL0174
Honda, H.: Approximating a multilayer neural network by an optimal control of a partial differential equation. Preprint
Hornik, K., et al.: Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989). https://doi.org/10.1016/08936080(89)900208
Irie, B., Miyake, S.: Capabilities of threelayered perceptrons. In: Proc. IEEE Int. Conf. on Neural Networks, pp. 641–648 (1988). https://doi.org/10.1109/ICNN.1988.23901
Ito, S.: Fundamental solutions of parabolic differential equations and boundary value problems. Jpn. J. Math., Trans. Abstr. 27, 55–102 (1957). https://doi.org/10.4099/jjm1924.27.055
Kac, V.G., Cheung, P.: Quantum Calculus. Springer, New York (2001)
Kato, T.: Perturbation Theory for Linear Operators, 2nd edn. Springer, New York (1976)
Koenderink, J.J.: The structure of images. Biol. Cybern. 50, 363–370 (1984). https://doi.org/10.1007/BF00336961
Kolmogorov, A.N.: On the representation of continuous function of many variables by superposition of continuous function of one variable and addition. Dokl. Akad. Nauk SSSR 144, 679–681 (1957)
Laakmann, F., Petersen, P.C.: Efficient approximation of solutions of parametric linear transport equations by ReLU DNNs. Adv. Comput. Math. 47, 11 (2021)
Leshno, M., et al.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6, 303–314 (1993). https://doi.org/10.1016/S08936080(05)801315
Li, Q., et al.: Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 5998–6026 (2017)
Li, Q., Lin, T., Shen, Z.: Deep learning via dynamical systems: an approximation perspective. J. Eur. Math. Soc. (2019). https://doi.org/10.4171/jems/1221
Li, Z., Shi, Z.: Deep residual learning and PDEs on manifold (2017). arXiv:1708.05115
Lions, J.L.: Perturbations Singulières dans les Problèmes aux Limites et en Contrôle Optimal. Springer, Berlin (1973)
Lions, J.L.: Exact controllability, stabilization and perturbations for distributed systems. SIAM Rev. 30, 1–68 (1988). https://doi.org/10.1137/1030001
Lions, J.L., Magenes, E.: Nonhomogeneous Boundary Values Problems and Applications I. Springer, Berlin (1972)
Lions, P.L.: Une vision mathématique du Deep Learning (2018). https://www.collegedefrance.fr/fr/agenda/seminaire/mathematiquesappliquees/unevisionmathematiquedudeeplearning
Lippmann, R.: An introduction to computing with neural nets. IEEE ASSP Mag. 4, 4–22 (1987). https://doi.org/10.1109/MASSP.1987.1165576
Liu, H., Markowich, P.: Selection dynamics for deep neural networks. J. Differ. Equ. 269, 11540–11574 (2020). https://doi.org/10.1016/j.jde.2020.08.041
Lohéac, J., Zuazua, E.: From averaged to simultaneous controllability. Ann. Fac. Sci. Toulouse, Math. 25, 785–828 (2016)
Neal, R.M.: Bayesian Learning for Neural Networks. Springer, Berlin (1996)
Nirenberg, L.: Topics in Nonlinear Functional Analysis. Am. Math. Soc., Providence (2001)
Oono, K., Suzuki, T.: Graph neural networks exponentially lose expressive power for node classification (2020). https://api.semanticscholar.org/CorpusID:209994765
Pachpatte, B.G., Ames, W.F.: Inequalities for Differential and Integral Equations. Academic Press, London (1997)
Perona, P., Malik, J.: Scalespace and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12, 629–639 (1990). https://doi.org/10.1109/34.56205
Rodriguez, I.D.J., Ames, A.D., Yue, Y.: Lyanet: a Lyapunov framework for training neural ODEs. CoRR (2022). arXiv:2202.02526
Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65, 386–408 (1958)
Ruiz-Balet, D., Zuazua, E.: Neural ODE control for classification, approximation and transport. SIAM Rev. 65, 735–773 (2023). https://doi.org/10.1137/21M1411433
Rusch, T.K., et al.: Graphcoupled oscillator networks. CoRR (2022). arXiv:2202.02296
Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential equations. J. Math. Imaging Vis. 62, 352–364 (2020). https://doi.org/10.1007/s10851019009031
Ryu, S.U., Yagi, A.: Optimal control of Keller–Segel equations. J. Math. Anal. Appl. 256, 45–66 (2001)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning. Cambridge University Press, Padstow Cornwall (2014)
Shen, Z., Yang, H., Zhang, S.: Nonlinear approximation via compositions. CoRR (2019). arXiv:1902.10170
Sonoda, S., Murata, N.: Double continuum limit of deep neural networks. In: Proc. of ICML 2017, Workshop on Principled Approaches to Deep Learning (2017)
Sonoda, S., Murata, N.: Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20, 1–52 (2019)
Sontag, E., Sussmann, H.: Complete controllability of continuoustime recurrent neural networks. Syst. Control Lett. 30, 177–183 (1997). https://doi.org/10.1016/S01676911(97)000029
Sprecher, D.A.: On the structure of continuous functions of several variables. Trans. Am. Math. Soc. 115, 340–355 (1965). https://doi.org/10.2307/1994273
Statsmodels (2023). https://www.statsmodels.org/
Stelzer, F., et al.: Deep neural networks using a single neuron: folded-in-time architecture using feedback-modulated delay loops. Nat. Commun. 12, 1–10 (2021). https://doi.org/10.1038/s41467021254274
Tabuada, P., et al.: Universal approximation power of deep residual neural networks through the lens of control. IEEE Trans. Autom. Control 68, 2715–2728 (2023). https://doi.org/10.1109/TAC.2022.3190051
Temam, R.: InfiniteDimensional Dynamical Systems in Mechanics and Physics. Springer, New York (1997)
Teshima, T., et al.: Couplingbased invertible neural networks are universal diffeomorphism approximators. CoRR (2020). arXiv:2006.11469
Teshima, T., et al.: Universal approximation property of neural ordinary differential equations (2020). arXiv:2012.02414
Thomée, V.: Galerkin Finite Element Methods for Parabolic Problems. Springer, Berlin (2006)
Thorpe, M., van Gennip, Y.: Deep limits of residual neural networks. Res. Math. Sci. 10, 6 (2023). https://doi.org/10.1007/s4068702200370y
Trotter, H.F.: Approximation of semigroups of operators. Pac. J. Math. 8, 887–919 (1958)
Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York (1996). https://doi.org/10.1007/9781475725452
Vainikko, G.: Funktionalanalysis der Diskretisierungsmethoden. Teubner, Leipzig (1976)
Weickert, J.: Anisotropic Diffusion in Image Processing (1998). https://www.mia.unisaarland.de/weickert/Papers/book.pdf
Weinan, E.: A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5, 1–11 (2017). https://doi.org/10.1007/s403040170103z
Williams, C.: Computing with infinite networks. In: Mozer, M., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9. MIT Press, Cambridge (1996)
Yun, B.I.: A neural network approximation based on a parametric sigmoidal function. Mathematics 7, 262 (2019). https://www.mdpi.com/22277390/7/3/262
Chen, Y., Pock, T.: Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1256–1272 (2017). https://doi.org/10.1109/TPAMI.2016.2596743
Zeidler, E.: Nonlinear Functional Analysis and Its Applications. Springer, New York (1986)
Zhang, H., et al.: Approximation capabilities of neural ODEs and invertible residual networks. In: Daumé, H., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 11086–11095 (2020)
Acknowledgements
We thank the anonymous reviewers, whose comments and suggestions greatly helped improve and clarify this manuscript. We also thank Mamoru Miyazawa, who contributed to the numerical experiments in this study.
Funding
This work was supported by Toyo University Top Priority Research Program.
Author information
Authors and Affiliations
Contributions
There is only one author for this paper. The author read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Ethics approval was not required for this study.
Competing interests
The author declares no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Summary of notations
We summarize in Table 4 below the notations used in this paper that are not presented in Sect. 2.
Appendix B: Proofs of existence
B.1 Proof of Theorem 1
Before introducing our first result, we shall define the Galerkin approximation [12].
Definition 5
Let V be a separable Hilbert space and \(\{V_{m}\}_{m=1}^{\infty}\) be a family of finite dimensional vector spaces satisfying the assumptions (i) and (ii) below.

(i)
\(V_{m} \subset V\), \(\operatorname{dim}V_{m} < +\infty \).

(ii)
\(V_{m} \rightarrow V \) (\(m \rightarrow \infty \)) in the following sense: there exists a dense subspace of V, every element v of which has a corresponding sequence \(\{v_{m}\}_{m=1}^{\infty }\), with \(v_{m} \in V_{m}\), satisfying \(\|v_{m}-v\|_{V} \rightarrow 0\) (\(m \rightarrow +\infty \)).
Then, each space \(V_{m}\) (\(m=1,2,\ldots \)) is called the Galerkin approximation of order m of V.
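As a concrete illustration (our example; it is not a choice mandated by [12]), the piecewise-linear finite element spaces on \(I=(0,1)\) form such a Galerkin family:

```latex
% Hat functions on the uniform grid x_j = j/m, j = 0, 1, ..., m:
W_{jm}(x) = \max\bigl(0,\; 1 - m\,\lvert x - x_{j}\rvert \bigr),
\qquad
V_m = \operatorname{span}\{ W_{0m}, \dots, W_{mm} \} \subset H^{1}(I).
```

Each \(V_m\) is finite dimensional, and for every v in the dense subspace \(C^{\infty}(\bar{I})\) the interpolants \(v_m = \sum_j v(x_j) W_{jm}\) satisfy \(\|v_m - v\|_{H^{1}(I)} \rightarrow 0\) as \(m \rightarrow \infty\), so conditions (i) and (ii) hold.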
Now, we prove Theorem 1. First, let us introduce a space
We note the following fact (see [12], Chapter XVIII, Theorem 1):
holds. We shall seek \(T_{u_{0}}\) and \(v \in \mathfrak{W}(T_{u_{0}})\) that solve (3.3) in the following sense:
in the sense of \(( C_{0}^{\infty}(0,T) )^{\prime}\) for all \(w \in H^{1}(I)\). Note that due to (B.1), the initial condition in the second equation of (B.2) is meaningful. We prove this in the following steps [12, 64]. First, assuming the temporally local solvability of the problem, we prove the uniqueness of the local solution. Second, we prove the existence of a local solution up to a certain time \(T_{u_{0}}\). Let us assume that we have two temporally local solutions to (B.2) on a time interval \([0,T^{*}]\), say \(v^{(1)}\) and \(v^{(2)}\), which belong to the space mentioned in Theorem 1 and the subsequent Remark 1.
We introduce the notation \(\tilde{v} \equiv v^{(1)}-v^{(2)}\). This should satisfy:
where
Replacing w with \(\tilde{v}(t,x)\) on both sides of (B.3) and applying the Schwarz inequality, we observe:
where \(L>0\) is the Lipschitz constant of \(\phi (\cdot )\). This, together with the Gronwall’s inequality [57] and the fact that \(\tilde{v} _{t=0}=0\), yields
which implies the uniqueness of the solution.
Next, we prove the existence of a local solution. Let \(\{V_{m}\}_{m=1}^{\infty}\) be an increasing family of \(d_{m}\)-dimensional subspaces of \(H^{1}(I)\), in which each \(v\in H^{1}(I)\) has an approximating sequence \(\{v^{(m)}\}_{m=1}^{\infty}\) such that \(v^{(m)} \in V_{m}\) for each m and \(\|v^{(m)}-v\|_{H^{1}(I)}\rightarrow 0\) as \(m \rightarrow \infty \). Because \(V_{m}\) is a Galerkin approximation of \(L_{2}(I)\) as well, we have a sequence \(\{\tilde{u}_{0m}\}_{m=1}^{\infty}\) such that
Let \(\{W_{jm}\}_{j=1}^{d_{m}}\) be a basis of \(V_{m}\). We seek \(v^{(m)}\) and \(\tilde{u}_{0m}\) in the form of linear combinations of \(\{W_{jm}\}_{j=1}^{d_{m}}\) that solve
Because the \(W_{jm}\) are linearly independent, (B.4) is assured to have a local solution \(v^{(m)} \in C(0,T_{u_{0}};V_{m})\) for some \(T_{u_{0}}\). It also satisfies \(\frac{\mathrm{d}v^{(m)}}{\mathrm{d}t} \in L_{2}(0,T_{u_{0}};V_{m})\) under the assumptions of the theorem.
Next, we derive an a priori estimate. Let us multiply both sides of (B.4) by the j-th coefficient of \(v^{(m)}\) and sum over \(j=1,2,\ldots ,d_{m}\). Then, we have
Regarding the right-hand side, with the notation \(c_{1} = |\phi (0)|^{2}\), we estimate from above as follows.
This yields
from which, together with Gronwall’s inequality again, we obtain
This enables us to extract a subsequence \(\{v^{(m^{\prime})}\} \subset \{v^{(m)}\}\) satisfying the following, with some \(v_{\infty }\in L_{2}(I)\):
By virtue of Rellich’s theorem, we have
Now, we are in a position to check that this \(v_{\infty}\) indeed solves (B.2). To this end, we take an arbitrary smooth function \(\zeta (t) \in C_{0}^{\infty}(0,T)\), \(\breve{w} \in H^{1}(I)\), and a sequence \(\{w_{m}\}_{m} \subset H^{1}(I)\) satisfying
and define \(\psi _{m}\equiv \zeta (t)w_{m}\) and \(\psi \equiv \zeta (t)\breve{w}\) (note that because we work in a one-dimensional space, where \(H^{1}(I)\) can be embedded into \(C(I)\), we can regard \(\mathfrak{V} = H^{1}(I)\) in Definition 1 [12]). It is clear that as \(m \rightarrow +\infty \),
We may replace m above with the \(m^{\prime}\) prescribed. Thus, from (B.4), after integration by parts (note that \(\zeta (t) \in C_{0}^{\infty}(0,T)\)), we have
By virtue of (B.5) and (B.6), as \(m \rightarrow +\infty \), we have
The equality above holds for any \(\breve{w} \in H^{1}(I)\), and thus, this \(v_{\infty}\) solves (B.2).
Now, (B.8) can be rewritten as follows.
We can easily see
which, together with (B.1), yields the fact that \(v_{\infty}\) belongs to the same space mentioned in Theorem 1 and subsequent Remark 1.
Finally, we verify that \(v_{\infty}\) above satisfies the initial condition. Let \(\eta (t) \in C^{\infty}(0,T_{u_{0}})\) be a function that vanishes near \(T_{u_{0}}\) and satisfies \(\eta (0) \ne 0\). We again take \(\breve{w} \in H^{1}(I)\) and a sequence \(\{w_{m}\}_{m} \subset H^{1}(I)\) satisfying
Then, \(\psi = \eta (t) \breve{w} \in \mathfrak{W}(T_{u_{0}})\) and by integration by parts, we have
From Equation (B.2), we can derive
Moreover, from (B.4) we have
The left-hand side of (B.11) has another representation:
Letting \(m^{\prime}\) tend to +∞, on the one hand, (B.11) yields
where we used (B.12) to deduce the rightmost side. On the other hand, (B.12) yields
By comparing (B.12), (B.13), and (B.14), we arrive at
Because \(H^{1}(I)\) is dense in \(L_{2}(I)\), (B.15) holds for all \(\breve{w} \in L_{2}(I)\), which implies
This is the desired result.
B.2 Proof of Theorem 2
Here, we prove Theorem 2. Because local solvability is assured by Theorem 1, we assume that for some \(T^{*}>0\), we have a solution v of (3.3) on the interval \([0,T^{*}]\). Now, let us first introduce the variable:
with \(\lambda \in {\mathbb{R}}\) specified later, which solves
By multiplying both sides of (B.16) by v̆, we can deduce the following estimate:
By introducing the notation \(c_{1} = |\phi (0)|^{2}\) again and applying the Schwarz inequality and the Lipschitz continuity of ϕ, we have
Substituting this into (B.17), we obtain
If we denote
by the Gronwall’s inequality, we have
Applying the Schwartz’s inequality to the righthand side of (B.19) and taking λ so that
holds, then (B.19) yields
This implies that the norm \(\|\breve{v}(T^{*})\|\) does not depend on \(T^{*}\). Tracing the same argument as in [64], we obtain the statement of the theorem.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Honda, H. Universal approximation property of a continuous neural network based on a nonlinear diffusion equation. Adv Cont Discr Mod 2023, 43 (2023). https://doi.org/10.1186/s1366202303787z