Generalized kinetic theory of coarse-grained systems. I. Partial equilibrium and Markov approximations

The general kinetic theory of coarse-grained systems is presented in the abstract formalism of communication theory developed by Shannon and Weaver, Khinchin, and Kolmogorov. The martingale theory shows that, under reasonable, general hypotheses, coarse-grained systems can be approximated by generalized Markov systems. For mixing systems, the Kolmogorov entropy production can be defined for nonstationary processes as Kolmogorov defined it for stationary processes.


Introduction
The purpose of this article (and the one to follow) is to define a generalized kinetic theory of classical thermodynamic systems at a coarse-grained level (see Sect. 2 for definitions). The microscopic evolution of the system induces an evolution on the coarse-grained states, which is generally non-Markovian.
In the same context, it has been shown recently [1] that coarse-grained deterministic dynamical systems can be approximated by generalized Markov systems, which may explain why Markov processes are so popular in modeling actual phenomena. These conclusions were obtained by applying and extending some pioneering results of Kolmogorov [2][3][4]. The formalism used in our previous works was relatively intuitive, even if sometimes lengthy, but it was sufficient for our first aim. However, we had to adopt some hypotheses that could seem reasonable but were difficult to justify precisely.
In the present work, we adopt a more abstract and rigorous formalism and show that the previous results can be generalized to a much broader framework, as mentioned above. This formalism is the one used in communication theory by Shannon-Weaver [5] and by Khinchin [6] to define optimal coding, and by Kolmogorov [2,3] (see also [4] for a pedagogical exposition) to define entropic invariants of dynamical systems. It was also introduced in [7] in the Markovian situation only.
The system evolution is specified by a stationary distribution on the path space X^Z, where X is the finite set of coarse-grained states, and Z represents the discrete time (see Sect. 2). At the coarse-grained level, the stationary evolution is not Markovian, but the advantage is that the evolution takes place on the finite state space X and that we avoid all controversial discussions concerning ergodicity and time scales for reaching equilibrium [8,9].
Sections 2 and 3 fix notations and definitions and give basic examples. Section 4 introduces nonstationary processes: the initial condition is not the stationary state on X, but the evolution is given by the stationary process. It corresponds to the notion of partial equilibrium of Landau-Lifschitz [10]. We define the entropy production of both processes and show that they are equal, assuming a mixing property, in Sect. 5. Although this result seems obvious, its proof is quite lengthy.
In Sect. 6, we address the main question of kinetic theory, namely why the evolution can be approximated by a Markovian evolution, as in the theory of Brownian motion, the Fokker-Planck equation, etc. Obviously, one also has to use a coarse-grained time scale. We define various Markovian evolutions and prove that they approximate the exact evolution on the coarse-grained time scale using the production of relative entropy.
We want to dedicate this article to the memory of Prof. Mark Kac, who introduced one of us to the problems of justification of the Markov processes in statistical mechanics.

Notations and definitions
In this article, X denotes a finite set, and elements of X are denoted x ∈ X. Z is the set of all integers (positive, zero, or negative). For a finite subset I ⊂ Z, an element x(I) ∈ X^I is a map x(I) : I → X.
If J ⊂ I, then x(I)|_J is the restriction of the map x(I) to the subset J.
Finally, if m ≤ n and I = {m, m + 1, . . ., n} is the interval of integers between m and n, we denote

  x[m, n] = x(I) = (x(m), x(m + 1), . . ., x(n))   (2.5)

Probabilities on X^Z
A stochastic process on X is the data of a system of probabilities p_I on X^I for all finite sets I ⊂ Z, with the compatibility conditions: if J ⊂ I and x(J) ∈ X^J, then

  p_J(x(J)) = Σ_{x(I) : x(I)|_J = x(J)} p_I(x(I))   (2.6)

Obviously, the p_I are known as soon as the p_[m,n] are known.
It is known that a system of probabilities p_I satisfying the compatibility conditions of Eq. (2.6) defines a probability p on X^Z, the p_I being the marginal laws of p:

  p({x(Z) ∈ X^Z : x(Z)|_I = x(I)}) = p_I(x(I))   (2.7)

This result is the extension theorem of Kolmogorov [11]. The probability p is defined on the measurable subsets of X^Z. By definition, the subset appearing in Eq. (2.7) is measurable.
Let τ denote the time shift on X^Z, (τ x)(n) = x(n + 1). The stochastic process p is stationary if for any measurable set B ⊂ X^Z

  p(τ^{-1} B) = p(B)   (2.8)

or equivalently, for any x(I) ∈ X^I and any n ∈ Z,

  p_{I+n}(x(I + n)) = p_I(x(I))

where x(I + n)(i + n) = x(I)(i). In particular, if p is stationary, it defines a unique probability distribution p_0 on X by

  p_0(x) = p_{{m}}(x)  for any m ∈ Z

Remark (Conventions and definition) (i) In order to simplify the notations, we shall skip the index I of p_I whenever it is clear that we refer to p_I. For instance, we write p(x(I)) instead of p_I(x(I)).
(ii) When we use conditional probabilities, the condition is always in the past:

  p(x(n) | x[m, n - 1]) = p(x[m, n]) / p(x[m, n - 1]),  m < n

According to the usual definition [4], the stochastic process p is ergodic if any measurable set B ⊂ X^Z that is invariant by τ has probability p(B) = 0 or 1.

Coarse-graining
Let A be a partition of X: the elements a ∈ A are subsets a ⊂ X such that

  ∪_{a ∈ A} a = X   (2.9)

  a ∩ a' = ∅  if a ≠ a'   (2.10)

A probability q on X generates a probability q^(A) on A by

  q^(A)(a) = Σ_{x ∈ a} q(x)   (2.11)
The partition A of X induces the partitions A^Z of X^Z and A^I of X^I. The stochastic process p induces a stochastic process p^(A) defined by

  p^(A)_I(a(I)) = Σ_{x(I) ∈ a(I)} p_I(x(I))   (2.12)

where the notation x(I) ∈ a(I) means

  x(i) ∈ a(i)  for all i ∈ I   (2.13)
p^(A) is a coarse-grained process of p. If p is stationary, p^(A) is stationary. If p is ergodic, p^(A) is ergodic. Such coarse-grained processes are extensively used in physics and applied sciences when inaccurate observations do not allow one to distinguish two different elements x belonging to the same subset a of A [1,12].
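As a concrete numerical sketch (the state space, partition, and probabilities below are invented for illustration), coarse-graining a distribution q on X amounts to summing probabilities within each grain, q^(A)(a) = Σ_{x ∈ a} q(x):

```python
# Coarse-graining a probability distribution on a finite state space X
# by a partition A: each grain a carries the total weight of its elements.

def coarse_grain(q, partition):
    """q: dict mapping states x to probabilities; partition: list of grains (sets)."""
    return [sum(q[x] for x in a) for a in partition]

# Hypothetical example: X = {0,1,2,3}, grains a1 = {0,1} and a2 = {2,3}.
q = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
A = [{0, 1}, {2, 3}]
qA = coarse_grain(q, A)
print(qA)  # ≈ [0.3, 0.7] (up to float rounding)
```

The coarse-grained law is again a probability: the grain weights sum to 1 because A is a partition.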

Examples
We only cite a few well-known processes that are of interest to us.
(a) Bernoulli processes. Let μ be a probability on X. The Bernoulli process defined by μ is

  p_[m,n](x[m, n]) = μ(x(m)) μ(x(m + 1)) ⋯ μ(x(n))   (3.1)

(b) Markov processes. Let R = (R_{yx}) be a transition matrix on X (R_{yx} ≥ 0, Σ_x R_{yx} = 1). Let μ be a stationary probability for R:

  Σ_y μ(y) R_{yx} = μ(x)   (3.2)

Then, we define a stochastic process by

  p_[m,n](x[m, n]) = μ(x(m)) R_{x(m)x(m+1)} ⋯ R_{x(n-1)x(n)}   (3.3)

where the conditional probabilities are

  p(x(n + 1) | x[m, n]) = R_{x(n)x(n+1)}   (3.4)

This process is stationary. It is ergodic if and only if R is irreducible. The Bernoulli process (a) is the particular case R_{yx} = μ(x).
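For instance (a toy two-state transition matrix, chosen only for illustration), one can check the stationarity condition Σ_y μ(y) R_{yx} = μ(x) and evaluate path probabilities μ(x(m)) R_{x(m)x(m+1)} ⋯ R_{x(n-1)x(n)} numerically:

```python
# Markov process on X = {0,1}: stationary law mu of a transition matrix R,
# and the probability of a finite path under the stationary process.

R = [[0.9, 0.1],
     [0.3, 0.7]]          # R[y][x] = probability of y -> x; each row sums to 1
mu = [0.75, 0.25]         # stationary: 0.75*0.9 + 0.25*0.3 = 0.75, etc.

def path_prob(path):
    """mu(x0) * R[x0][x1] * ... * R[x_{n-1}][x_n] for a path (x0, ..., xn)."""
    p = mu[path[0]]
    for y, x in zip(path, path[1:]):
        p *= R[y][x]
    return p

# Stationarity check and a sample path probability:
print([sum(mu[y] * R[y][x] for y in (0, 1)) for x in (0, 1)])  # [0.75, 0.25]
print(path_prob([0, 0, 1]))  # 0.75 * 0.9 * 0.1 ≈ 0.0675
```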
(c) Dynamical systems. These systems are of special interest to physics (see [1] and Remark 2 below). Let (M, M, μ) be a probability space, so that M is a measurable space with a σ-algebra M of measurable subsets and μ a probability defined on M.
Let f : M → M be a measurable bijection which is measure-preserving, namely

  μ(f^{-1}(B)) = μ(B)  for any B ∈ M

Let X be a finite partition of M into measurable subsets. For z ∈ M and k ∈ Z, denote by x(k) the element of the partition X containing f^k(z):

  f^k(z) ∈ x(k)   (3.5)

We define a coarse-grained stochastic process on X by the formula

  p_[m,n](x[m, n]) = μ({z ∈ M : f^k(z) ∈ x(k), m ≤ k ≤ n})   (3.7)

where x[m, n] ∈ X^[m,n] is given by Eq. (3.5). Then p_[m,n](x[m, n]) is the measure of the subset of elements z ∈ M with f^k(z) ∈ x(k) for all m ≤ k ≤ n. This process is stationary. It is ergodic if f is ergodic (i.e., if the only measurable subsets B ⊂ M invariant by f have measure μ(B) = 0 or 1).
Remark 1 This definition, due to Kolmogorov, was introduced to define entropic invariants of dynamical systems [4].
Remark 2 A particularly interesting example [4] is the case when M is a phase space, f is a Hamiltonian map (i.e., the map given by the solution of the Hamilton equations at a given time), and μ is the volume on M, which is preserved by f because of the Liouville theorem.
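A minimal numerical sketch of this construction (the map and partition are invented for the example, not taken from the text): take M = [0, 1) with Lebesgue measure μ, f the measure-preserving rotation z ↦ z + α mod 1, and the two-cell partition X = {[0, 1/2), [1/2, 1)}. The path probability is the measure of the set of points z whose orbit visits the prescribed cells, which can be estimated by sampling:

```python
import random

ALPHA = 0.5 ** 0.5 / 2          # an irrational rotation number (illustrative choice)

def cell(z):
    """Index of the partition cell containing z: 0 for [0, 1/2), 1 for [1/2, 1)."""
    return 0 if z < 0.5 else 1

def estimate_path_prob(path, samples=200_000, seed=0):
    """Monte Carlo estimate of mu{z : f^k(z) in cell path[k] for all k}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        z = rng.random()
        if all(cell((z + k * ALPHA) % 1.0) == c for k, c in enumerate(path)):
            hits += 1
    return hits / samples

# The one-step marginal is the measure of each cell:
print(estimate_path_prob([0]))  # close to 0.5
```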

Changing initial conditions: definition of the nonstationary process p̄. Production of entropy

4.1 Definition of a particular nonstationary process p̄
Let A be a partition of X. The elements of A are subsets a ⊂ X satisfying Eqs. (2.9)-(2.10).
Let p be a stationary process on X, and q be a probability on A. These two data determine a process on X, namely a probability p̄ on X^N, given by the formulas

  p̄_[0,n](x[0, n]) = (q(a)/p(a)) p_[0,n](x[0, n])  if x(0) ∈ a ∈ A   (4.1)

  p̄_[m,n](x[m, n]) = Σ_{x[0,m-1]} p̄_[0,n](x[0, n]),  0 < m ≤ n   (4.2)

Here p(a) = Σ_{x ∈ a} p_0(x). Equations (4.1)-(4.2) show that the distributions p̄_[0,n] and p̄_[m,n] satisfy the compatibility conditions and define a probability on X^N, and a stochastic process (indexed by the integers ≥ 0) on X. This stochastic process is nonstationary (indeed, being indexed by the integers ≥ 0, stationarity is meaningless). The initial distribution is

  p̄_0(x) = (q(a)/p(a)) p_0(x)  for x ∈ a

and the distribution at time n is

  p̄_n(x) = Σ_{x(0)} p̄(x(0), 0; x, n)

Convention As previously mentioned, we skip the indices I for p̄_I when there is no possible confusion.
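A small numerical sketch of this partial-equilibrium construction (states, partition, and distributions are invented for illustration): within each grain a, the initial distribution keeps the conditional shape of the stationary distribution p_0 but carries the coarse-grained weight q(a):

```python
# Partial-equilibrium initial distribution: for x in grain a,
# pbar0(x) = q(a) * p0(x) / p0(a), so the coarse-grained law of pbar0 is q
# while the conditional law within each grain is the stationary one.

p0 = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}    # stationary one-time distribution on X
A = [{0, 1}, {2, 3}]                      # partition of X
q = [0.9, 0.1]                            # nonstationary coarse-grained law on A

pbar0 = {}
for a, qa in zip(A, q):
    pa = sum(p0[x] for x in a)            # p0(a), the stationary weight of the grain
    for x in a:
        pbar0[x] = qa * p0[x] / pa

print(pbar0)  # grain {0,1} now carries total weight 0.9, grain {2,3} weight 0.1
```

The conditional ratios inside each grain are unchanged (here pbar0[1]/pbar0[0] = p0[1]/p0[0] = 3), which is exactly the "partial equilibrium" property.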

Lemma 4.1
The conditional probabilities of the process p̄ with the condition starting at time 0 are identical to the corresponding conditional probabilities of the process p: for n ≥ 1,

  p̄(x(n) | x[0, n - 1]) = p(x(n) | x[0, n - 1])

The proof is obvious using (4.1).

Entropy and relative entropy
If Z is a finite set, |Z| the number of its elements, and p, q probabilities on Z, we define the entropy of p and the relative entropy of p and q by the usual formulas [7]

  S(p) = -Σ_{z ∈ Z} p(z) ln p(z)   (4.6)

  S(p|q) = Σ_{z ∈ Z} p(z) ln (p(z)/q(z))   (4.7)

One has

  0 ≤ S(p) ≤ ln |Z|,  S(p|q) ≥ 0, and S(p|q) = 0 if and only if p = q.
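These definitions translate directly into code (a minimal sketch; the example distributions are invented):

```python
import math

def entropy(p):
    """S(p) = -sum p(z) ln p(z), with the convention 0 ln 0 = 0."""
    return -sum(pz * math.log(pz) for pz in p if pz > 0)

def rel_entropy(p, q):
    """S(p|q) = sum p(z) ln(p(z)/q(z)); requires q(z) > 0 wherever p(z) > 0."""
    return sum(pz * math.log(pz / qz) for pz, qz in zip(p, q) if pz > 0)

p = [0.5, 0.5]
u = [0.9, 0.1]
print(entropy(p))                 # ln 2 = 0.693..., the maximum ln|Z| for |Z| = 2
print(rel_entropy(p, p))          # 0.0: the relative entropy vanishes iff p = q
print(rel_entropy(p, u) >= 0.0)   # True: the relative entropy is nonnegative
```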

Path entropy
For the stationary process p, the nonstationary process p̄, and any positive integer n, we define the path entropy

  S_n(p) = S(p_[0,n-1]),  S_n(p̄) = S(p̄_[0,n-1])

Lemma 4.2 (a) One has the following identities:

  S_n(p) = S(p_0) + Σ_{k=1}^{n-1} d_k S_k(p)   (4.9)

where

  d_k S_k(p) = S_{k+1}(p) - S_k(p) = Σ_{x[0,k-1]} p(x[0, k - 1]) S(p(. | x[0, k - 1]))   (4.12)

and the same identities hold with p̄ instead of p (the analog of (4.12) for p̄ being (4.13)).
Proof (a) is trivial. On the other hand, one has

  S_{k+1}(p) - S_k(p) = -Σ_{x[0,k]} p(x[0, k]) ln p(x(k) | x[0, k - 1]) = Σ_{x[0,k-1]} p(x[0, k - 1]) S(p(. | x[0, k - 1]))

which is (4.12). Similarly, we derive (4.13) using Lemma 4.1.
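For a concrete check (a toy Markov chain, numbers invented for the example), the increments d_k S_k = S_{k+1} - S_k of the path entropy can be computed by enumerating paths; for a stationary Markov chain they are constant in k and equal to the familiar entropy rate -Σ μ(y) R_{yx} ln R_{yx}:

```python
import itertools, math

R = [[0.9, 0.1], [0.3, 0.7]]   # transition matrix on X = {0,1}, rows sum to 1
mu = [0.75, 0.25]              # its stationary law

def path_prob(path):
    p = mu[path[0]]
    for y, x in zip(path, path[1:]):
        p *= R[y][x]
    return p

def S(n):
    """Path entropy S_n: entropy of the distribution of n-step paths."""
    probs = [path_prob(w) for w in itertools.product((0, 1), repeat=n)]
    return -sum(p * math.log(p) for p in probs if p > 0)

dS = [S(n + 1) - S(n) for n in range(1, 5)]
rate = -sum(mu[y] * R[y][x] * math.log(R[y][x]) for y in (0, 1) for x in (0, 1))
print(dS)    # all increments are equal ...
print(rate)  # ... to the Markov entropy rate s(p)
```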

Lemma 4.3 (a) For the stationary process p, one has the identity

  d_k S_k(p) = Σ_{x[-k,-1]} p(x[-k, -1]) S(p(. | x[-k, -1]))

(b) d_k S_k(p) is a decreasing sequence with a limit s(p).

Proof Using the definition of d_k S (Eq. (4.12)), one has by stationarity of p

  d_k S_k(p) = Σ_{x[0,k-1]} p(x[0, k - 1]) S(p(. | x[0, k - 1])) = Σ_{x[-k,-1]} p(x[-k, -1]) S(p(. | x[-k, -1]))

The case of a stationary probability p
In this case, we will use a theorem that was first presented in Ref. [1], using the concept of martingale (see, for instance, [13,14], or [12] for a simplified definition). For completeness, the proofs of Theorems 4.4 and 4.5 are given in Appendix A.
Remark 3 In the special case where p comes from a dynamical system (Eq. (3.7)), it is proved in [3] that d_k S_k(p) is a decreasing sequence, and there is no need to use martingale theory.

Non-stationary probability p̄
In the nonstationary case, the production of entropy at time k is d_k S_k(p̄). The asymptotic entropy production of p̄ is well defined if d_k S_k(p̄) tends to a limit when k → ∞. As shown later, further hypotheses are necessary for such a limit to exist. In the general case, we can only prove that

  C d_k S_k(p) ≤ d_k S_k(p̄) ≤ C' d_k S_k(p)

so that

  C s(p) ≤ lim inf_k d_k S_k(p̄) ≤ lim sup_k d_k S_k(p̄) ≤ C' s(p)

where C (resp. C') is the lower (resp. upper) bound of q(a)/p(a) over a ∈ A.

Proof These inequalities straightforwardly result from Eqs. (4.1)-(4.2) and Theorem 4.4.

Production of entropy for a nonstationary distribution p̄

5.1 Mixing process
We say that the stationary process p is a mixing process if, for 0 < n < n + k and any fixed k,

  lim_{n→∞} p(x(0), 0; x[n, n + k]) = p(x(0)) p(x[n, n + k])   (5.1)

In particular, by Eq. (4.1), the distribution of p̄ at time n is

  p̄_n(x) = Σ_{a ∈ A} (q(a)/p(a)) Σ_{x(0) ∈ a} p(x(0), 0; x, n)   (5.3)

But p(x(0), 0; x, n) → p(x(0)) p(x) by the mixing property. As the sum over a ∈ A is finite, the limit in Eq. (5.3) exists, and it is p(x). The proof of the basic Theorem 5.2 is the consequence of successive partial results, which are postponed to Sect. 5.4 and completed in Appendix B.
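For an irreducible aperiodic Markov chain (toy example, numbers invented), the two-time joint law p(x(0), 0; x(n), n) = μ(x0) (R^n)_{x0 xn} indeed factorizes as n → ∞, which is exactly the mixing property used above:

```python
# Mixing for a Markov chain: the joint law mu[x0] * (R^n)[x0][xn]
# converges to the product mu[x0] * mu[xn] as n grows.

R = [[0.9, 0.1], [0.3, 0.7]]
mu = [0.75, 0.25]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(A, n):
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        P = mat_mul(P, A)
    return P

Rn = mat_pow(R, 50)
gap = max(abs(mu[x0] * Rn[x0][xn] - mu[x0] * mu[xn])
          for x0 in (0, 1) for xn in (0, 1))
print(gap)  # tiny: the joint law has factorized
```

The convergence is geometric here (governed by the second eigenvalue of R, 0.6), which is faster than mixing guarantees in general.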

5.2 Production of entropy for a mixing process p

Theorem 5.3 If p is a mixing process, one has

  lim_{n→∞} Σ p(x(0), x[n, n + k - 1]) S(p(. | x(0), x[n, n + k - 1])) = d_k S_k(p)   (5.4)

the sum being taken over x(0) ∈ X and over x[n, n + k - 1].

Proof The mixing property (5.1) implies that the conditional probability p_[0,n+k](x(n + k) | x(0), x[n, n + k - 1]) has a limit:

  lim_{n→∞} p(x(n + k) | x(0), x[n, n + k - 1]) = p(x(k) | x[0, k - 1])   (5.6)

where the limit is taken with a fixed k and fixed x(0), x(n), . . ., and all these quantities are uniformly bounded by max_{0≤α≤1} |α ln α|. As X is finite, we can sum (5.6) on x(n + k) and obtain

  S(p(. | x(0), x[n, n + k - 1])) → S(p(. | x[0, k - 1]))

while staying uniformly bounded. By the Lebesgue theorem of dominated convergence in L^1(X^Z, p), we have [15]

  E_p{S(p(. | x(0), x[n, n + k - 1]))} → E_p{S(p(. | x[0, k - 1]))}   (5.8)

where E_p{} is the mathematical expectation for the measure p. Now, the first term in Eq. (5.8) is the first member of (5.4), and the last member in Eq. (5.8) is, using (4.12), d_k S_k(p), which proves Eq. (5.4).
Theorem 5.4 If p is a mixing process, for any probability q on A and for the associated process p̄ defined in Sect. 4, one has

  lim_{n→∞} Σ p̄(x(0), x[n, n + k - 1]) S(p̄(. | x(0), x[n, n + k - 1])) = d_k S_k(p)   (5.9)

where k is fixed and the sum is over x(0) ∈ X and x[n, n + k - 1].

Proof Using the definition of p̄ and Lemma 4.1, one has

  p̄(x(0), x[n, n + k]) = Σ_{a ∈ A} (q(a)/p(a)) 1_a(x(0)) p(x(0), x[n, n + k])   (5.10)

where 1_a is the characteristic function of the subset a of X. By the mixing property Eq. (5.1) and the Lebesgue theorem of dominated convergence, the conditional probabilities p̄(x(n + k) | x(0), x[n, n + k - 1]) tend to p(x(k) | x[0, k - 1]), and the limit of the sum is d_k S_k(p), where we have used (4.12). So, by Eq. (5.10), we have proved Eq. (5.9).

5.4 Proof of Theorem 5.2
We first derive several successive lemmas.
Lemma 5.5 Let q(x, y, z) be a probability distribution of three variables x, y, z taking discrete values, q(x) and q(x, y) the corresponding marginal laws, q(z|x) and q(z|x, y) the corresponding conditional laws of z. Denote by S_Z the entropy of the probability distribution of z. Then, we have the identity

  Σ_x q(x) S_Z(q(. | x)) - Σ_{x,y} q(x, y) S_Z(q(. | x, y)) = Σ_{x,y} q(x, y) S_Z(q(. | x, y) | q(. | x)) ≥ 0   (5.12)

Proof We apply the definitions to the first member to obtain the identity (5.12).

Lemma 5.6 One has the identity

  Σ p̄(x[0, n + k - 1]) S(p̄(. | x[0, n + k - 1]) | p̄(. | x(0), x[n, n + k - 1]))
    = Σ p̄(x(0), x[n, n + k - 1]) S(p̄(. | x(0), x[n, n + k - 1])) - Σ p̄(x[0, n + k - 1]) S(p̄(. | x[0, n + k - 1]))   (5.13)

where each summation is over the variables appearing in the concerned probabilities. For instance, the first sum on the right is over x(0), x(n), . . ., x(n + k - 1).

Proof of Lemma 5.6 We apply Lemma 5.5 to q = p̄ with the substitutions x → (x(0), x[n, n + k - 1]), y → x[1, n - 1], z → x(n + k).

Lemma 5.7 We have

  lim_{k→∞} lim_{n→∞} Σ p̄(x[0, n + k - 1]) S(p̄(. | x[0, n + k - 1]) | p̄(. | x(0), x[n, n + k - 1])) = 0

This lemma means that the first member of (5.13) tends to 0 when n and k tend to infinity, which may seem intuitive from the definition of mixing. However, a rigorous proof of Lemma 5.7 requires several further steps, as shown in Appendix B. It allows one to complete the proof of the basic Theorem 5.2.
End of the proof of Theorem 5.2 We start with the identity (5.13) of Lemma 5.6. The second term of the right member of this identity is just d_{n+k} S_{n+k}(p̄). The limit when n → ∞ of the first term is, using Theorem 5.4, Eq. (5.9),

  lim_{n→∞} Σ p̄(x(0), x[n, n + k - 1]) S(p̄(. | x(0), x[n, n + k - 1])) = d_k S_k(p)   (5.16)

Now, in the identity (5.13), taking the limits when n → ∞ and then k → ∞, the first member tends to 0 by Lemma 5.7. Taking the same limits in Eq. (5.16), its first term tends to s(p). So, the second term of Eq. (5.13), d_{n+k} S_{n+k}(p̄), has a limit, which is s(p). Thus, we have proved that the nonstationary process p̄ has a production of entropy s(p̄) = s(p).

Markov approximations

6.1 The process p^(T) of memory T associated to p
In general, the process p has an infinite memory. Let T be a positive integer. We define a process p^(T) on X of memory T associated to the process p by the formulas

  p^(T)_[0,n](x[0, n]) = p(x[0, T - 1]) Π_{k=T}^{n} p(x(k) | x[k - T, k - 1]),  n ≥ T   (6.1)

Distance between p and p^(T). An asymmetric "distance" between p and p^(T) for n-step trajectories can be evaluated from the relative entropy of these two processes (see Sect. 6.4 below):

  S(p_[0,n] | p^(T)_[0,n]) = Σ_{x[0,n]} p(x[0, n]) ln (p(x[0, n]) / p^(T)(x[0, n]))   (6.2)

This quantity is related to the total variation distance between p_[0,n] and p^(T)_[0,n], as shown in Sect. 6.4.

Theorem 6.1 For every ε > 0, there exists a time T_ε such that for n ≥ T ≥ T_ε, one has

  (1/n) S(p_[0,n] | p^(T)_[0,n]) ≤ ε   (6.3)

So, the relative entropy per unit time between p_[0,n] and p^(T)_[0,n] tends to 0 when T → ∞.
Proof Using the definition of the relative entropy, Eq. (4.7), one has

  S(p_[0,n] | p^(T)_[0,n]) = -S(p_[0,n]) - Σ_{x[0,n]} p(x[0, n]) ln p^(T)(x[0, n])   (6.4)

On the other hand, it follows from Eq. (4.9) that

  -S(p_[0,n]) = -S(p_0) - Σ_{k=1}^{n} d_k S_k(p)

and by definition (6.1)

  -Σ_{x[0,n]} p(x[0, n]) ln p^(T)(x[0, n]) = S(p_[0,T-1]) - Σ_{k=T}^{n} Σ_{x[k-T,k]} p(x[k - T, k]) ln p(x(k) | x[k - T, k - 1])   (6.6)

By the stationarity of p, this is

  S(p_[0,T-1]) + (n - T + 1) d_T S_T(p)

From Eqs. (6.4), (6.6), and (6.3), we obtain

  S(p_[0,n] | p^(T)_[0,n]) = Σ_{k=T}^{n} (d_T S_T(p) - d_k S_k(p))   (6.7)

According to Theorem 4.4, d_k S_k(p) decreases when k increases, so each term of the sum in Eq. (6.7) is ≥ 0, and

  S(p_[0,n] | p^(T)_[0,n]) ≤ (n - T + 1)(d_T S_T(p) - s(p)) ≤ n ε  for T ≥ T_ε   (6.8)

which completes the proof of Theorem 6.1.
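A numerical illustration of the memory-T approximation (the hidden chain and its coarse-graining below are invented for the example): for a non-Markov process obtained by coarse-graining a three-state Markov chain, the relative entropy between the true n-step law and its memory-T version decreases when T grows:

```python
import itertools, math

# Non-Markov binary process: coarse-graining of a 3-state Markov chain.
# Observed symbol 0 <-> hidden state 0; symbol 1 <-> hidden states {1, 2}.
R = [[0.6, 0.2, 0.2],
     [0.1, 0.8, 0.1],
     [0.3, 0.1, 0.6]]

mu = [1.0 / 3.0] * 3                      # stationary law of R by power iteration
for _ in range(500):
    mu = [sum(mu[y] * R[y][x] for y in range(3)) for x in range(3)]

def observe(s):
    return 0 if s == 0 else 1

def p_path(obs):
    """Exact probability of an observed path (sum over compatible hidden paths)."""
    total = 0.0
    for hid in itertools.product(range(3), repeat=len(obs)):
        if any(observe(s) != o for s, o in zip(hid, obs)):
            continue
        pr = mu[hid[0]]
        for y, x in zip(hid, hid[1:]):
            pr *= R[y][x]
        total += pr
    return total

def p_T_path(obs, T):
    """Memory-T approximation: truncate each conditional to a window of length T."""
    pr = p_path(obs[:T])
    for n in range(T, len(obs)):
        w = obs[n - T:n]
        pr *= p_path(w + obs[n:n + 1]) / p_path(w)
    return pr

n = 6
def rel_ent(T):
    """Relative entropy between the true n-step law and its memory-T version."""
    return sum(p_path(w) * math.log(p_path(w) / p_T_path(w, T))
               for w in itertools.product((0, 1), repeat=n))

print(rel_ent(1), rel_ent(2))  # a longer memory gives a smaller "distance"
```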

6.2 Partial histories of length T
Definitions A partial history of length T is an element of X^T. The nth history of length T is

  x^(T)(n) = x[nT, (n + 1)T - 1]   (6.9)

If M and N are positive integers (M < N), a sequence of partial histories is

  x^(T)[M, N] = (x^(T)(M), x^(T)(M + 1), . . ., x^(T)(N))   (6.10)

We also denote by τ^(T) the translation of time T on histories of length T.

Theorem 6.2 (a) The process p^(T) induces a Markov process p̂^(T) on partial histories of length T by the formulas

  p̂^(T)(x^(T)[0, N]) = p^(T)(x[0, (N + 1)T - 1])   (6.12)

(b) The transition probabilities between histories of length T are

  R(x^(T)(n + 1) | x^(T)(n)) = p^(T)(x^(T)(n + 1) | x^(T)(n))   (6.13)

where, according to Eq. (6.10), x^(T)(n) = x[nT, (n + 1)T - 1]. The stationary probability of the Markov process p̂^(T) is p^(T)(x^(T)(0)).

(c) The production of entropy of p̂^(T) is

  s(p̂^(T)) = Σ_{x^(T)(0)} p^(T)(x^(T)(0)) S(R(. | x^(T)(0)))   (6.15)

where S(R(. | x^(T)(0))) is the entropy of the probability distribution x^(T)(1) → R(x^(T)(1) | x^(T)(0)).

Proof Using the definitions of Eqs. (6.13) and (6.1), we have

  p̂^(T)(x^(T)[0, N]) = p^(T)(x^(T)(0)) Π_{n=0}^{N-1} R(x^(T)(n + 1) | x^(T)(n))   (6.16)

We now show that p^(T)(x^(T)(0)) is the stationary probability. It will prove that (6.16) is the usual formula [16] for a Markov process (Eq. (3.4) in Sect. 3). One has

  Σ_{x^(T)(0)} p^(T)(x^(T)(0)) R(x^(T)(1) | x^(T)(0)) = Σ_{x^(T)(0)} p^(T)(x^(T)(0), x^(T)(1)) = p^(T)(x^(T)(1))

So, p^(T)(x^(T)(0)) is indeed the stationary probability of the Markov process p̂^(T). On the other hand, Eq. (6.15) is just the usual formula for the entropy production of a Markov process [16].
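A minimal sketch of this construction (the window-T conditionals below are invented for the example): grouping time into blocks of length T turns the memory-T process into an ordinary Markov chain on partial histories, and the lifted transition matrix is stochastic, i.e., its rows sum to 1:

```python
import itertools

# The memory-T process becomes a Markov chain when time is grouped into blocks
# of length T: the state is the partial history (x(nT), ..., x(nT + T - 1)).
# Sketch with T = 2 over X = {0, 1}; r[history][next] are invented conditionals.

T = 2
X = (0, 1)
r = {(0, 0): [0.7, 0.3], (0, 1): [0.4, 0.6],   # p(x(n+1) | last T symbols)
     (1, 0): [0.5, 0.5], (1, 1): [0.2, 0.8]}

def block_transition(h, h2):
    """Probability that the next length-T block is h2, given the current block h."""
    pr = 1.0
    window = h
    for sym in h2:
        pr *= r[window][sym]
        window = (window + (sym,))[-T:]   # slide the memory window
    return pr

# Each row of the lifted transition matrix sums to 1, as a Markov chain requires:
rows = [sum(block_transition(h, h2) for h2 in itertools.product(X, repeat=T))
        for h in itertools.product(X, repeat=T)]
print(rows)  # four row sums, each equal to 1 up to rounding
```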

6.3 Comparison of p and p̂^(T)
The process p induces a stationary process on the partial histories of length T (denoted p_T) by

  p_T(x^(T)[M, N]) = p(x[MT, (N + 1)T - 1])

The process p_T is exactly the same as p, except that it is restricted to times that are integer multiples of T. The entropy production of p_T is

  s(p_T) = T s(p)

So, one can rewrite Theorem 6.1 as follows.
Theorem 6.3 Denote by S^T(. | .) the relative entropy of two processes defined on histories of length T. Then, for any ε > 0, there exists T_ε, independent of N, such that for T ≥ T_ε one has

  (1/(NT)) S^T(p_T[0, N] | p̂^(T)[0, N]) ≤ ε   (6.20)

6.4 Distance between p_T and p̂^(T)
We can interpret this relation as follows. If p, q are probabilities on a finite space Z, the following Pinsker inequality [17] relates the relative entropy of p and q to the total variation distance of the distributions p and q:

  Σ_{z ∈ Z} |p(z) - q(z)| ≤ sqrt(2 S(p|q))

This shows that S(p|q) represents an asymmetrical distance between p and q. Equation (6.20) implies that the absolute distance between the actual process p_T and the Markov process p̂^(T), divided by T, goes to 0 for long times T.
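Pinsker's inequality can be spot-checked numerically (random distributions, purely illustrative; natural logarithms throughout, with total variation distance d_TV = ½ Σ|p - q|, so the inequality reads S(p|q) ≥ 2 d_TV²):

```python
import math, random

def rel_entropy(p, q):
    """S(p|q) with natural logarithms; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def rand_dist(rng, n):
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(1)
ok = all(rel_entropy(p, q) >= 2 * total_variation(p, q) ** 2
         for p, q in ((rand_dist(rng, 5), rand_dist(rng, 5)) for _ in range(1000)))
print(ok)  # True: Pinsker's inequality holds in every trial
```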

Theorem 6.4 One has for the production of entropy

  lim_{T→∞} (1/T) s(p̂^(T)) = s(p)

Proof We use the expression of the entropy production of a Markov process (Eq. (6.15)); by the stationarity of p and Eq. (4.12),

  s(p̂^(T)) = T d_T S_T(p)

However, by Theorem 4.5, d_T S_T(p) → s(p) if T → ∞, which gives Theorem 6.4.

6.5 Attenuation of the memory
We come back to the process p_T on histories of length T. We now prove

  lim_{T→∞} (1/T) Σ p(x[0, NT - 1]) S(p(. | x[0, NT - 1]) | p(. | x[(N - 1)T, NT - 1])) = 0   (6.22)

Proof We use the definition of relative entropy and decompose the sum of Eq. (6.22) into two terms:

  Σ p(x[(N - 1)T, NT - 1]) S(p(. | x[(N - 1)T, NT - 1]))   (6.23)

and

  Σ p(x[0, NT - 1]) S(p(. | x[0, NT - 1]))   (6.24)

where the conditioned distributions are those of the next block x[NT, (N + 1)T - 1]. By stationarity, both sums (6.23) and (6.24) contain T terms d_k S_k(p), which tend to s(p). This gives the result (6.22).
As a consequence, if T is large enough, it is possible, within a given accuracy ε, to neglect the distance between the process at time NT with complete history from time 0 and the process with history limited to the last period of length T, between times (N - 1)T and NT. In practice, one can neglect the memory after times larger than T.

Conclusion
It has been rigorously proved that coarse-graining dynamical systems induces new systems that partially approximate the original ones. This conclusion is often anticipated intuitively in modeling physical or applied phenomena, which most generally requires simplifying and approximating actual observations. Because of its importance, this question has long been the subject of many studies (see, for instance, [18] and references therein), but it is difficult to obtain both a general approach and exact results on dynamical problems. Recently, it has been shown that innovative concepts introduced by Kolmogorov some sixty years ago can be combined with martingale theory to yield novel results in this domain. At first, this point of view was applied to classical Hamiltonian systems [1], and a major result was that under appropriate, realistic conditions, coarse-grained systems generate an approximate Markov system. Here, we have seen that the same reasoning applies to much more general, possibly stochastic, processes. Using a purely mathematical formalism, we obtained new, more general conclusions.
In particular, we have proved that the Kolmogorov entropy, introduced by Kolmogorov for ergodic stationary processes [4], also exists for a class of nonstationary processes defined for coarse-grained systems: these processes are obtained by imposing a nonstationary coarse-grained initial probability distribution, whereas the initial conditional distribution remains stationary within each grain. Such nonstationary coarse-grained distributions can be adopted in realistic mesoscopic systems if they are initially constrained to nonequilibrium, while local equilibrium is almost instantaneously re-established: these approximations are often valid in realistic examples [10,19], which justifies studying this special class of nonstationary processes. Moreover, it has been proved that the asymptotic entropy production of these nonstationary processes is identical to the entropy production of the microscopic stationary process, provided the latter is mixing. This is our main result, which allows one to approximate a large class of coarse-grained dynamical processes by Markov processes.
Alternatively, within the framework of the previous general theory, a forthcoming article [20] will present further exact results concerning the comparison of different coarse-grainings of dynamical systems, which are of interest for modeling Markov and non-Markov processes.
Appendix A: Proofs of Theorems 4.4 and 4.5

Proof Consider the random variables

  π_k = p(x | x[-k, -1]) = E{1_{x(0)=x} | F_k}

where E{} is the mathematical expectation for the measure p and F_k is the σ-algebra generated by x[-k, -1]. So, π_k is a martingale with respect to the σ-algebras F_k, and by the convergence theorem of martingales [13], it converges almost surely to a limit π when k → ∞.
From the previous theorem, one can deduce the theorem of Kolmogorov-Shannon [4-6].

Proof (a) We use Eq. (2.13) and the stationarity of p. The martingale p(x(0) | x[-nk, -1]) is uniformly bounded and converges p-almost surely, and it is also integrable, so that the convergence theorem of martingales [13] applies. This concludes the proof of Theorem 5.2.

Theorem 4.4 For x = x(0) ∈ X, the sequence of random variables p(x | x[-k, -1]) is a martingale with respect to the sequence F_k of σ-algebras generated by x[-k, -1]. Moreover, these random variables are positively bounded by 1.

Using this theorem, Lemma 4.3(b), and the identity (4.10), one obtains the result of Kolmogorov-Shannon [4, 6]:

Theorem 4.5 For the stationary process p, one has
(a) d_k S_k(p) has a limit s(p) for k → ∞
(b) lim_{n→∞} (1/n) S_n(p) = s(p)   (4.14)

Definition s(p) is the (asymptotic) production of entropy per unit time of the process p.

Theorem 5.2 (The main theorem: entropy production of p̄ when p is mixing) Assume that p is a mixing process. Then
(a) d_n S_n(p̄) has the limit s(p)
(b) lim_{n→∞} (1/n) S_n(p̄) = s(p)

Thus p̄ has a well-defined production of entropy, which is the same as the entropy production of p.
