Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities.  Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.
The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of all events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequencies of the events converge to their probabilities uniformly over the entire class of events $S$.[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.
Definitions
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_{1},x_{2},\dots ,x_{m})$, where $x_{i}\in X$, the empirical frequency of $h\in H$ on $x$ is

$${\widehat {Q}}_{x}(h)={\frac {1}{m}}|\{i:1\leq i\leq m,\ h(x_{i})=1\}|.$$
The theoretical probability of $h\in H$ is defined as $Q_{P}(h)=P\{y\in X:h(y)=1\}.$
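For concreteness, the following minimal Python sketch (an illustration only; the threshold predicate, the cutoff 0.3, and the uniform distribution on $[0,1]$ are arbitrary choices, not part of the definition) computes an empirical frequency and compares it with the corresponding theoretical probability.

```python
import random

# A predicate h on X = [0, 1]: h(y) = 1 iff y <= 0.3 (an arbitrary threshold).
h = lambda y: 1 if y <= 0.3 else 0

# Theoretical probability Q_P(h) under the uniform distribution P on [0, 1].
Q_P = 0.3

# Empirical frequency of h on a sample x of m i.i.d. draws from P.
def empirical_frequency(h, x):
    return sum(h(xi) for xi in x) / len(x)

random.seed(0)
m = 1000
x = [random.random() for _ in range(m)]
print(empirical_frequency(h, x), "vs", Q_P)  # the two values should be close
```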
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.[2]
Here "simple" means that the Vapnik–Chervonenkis dimension of the class  is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
 is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.
The statement of the uniform convergence theorem is as follows:[3]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon >0$ and $m$ a positive integer, we have:

$$P^{m}\{|Q_{P}(h)-{\widehat {Q}}_{x}(h)|\geq \varepsilon {\text{ for some }}h\in H\}\leq 4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8},$$
- where, for any $x\in X^{m}$, $Q_{P}(h)=P\{y\in X:h(y)=1\}$, ${\widehat {Q}}_{x}(h)={\frac {1}{m}}|\{i:1\leq i\leq m,\ h(x_{i})=1\}|$ and $|x|=m$,
- and $P^{m}$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.
$\Pi _{H}$ is defined as: For any $\{0,1\}$-valued functions $H$ over $X$ and $D\subseteq X$,

$$\Pi _{H}(D)=\{h\cap D:h\in H\}.$$
And for any natural number $m$, the shattering number $\Pi _{H}(m)$ is defined as:

$$\Pi _{H}(m)=\max _{|D|=m}|\{h\cap D:h\in H\}|.$$
From the point of view of Learning Theory one can consider $H$ to be the Concept/Hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem we will state Sauer's Lemma, which we will need in the proof.
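To make the shattering number concrete, the following brute-force Python sketch (an illustration only; the class of threshold predicates $h_{a}(y)=1$ iff $y\leq a$ and the small finite ground set are arbitrary choices, and the maximum is taken only over subsets of that ground set) counts the distinct labelings $h\cap D$ induced on samples of size $m$.

```python
from itertools import combinations

def num_restrictions(predicates, D):
    """Number of distinct labelings (restrictions h ∩ D) that the predicates induce on D."""
    return len({tuple(h(d) for d in D) for h in predicates})

def shattering_number(predicates, points, m):
    """Pi_H(m) restricted to subsets D of the given finite ground set with |D| = m."""
    return max(num_restrictions(predicates, D) for D in combinations(points, m))

# Threshold predicates h_a(y) = 1 iff y <= a, for thresholds a on a small grid.
thresholds = [i / 10 for i in range(11)]
H = [lambda y, a=a: 1 if y <= a else 0 for a in thresholds]

points = [0.05, 0.15, 0.35, 0.55, 0.75, 0.95]
for m in range(1, 5):
    print(m, shattering_number(H, points, m))  # grows as m + 1, matching VC dimension 1
```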
Sauer–Shelah lemma
The Sauer–Shelah lemma[4] relates the shattering number $\Pi _{H}(m)$ to the VC Dimension.
Lemma: $\Pi _{H}(m)\leq \sum _{i=0}^{d}{\binom {m}{i}}$, where $d$ is the VC Dimension of the concept class $H$.
Corollary: $\Pi _{H}(m)\leq m^{d}+1$.
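For example, if $H$ is the class of indicator functions of closed intervals on the real line, then its VC dimension is $d=2$, and any $m$ distinct points receive exactly ${\binom {m}{2}}+m+1$ distinct labelings from $H$, so the bound of the lemma, $\sum _{i=0}^{2}{\binom {m}{i}}=1+m+{\binom {m}{2}}$, holds with equality in this case.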
The proof below follows [1] and [3]. Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview.
- Symmetrization: We transform the problem of analyzing $|Q_{P}(h)-{\widehat {Q}}_{x}(h)|\geq \varepsilon$ into the problem of analyzing $|{\widehat {Q}}_{r}(h)-{\widehat {Q}}_{s}(h)|\geq \varepsilon /2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_{P}(h)$.
- Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them will not change the probability distribution on $r$ and $s$. So, we will try to bound the probability of $|{\widehat {Q}}_{r}(h)-{\widehat {Q}}_{s}(h)|\geq \varepsilon /2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r\|s$. Specifically, we consider permutations $\sigma (x)$ which swap $x_{i}$ and $x_{m+i}$ in some subset of $\{1,2,\dots ,m\}$. The symbol $r\|s$ means the concatenation of $r$ and $s$.
- Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample and hence, if $H$ has finite VC Dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.
Symmetrization
Lemma: Let $V=\{x\in X^{m}:|Q_{P}(h)-{\widehat {Q}}_{x}(h)|\geq \varepsilon {\text{ for some }}h\in H\}$ and

$$R=\{(r,s)\in X^{m}\times X^{m}:|{\widehat {Q}}_{r}(h)-{\widehat {Q}}_{s}(h)|\geq \varepsilon /2{\text{ for some }}h\in H\}.$$

Then for $m\geq {\frac {2}{\varepsilon ^{2}}}$, $P^{m}(V)\leq 2P^{2m}(R)$.
Proof: 
By the triangle inequality, if $|Q_{P}(h)-{\widehat {Q}}_{r}(h)|\geq \varepsilon$ and $|Q_{P}(h)-{\widehat {Q}}_{s}(h)|\leq \varepsilon /2$, then $|{\widehat {Q}}_{r}(h)-{\widehat {Q}}_{s}(h)|\geq \varepsilon /2$.
Therefore,
$${\begin{aligned}&P^{2m}(R)\\[5pt]\geq {}&P^{2m}\{\exists h\in H,|Q_{P}(h)-{\widehat {Q}}_{r}(h)|\geq \varepsilon {\text{ and }}|Q_{P}(h)-{\widehat {Q}}_{s}(h)|\leq \varepsilon /2\}\\[5pt]={}&\int _{V}P^{m}\{s:\exists h\in H,|Q_{P}(h)-{\widehat {Q}}_{r}(h)|\geq \varepsilon {\text{ and }}|Q_{P}(h)-{\widehat {Q}}_{s}(h)|\leq \varepsilon /2\}\,dP^{m}(r)\\[5pt]={}&A\end{aligned}}$$
since $r$ and $s$ are independent.
Now for $r\in V$ fix an $h\in H$ such that $|Q_{P}(h)-{\widehat {Q}}_{r}(h)|\geq \varepsilon$. For this $h$, we shall show that

$$P^{m}\left\{s:|Q_{P}(h)-{\widehat {Q}}_{s}(h)|\leq {\frac {\varepsilon }{2}}\right\}\geq {\frac {1}{2}}.$$
Thus for any $r\in V$, $P^{m}\{s:\exists h\in H,\,|Q_{P}(h)-{\widehat {Q}}_{r}(h)|\geq \varepsilon {\text{ and }}|Q_{P}(h)-{\widehat {Q}}_{s}(h)|\leq \varepsilon /2\}\geq {\frac {1}{2}}$, and hence $A\geq {\frac {P^{m}(V)}{2}}$. Since $P^{2m}(R)\geq A$, this gives $P^{m}(V)\leq 2P^{2m}(R)$, completing the first step of the high-level idea.
Notice that $m\cdot {\widehat {Q}}_{s}(h)$ is a binomial random variable with expectation $m\cdot Q_{P}(h)$ and variance $m\cdot Q_{P}(h)(1-Q_{P}(h))$. By Chebyshev's inequality we get

$$P^{m}\left\{|Q_{P}(h)-{\widehat {Q}}_{s}(h)|>{\frac {\varepsilon }{2}}\right\}\leq {\frac {m\cdot Q_{P}(h)(1-Q_{P}(h))}{(\varepsilon m/2)^{2}}}\leq {\frac {1}{\varepsilon ^{2}m}}\leq {\frac {1}{2}}$$

for the mentioned bound on $m$. Here we use the fact that $x(1-x)\leq 1/4$ for $x\in [0,1]$.
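As a quick numerical sanity check of this step (a Monte-Carlo sketch only; the values $Q_{P}(h)=0.37$ and $\varepsilon =0.1$ are arbitrary stand-ins), one can estimate the probability that the empirical frequency of a fixed $h$ lands within $\varepsilon /2$ of $Q_{P}(h)$ once $m\geq 2/\varepsilon ^{2}$.

```python
import random

random.seed(1)
eps = 0.1
m = int(2 / eps**2)           # the sample-size requirement m >= 2 / eps^2
q = 0.37                      # hypothetical value of Q_P(h)

def within_half_eps_once():
    # m * Q_hat_s(h) is Binomial(m, q); check whether |q - Q_hat_s(h)| <= eps / 2.
    q_hat = sum(random.random() < q for _ in range(m)) / m
    return abs(q - q_hat) <= eps / 2

trials = 10000
freq = sum(within_half_eps_once() for _ in range(trials)) / trials
print(freq)   # should be at least 1/2 (in fact noticeably larger here)
```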
Permutations
Let $\Gamma _{m}$ be the set of all permutations of $\{1,2,\dots ,2m\}$ that swap $i$ and $m+i$ for each $i$ in some subset of $\{1,2,\dots ,m\}$.
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,
$$P^{2m}(R)=E[\Pr[\sigma (x)\in R]]\leq \max _{x\in X^{2m}}(\Pr[\sigma (x)\in R]),$$
where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma _{m}$.
Proof: 
For any $\sigma \in \Gamma _{m}$,

$$P^{2m}(R)=P^{2m}\{x:\sigma (x)\in R\}$$

(since coordinate permutations preserve the product distribution $P^{2m}$).
$${\begin{aligned}\therefore P^{2m}(R)={}&\int _{X^{2m}}1_{R}(x)\,dP^{2m}(x)\\[5pt]={}&{\frac {1}{|\Gamma _{m}|}}\sum _{\sigma \in \Gamma _{m}}\int _{X^{2m}}1_{R}(\sigma (x))\,dP^{2m}(x)\\[5pt]={}&\int _{X^{2m}}{\frac {1}{|\Gamma _{m}|}}\sum _{\sigma \in \Gamma _{m}}1_{R}(\sigma (x))\,dP^{2m}(x)\\[5pt]&{\text{(because }}|\Gamma _{m}|{\text{ is finite)}}\\[5pt]={}&\int _{X^{2m}}\Pr[\sigma (x)\in R]\,dP^{2m}(x)\quad {\text{(the expectation)}}\\[5pt]\leq {}&\max _{x\in X^{2m}}(\Pr[\sigma (x)\in R]).\end{aligned}}$$
The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
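A uniform element of $\Gamma _{m}$ can be drawn by deciding, independently and with probability $1/2$ for each index, whether to swap the two paired positions; the short sketch below (an illustration only, 0-indexed, with a stand-in joint sample) makes this explicit.

```python
import random

def random_swap_permutation(m):
    """A uniform element of Gamma_m: for each i in 0..m-1, swap positions i and m+i with probability 1/2."""
    perm = list(range(2 * m))
    for i in range(m):
        if random.random() < 0.5:
            perm[i], perm[m + i] = perm[m + i], perm[i]
    return perm

def apply_permutation(perm, x):
    return [x[p] for p in perm]

random.seed(2)
m = 5
x = list(range(2 * m))             # a stand-in joint sample r || s of length 2m
sigma = random_swap_permutation(m)
print(apply_permutation(sigma, x)) # positions i and m+i exchanged on a random subset of {0,...,m-1}
```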
Reduction to a finite class
Lemma: Based on the previous lemma,

$$\max _{x\in X^{2m}}(\Pr[\sigma (x)\in R])\leq 4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8}.$$
Proof:
Let us define $x=(x_{1},x_{2},\dots ,x_{2m})$ and $t=|H|_{x}|$, which is at most $\Pi _{H}(2m)$. This means there are functions $h_{1},h_{2},\dots ,h_{t}\in H$ such that for any $h\in H$ there is a $j$ between $1$ and $t$ with $h_{j}(x_{i})=h(x_{i})$ for $1\leq i\leq 2m$.

We see that $\sigma (x)\in R$ iff some $h$ in $H$ satisfies

$$\left|{\frac {1}{m}}\left(|\{1\leq i\leq m:h(x_{\sigma _{i}})=1\}|-|\{m+1\leq i\leq 2m:h(x_{\sigma _{i}})=1\}|\right)\right|\geq {\frac {\varepsilon }{2}}.$$

Hence, define $w_{i}^{j}=1$ if $h_{j}(x_{i})=1$ and $w_{i}^{j}=0$ otherwise.
With this notation, for $1\leq i\leq 2m$ and $1\leq j\leq t$, we have that $\sigma (x)\in R$ iff some $j$ in $\{1,\dots ,t\}$ satisfies $\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma _{i}}^{j}-\sum _{i}w_{\sigma _{m+i}}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}$. By the union bound we get
$$\Pr[\sigma (x)\in R]\leq t\cdot \max \left(\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma _{i}}^{j}-\sum _{i}w_{\sigma _{m+i}}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}\right]\right)$$
$$\leq \Pi _{H}(2m)\cdot \max \left(\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}w_{\sigma _{i}}^{j}-\sum _{i}w_{\sigma _{m+i}}^{j}\right)\right|\geq {\frac {\varepsilon }{2}}\right]\right).$$
Since the distribution over the permutations $\sigma$ is uniform, for each $i$, $w_{\sigma _{i}}^{j}-w_{\sigma _{m+i}}^{j}$ equals $\pm |w_{i}^{j}-w_{m+i}^{j}|$ with equal probability.
Thus,
$$\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}\left(w_{\sigma _{i}}^{j}-w_{\sigma _{m+i}}^{j}\right)\right)\right|\geq {\frac {\varepsilon }{2}}\right]=\Pr \left[\left|{\frac {1}{m}}\left(\sum _{i}|w_{i}^{j}-w_{m+i}^{j}|\beta _{i}\right)\right|\geq {\frac {\varepsilon }{2}}\right],$$
where the probability on the right is over $\beta _{i}\in \{-1,1\}$, and both possibilities are equally likely. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon ^{2}/8}$.
Finally, combining all three parts of the proof we get the Uniform Convergence Theorem.
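Explicitly, the symmetrization lemma, the permutation lemma, and the union/Hoeffding bound of the last part combine to give

$$P^{m}(V)\leq 2P^{2m}(R)\leq 2\max _{x\in X^{2m}}(\Pr[\sigma (x)\in R])\leq 2\cdot \Pi _{H}(2m)\cdot 2e^{-\varepsilon ^{2}m/8}=4\Pi _{H}(2m)e^{-\varepsilon ^{2}m/8},$$

which is the claimed bound.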
References
- ^ a b c Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025.
This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968.
The translation was reproduced as:
Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
- ^ "Martingales", Probability with Martingales, Cambridge University Press, pp. 93–105, 1991-02-14, doi:10.1017/cbo9780511813658.014, ISBN 978-0-521-40455-6, retrieved 2023-12-08
- ^ a b Anthony, Martin; Bartlett, Peter L. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press. ISBN 0-521-57353-X.
- ^ Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11