In information theory, Fano's inequality (also known as the Fano converse and the Fano lemma) relates the average information lost in a noisy channel to the probability of the categorization error.  It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.
It is used to find a lower bound on the error probability of any decoder, as well as lower bounds for minimax risks in density estimation.
Let the discrete random variables $X$ and $Y$ represent input and output messages with a joint probability $P(x, y)$. Let $e$ represent an occurrence of error; i.e., that $X \neq \tilde{X}$, with $\tilde{X} = f(Y)$ being an approximate version of $X$. Fano's inequality is

$$H(X \mid Y) \leq H_b(e) + P(e)\,\log\bigl(|\mathcal{X}| - 1\bigr),$$

where $\mathcal{X}$ denotes the support of $X$, $|\mathcal{X}|$ denotes the cardinality of (number of elements in) $\mathcal{X}$,

$$H(X \mid Y) = -\sum_{i,j} P(x_i, y_j)\,\log P(x_i \mid y_j)$$

is the conditional entropy,

$$P(e) = P(X \neq \tilde{X})$$

is the probability of the communication error, and

$$H_b(e) = -P(e)\log P(e) - \bigl(1 - P(e)\bigr)\log\bigl(1 - P(e)\bigr)$$

is the corresponding binary entropy.
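As a purely illustrative numerical check (the joint distribution, the use of base-2 logarithms, and the maximum-a-posteriori decoder below are assumptions chosen for this example, not part of the article), the following Python sketch computes $H(X \mid Y)$, $P(e)$, and $H_b(e)$ for a small channel and verifies Fano's inequality:

```python
import numpy as np

# Joint distribution P(x, y) for a small noisy channel: rows index x, columns index y.
# This particular matrix is an illustrative choice, not taken from the article.
P_xy = np.array([
    [0.30, 0.05, 0.05],
    [0.05, 0.25, 0.05],
    [0.02, 0.03, 0.20],
])
assert np.isclose(P_xy.sum(), 1.0)

log = np.log2  # entropies in bits

# Conditional entropy H(X | Y) = -sum_{x,y} P(x,y) log P(x|y)
P_y = P_xy.sum(axis=0)
P_x_given_y = P_xy / P_y
H_X_given_Y = -np.sum(P_xy * log(P_x_given_y))

# MAP decoder: x_hat(y) = argmax_x P(x|y); error probability P(e) = P(X != x_hat(Y))
x_hat = np.argmax(P_x_given_y, axis=0)
P_correct = sum(P_xy[x_hat[j], j] for j in range(P_xy.shape[1]))
P_e = 1.0 - P_correct

# Binary entropy of the error event and Fano's upper bound on H(X | Y)
H_b = 0.0 if P_e in (0.0, 1.0) else -P_e * log(P_e) - (1 - P_e) * log(1 - P_e)
fano_bound = H_b + P_e * log(P_xy.shape[0] - 1)

print(f"H(X|Y) = {H_X_given_Y:.4f} bits, Fano bound = {fano_bound:.4f} bits")
assert H_X_given_Y <= fano_bound + 1e-9
```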
Proof
Define an indicator random variable $E$ that indicates the event that our estimate $\tilde{X} = f(Y)$ is in error,

$$E := \begin{cases} 1 & \text{if } \tilde{X} \neq X, \\ 0 & \text{if } \tilde{X} = X. \end{cases}$$
Consider $H(E, X \mid \tilde{X})$. We can use the chain rule for entropies to expand this in two different ways:

$$H(E, X \mid \tilde{X}) = H(X \mid \tilde{X}) + \underbrace{H(E \mid X, \tilde{X})}_{=0} = H(E \mid \tilde{X}) + H(X \mid E, \tilde{X}),$$

where $H(E \mid X, \tilde{X}) = 0$ because $E$ is completely determined by $X$ and $\tilde{X}$. Equating the two,

$$H(X \mid \tilde{X}) = H(E \mid \tilde{X}) + H(X \mid E, \tilde{X}).$$
Expanding the rightmost term, $H(X \mid E, \tilde{X})$,

$$H(X \mid E, \tilde{X}) = H(X \mid E = 0, \tilde{X})\,P(E = 0) + H(X \mid E = 1, \tilde{X})\,P(E = 1).$$
Since $E = 0$ means $X = \tilde{X}$, being given the value of $\tilde{X}$ allows us to know the value of $X$ with certainty. This makes the term $H(X \mid E = 0, \tilde{X}) = 0$. On the other hand, $E = 1$ means that $X \neq \tilde{X}$, hence given the value of $\tilde{X}$, we can narrow down $X$ to one of $|\mathcal{X}| - 1$ different values, allowing us to upper bound the conditional entropy: $H(X \mid E = 1, \tilde{X}) \leq \log\bigl(|\mathcal{X}| - 1\bigr)$. Hence

$$H(X \mid E, \tilde{X}) \leq 0 \cdot P(E = 0) + \log\bigl(|\mathcal{X}| - 1\bigr)\,P(E = 1) = \log\bigl(|\mathcal{X}| - 1\bigr)\,P(e).$$
The other term satisfies $H(E \mid \tilde{X}) \leq H(E)$, because conditioning reduces entropy. Because of the way $E$ is defined, $H(E) = H_b(e)$, meaning that $H(E \mid \tilde{X}) \leq H_b(e)$. Putting it all together,

$$H(X \mid \tilde{X}) \leq H_b(e) + P(e)\,\log\bigl(|\mathcal{X}| - 1\bigr).$$
Because $X \to Y \to \tilde{X}$ is a Markov chain, we have $I(X; \tilde{X}) \leq I(X; Y)$ by the data processing inequality, and hence $H(X \mid \tilde{X}) \geq H(X \mid Y)$, giving us

$$H(X \mid Y) \leq H_b(e) + P(e)\,\log\bigl(|\mathcal{X}| - 1\bigr).$$
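As a concrete illustration (this example is not in the original text), consider a binary symmetric channel with crossover probability $p \leq 1/2$, a uniform input $X$, and the decoder $\tilde{X} = Y$. Then

$$P(e) = p, \qquad \log\bigl(|\mathcal{X}| - 1\bigr) = \log 1 = 0, \qquad H(X \mid Y) = H_b(p),$$

so Fano's inequality reads $H_b(p) \leq H_b(p) + p \cdot 0$ and holds with equality, showing that the bound can be tight.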
Intuition
Fano's inequality can be interpreted as a way of dividing the uncertainty of a conditional distribution into two questions given an arbitrary predictor. The first question, corresponding to the term $H_b(e)$, relates to the uncertainty of the predictor. If the prediction is correct, there is no more uncertainty remaining. If the prediction is incorrect, the uncertainty of any discrete distribution is upper bounded by the entropy of the uniform distribution over all choices besides the incorrect prediction. This has entropy $\log\bigl(|\mathcal{X}| - 1\bigr)$. Looking at extreme cases, if the predictor is always correct, the first and second terms of the inequality are 0, and the existence of a perfect predictor implies that $X$ is totally determined by $Y$, and so $H(X \mid Y) = 0$. If the predictor is always wrong, then the first term is 0, and $H(X \mid Y)$ can only be upper bounded by the entropy of a uniform distribution over the remaining choices.
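For a brief worked check of these two extreme cases (illustrative, not from the source): a predictor that is always correct has $P(e) = 0$, and one that is always wrong has $P(e) = 1$, so Fano's inequality specializes to

$$P(e) = 0:\quad H(X \mid Y) \leq H_b(0) + 0 \cdot \log\bigl(|\mathcal{X}| - 1\bigr) = 0,$$
$$P(e) = 1:\quad H(X \mid Y) \leq H_b(1) + 1 \cdot \log\bigl(|\mathcal{X}| - 1\bigr) = \log\bigl(|\mathcal{X}| - 1\bigr),$$

the second bound matching the entropy of a uniform distribution over the $|\mathcal{X}| - 1$ remaining choices.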
Alternative formulation

Let $X$ be a random variable with density equal to one of $r + 1$ possible densities $f_1, \ldots, f_{r+1}$. Furthermore, the Kullback–Leibler divergence between any pair of densities cannot be too large,

$$D_{\mathrm{KL}}(f_i \,\|\, f_j) \leq \beta \quad \text{for all } i \neq j.$$

Let $\psi(X) \in \{1, \ldots, r + 1\}$ be an estimate of the index. Then

$$\sup_i P_i\bigl(\psi(X) \neq i\bigr) \geq 1 - \frac{\beta + \log 2}{\log r},$$

where $P_i$ is the probability induced by $f_i$.
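The following Python sketch is an illustrative numerical instance of this bound (the family of nearly uniform densities, the alphabet size, and the perturbation size are assumptions made for the example, not taken from the article); it computes $\beta$ as the largest pairwise Kullback–Leibler divergence among $r + 1 = 11$ candidate densities and evaluates the resulting lower bound on the misidentification probability:

```python
import numpy as np

# r + 1 candidate discrete densities over m symbols: each f_i boosts the mass of
# symbol i slightly above uniform. The construction, m, and delta are illustrative
# assumptions chosen for this example.
m = 11        # alphabet size, giving r + 1 = 11 candidate densities (so r = 10)
delta = 0.5   # perturbation size; controls how close the candidate densities are

densities = []
for i in range(m):
    f = np.full(m, (1.0 - delta / (m - 1)) / m)
    f[i] = (1.0 + delta) / m
    densities.append(f)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete densities."""
    return float(np.sum(p * np.log(p / q)))

# beta: the largest pairwise KL divergence among the candidates
beta = max(kl(p, q) for i, p in enumerate(densities)
                    for j, q in enumerate(densities) if i != j)

r = m - 1
lower_bound = 1.0 - (beta + np.log(2.0)) / np.log(r)

print(f"beta = {beta:.4f} nats")
print(f"any index estimate is wrong with probability >= {lower_bound:.3f} "
      f"under at least one of the {m} candidate densities")
```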
Generalization
The following generalization is due to Ibragimov and Khasminskii (1979), Assouad and Birge (1983).
Let $F$ be a class of densities with a subclass of $r + 1$ densities $f_\theta$ such that for any $\theta \neq \theta'$,

$$\|f_\theta - f_{\theta'}\|_{L_1} \geq \alpha,$$
$$D_{\mathrm{KL}}(f_\theta \,\|\, f_{\theta'}) \leq \beta.$$

Then in the worst case the expected error of estimation is bounded from below,

$$\sup_{f \in F} \mathbb{E}\,\|f_n - f\|_{L_1} \geq \frac{\alpha}{2}\left(1 - \frac{n\beta + \log 2}{\log r}\right),$$

where $f_n$ is any density estimator based on a sample of size $n$.
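As an illustrative consequence of this bound (not stated in the source): if the subclass can be chosen so that $n\beta \leq \tfrac{1}{2}\log r - \log 2$, then

$$\sup_{f \in F} \mathbb{E}\,\|f_n - f\|_{L_1} \geq \frac{\alpha}{2}\left(1 - \frac{n\beta + \log 2}{\log r}\right) \geq \frac{\alpha}{4}.$$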
References
- P. Assouad, "Deux remarques sur l'estimation", Comptes Rendus de l'Académie des Sciences de Paris, Vol. 296, pp. 1021–1024, 1983.
- L. Birge, "Estimating a density under order restrictions: nonasymptotic minimax risk", Technical report, UER de Sciences Économiques, Universite Paris X, Nanterre, France, 1983.
- T. Cover, J. Thomas (1991). Elements of Information Theory. pp. 38–42. ISBN 978-0-471-06259-2.
- L. Devroye, A Course in Density Estimation. Progress in probability and statistics, Vol 14. Boston, Birkhauser, 1987. ISBN 0-8176-3365-0, ISBN 3-7643-3365-0.
- Fano, Robert (1968). Transmission of information: a statistical theory of communications. Cambridge, Mass: MIT Press. ISBN 978-0-262-56169-3. OCLC 804123877.
- also: Cambridge, Massachusetts, M.I.T. Press, 1961. ISBN 0-262-06001-9
- R. Fano, "Fano inequality", Scholarpedia, 2008.
- I. A. Ibragimov, R. Z. Has′minskii, Statistical estimation, asymptotic theory. Applications of Mathematics, vol. 16, Springer-Verlag, New York, 1981. ISBN 0-387-90523-5