8
H. S. Seung, M. Opper, and H. Sompolinsky, in Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (COLT '92), Pittsburgh, 1992 (ACM, New York, 1992), pp. 287–294.
9
Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, in Advances in Neural Information Processing Systems 5, edited by S. J. Hanson, J. D. Cowan, and C. L. Giles (Morgan Kaufmann, San Mateo, CA, 1993), pp. 483–490.
20
As a notational shorthand, we assume that in all probability distributions in which Θ(p) appears, the number of examples p is held fixed, without writing this explicitly. Thus, for example, P(Θ(p)|V) should strictly be written as P(Θ(p)|V,p); hence, it is normalized to 1 when integrating over all possible training sets of size p. To make this convention consistent with the use of Bayes' theorem as in (\ref{PVthn}), we also make the natural assumption that the number of training examples is independent of the teacher rule that we are trying to learn. Thus, P(p|V) = P(p) and hence P(V|p) = P(V), so that we only need one a priori teacher distribution for all values of p.
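The step from P(p|V) = P(p) to a p-independent prior can be written out explicitly. The following is a sketch only: the superscript notation $\Theta^{(p)}$ and the integration variable $V'$ are assumptions introduced here, and the second line is the generic form of Bayes' theorem rather than a quotation of the equation referred to above.

```latex
\begin{align}
P(V|p) &= \frac{P(p|V)\,P(V)}{P(p)} = P(V), \\
P(V|\Theta^{(p)},p) &= \frac{P(\Theta^{(p)}|V,p)\,P(V|p)}{\int dV'\,P(\Theta^{(p)}|V',p)\,P(V'|p)}
 = \frac{P(\Theta^{(p)}|V,p)\,P(V)}{\int dV'\,P(\Theta^{(p)}|V',p)\,P(V')}.
\end{align}
```

With the explicit p dropped, as in the shorthand above, the last expression contains only the single prior P(V) for every training set size.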
21
If there is a continuum of teachers V, P(V|Θ(p)) is a probability density which has the dimension of the inverse of V. Strictly speaking, a dimensional normalizing constant is then necessary to make the argument of the logarithm in (\ref{sv}) dimensionless, but we shall not write this explicitly since it cancels from the entropy differences we will be concerned with.
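The cancellation can be checked in one line. This is a sketch under stated assumptions: $v_0$ is a hypothetical name for the dimensional constant, $\Theta^{(p)}$ is an assumed rendering of the training set symbol, and the entropy is taken to have the generic form $-\int dV\,P\ln P$, which may differ in detail from the definition referred to above.

```latex
\begin{align}
S[P] &= -\int dV\, P(V|\Theta^{(p)}) \ln\!\left[ v_0\, P(V|\Theta^{(p)}) \right]
      = -\int dV\, P(V|\Theta^{(p)}) \ln P(V|\Theta^{(p)}) - \ln v_0, \\
S[P_1] - S[P_2] &= -\int dV\, P_1 \ln P_1 + \int dV\, P_2 \ln P_2 .
\end{align}
```

The term $\ln v_0$ is the same for any normalized density, so it drops out of every difference of two such entropies.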
26
The divergence as $T \to 0$ of the term $(N/2) \ln T$ in the student space entropy (\ref{linpercsnthn}) does not present a problem here, since we will only be concerned with entropy differences for which this term is irrelevant.
28
Strictly speaking, Krogh and Hertz \cite{Kroghetal92} consider a Gaussian distribution for the inputs instead of the spherical distribution (\ref{xspher}), but in the limit $N \to \infty$ the two distributions produce identical results, as can be checked by a direct calculation of the average eigenvalue spectrum of $M_V$ along the lines of Ref. \cite{Kinzeletal91}.
30
W. Feller, Introduction to Probability Theory and Its Applications, 3rd ed. (Wiley, New York, 1970), Vol. 1.
31
For finite but large N, this expression can be estimated to be valid for values of α much smaller than ln N, from results for the mean waiting time in the "collector's problem" (see, e.g., Ref. \cite{Feller70}); this ensures that the relative decrease [α+O(1)]/4N is always smaller than 1, as it has to be.
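The ln N scale quoted here can be illustrated with the standard collector's problem: drawing uniformly with replacement from N items, the mean number of draws needed to see every item is N H_N, where H_N is the N-th harmonic number, which grows like N ln N. The following Python sketch compares that exact mean with a small simulation. The function names, trial count, and seed are illustrative choices made here, and how the waiting time maps onto α in the calculation above is not spelled out in this note.

```python
import random


def expected_collector_time(n):
    """Exact mean number of uniform draws needed to see all n items: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def simulate_collector_time(n, rng):
    """Draw uniformly with replacement until every one of the n items has appeared."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


def mean_simulated_time(n, trials=2000, seed=0):
    """Average the simulated waiting time over independent trials."""
    rng = random.Random(seed)
    return sum(simulate_collector_time(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact = expected_collector_time(n)
        # Dividing by n expresses the waiting time in units of N, i.e. on the scale of ln N.
        print(n, exact / n, mean_simulated_time(n, trials=200) / n)
```

Dividing the waiting time by N gives H_N, which is ln N plus a constant of order one for large N, so values of α well below ln N correspond to a regime in which not all items have yet been collected on average.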