Learning from examples in feedforward neural networks is studied within a statistical-mechanical framework. Training is assumed to be stochastic, leading to a Gibbs distribution of networks characterized by a temperature parameter T. Learning of realizable rules as well as of unrealizable rules is considered. In the latter case, the target rule cannot be perfectly realized by a network of the given architecture. Two useful approximate theories of learning from examples are studied: the high-temperature limit and the annealed approximation. Exact treatment of the quenched disorder generated by the random sampling of the examples leads to the use of the replica theory. Of primary interest is the generalization curve, namely, the average generalization error εg versus the number of examples P used for training. The theory implies that, for a reduction in εg that remains finite in the large-N limit, P should generally scale as αN, where N is the number of independently adjustable weights in the network. We show that for smooth networks, i.e., those with continuously varying weights and smooth transfer functions, the generalization curve asymptotically obeys an inverse power law. In contrast, for nonsmooth networks other behaviors can appear, depending on the nature of the nonlinearities as well as the realizability of the rule. In particular, a discontinuous learning transition from a state of poor to a state of perfect generalization can occur in nonsmooth networks learning realizable rules.
We illustrate both gradual and continuous learning with a detailed analytical and numerical study of several single-layer perceptron models. Comparing with the exact replica theory of perceptron learning, we find that for realizable rules the high-temperature and annealed theories provide very good approximations to the generalization performance. Assuming this to hold for multilayer networks as well, we propose a classification of possible asymptotic forms of learning curves in general realizable models. For unrealizable rules we find that the above approximations fail in general to predict correctly the shapes of the generalization curves. Another indication of the important role of quenched disorder for unrealizable rules is that the generalization error is not necessarily a monotonically increasing function of temperature. Also, unrealizable rules can possess genuine spin-glass phases indicative of degenerate minima separated by high barriers.