Neural networks secretly perform classic statistical inference, study reveals
A new study reveals a deep link between neural network training and a classic statistical method. Researchers have shown that many modern machine learning models implicitly perform Expectation-Maximization (EM), a technique long used in probabilistic modelling. This connection helps explain why neural networks behave like inference engines—grouping data, quantifying uncertainty, and specialising in patterns—without being explicitly programmed to do so.
The relationship between neural networks and EM was first explored decades ago. In the late 1980s and 1990s, Geoffrey Hinton and collaborators demonstrated this link in models like Boltzmann machines and Helmholtz machines. Later, researchers such as Michael Jordan, Zoubin Ghahramani, and Radford Neal expanded the theory, showing how neural learning often mirrors EM in latent-variable models.
Alan Oursland’s recent work builds on these insights. His findings confirm that when neural networks minimise loss functions involving distances or energies, they implicitly carry out probabilistic inference. The forward pass acts like the E-step in EM, inferring hidden structure, while the backward pass resembles the M-step, refining model parameters.

The connection runs deeper than analogy. The gradient of the loss function with respect to each distance term corresponds exactly to the negative posterior responsibility of a component. In mathematical terms, ∂L/∂d_j = −r_j, meaning the optimisation process itself computes responsibilities; no separate inference algorithm is needed. This unifies seemingly distinct methods, from Gaussian Mixture Models to attention mechanisms and cross-entropy classification, under a single framework.

Traditional EM requires alternating between inferring latent variables and updating parameters. Neural networks, however, perform both steps simultaneously through gradient descent: the responsibilities, the weights assigned to different components, emerge directly from the gradients during training.
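To make the identity concrete, consider one common form such a loss can take: L = −log Σ_j π_j exp(d_j), where d_j is a per-component score (for instance a negated squared distance or energy) and π_j a mixing weight. The sketch below uses JAX automatic differentiation to check numerically that the gradient of this loss with respect to each d_j is the negative responsibility −r_j. The loss form and the function names are illustrative assumptions, not the study’s exact implementation.

```python
import jax.numpy as jnp
from jax import grad

def mixture_nll(d, log_pi):
    # Illustrative mixture loss: L = -log sum_j pi_j * exp(d_j),
    # where d_j is a per-component score (e.g. a negated squared distance).
    return -jnp.log(jnp.sum(jnp.exp(d + log_pi)))

def responsibilities(d, log_pi):
    # Posterior responsibility: r_j = pi_j exp(d_j) / sum_k pi_k exp(d_k)
    w = jnp.exp(d + log_pi)
    return w / jnp.sum(w)

# Toy check with three components and arbitrary scores.
d = jnp.array([0.3, -1.2, 0.8])               # per-component scores
log_pi = jnp.log(jnp.array([0.5, 0.3, 0.2]))  # mixing weights

g = grad(mixture_nll)(d, log_pi)              # dL/dd_j via autodiff
r = responsibilities(d, log_pi)

print(bool(jnp.allclose(g, -r)))  # True: the gradient equals the negative responsibility
```

Because the scores d_j themselves depend on the network’s parameters, backpropagating this gradient weights each parameter update by the corresponding responsibility, which is why a single gradient-descent step can play the role of both the E-step and the M-step.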
The discovery clarifies why neural networks exhibit behaviours like probabilistic inference, even when trained with standard backpropagation. By framing gradient descent as implicit EM, researchers can now interpret a wide range of learning procedures through a shared lens. This insight may lead to more principled designs for models that handle uncertainty, specialisation, and data grouping in a mathematically grounded way.