Machine Learning Times
Machine Learning Times
EXCLUSIVE HIGHLIGHTS
Three Best Practices for Unilever’s Global Analytics Initiatives
    This article from Morgan Vawter, Global Vice...
Getting Machine Learning Projects from Idea to Execution
 Originally published in Harvard Business Review Machine learning might...
Eric Siegel on Bloomberg Businessweek
  Listen to Eric Siegel, former Columbia University Professor,...
Effective Machine Learning Needs Leadership — Not AI Hype
 Originally published in BigThink, Feb 12, 2024.  Excerpted from The...
SHARE THIS:

3 years ago
Gradient Descent Models Are Kernel Machines (Deep Learning)

 
Originally published in infoproc.blogspot.com, Feb 7, 2021.

This paper shows that models which result from gradient descent training (e.g., deep neural nets) can be expressed as a weighted sum of similarity functions (kernels) which measure the similarity of a given instance to the examples used in training. The kernels are defined by the inner product of model gradients in the parameter space, integrated over the descent (learning) path.

Roughly speaking, two data points x and x’ are similar, i.e., have large kernel function K(x,x’), if they have similar effects on the model parameters in the gradient descent. With respect to the learning algorithm, x and x’ have similar information content. The learned model y = f(x) matches x to similar data points x_i: the resulting value y is simply a weighted (linear) sum of kernel values K(x,x_i).

This result makes it very clear that without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples. See note added at bottom for more on this point, re: AGI, etc. Given the complexity (e.g., dimensionality) of the ground truth model, one can place bounds on the amount of data required for successful training.

This formulation locates the nonlinearity of deep learning models in the kernel function. The superposition of kernels is entirely linear as long as the loss function is additive over training data.

To continue reading this article, click here.

One thought on “Gradient Descent Models Are Kernel Machines (Deep Learning)

  1. Pingback: Gradient Descent Models Are Kernel Machines (Deep Learning) « Machine Learning Times – NikolaNews

Leave a Reply