Machine Learning Times

2 weeks ago
Gradient Descent Models Are Kernel Machines (Deep Learning)

Originally published in, Feb 7, 2021.

This paper shows that models which result from gradient descent training (e.g., deep neural nets) can be expressed as a weighted sum of similarity functions (kernels) which measure the similarity of a given instance to the examples used in training. The kernels are defined by the inner product of model gradients in the parameter space, integrated over the descent (learning) path.
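The statement above can be written out as a short derivation. This is a sketch under stated assumptions rather than the paper's exact notation: it assumes gradient flow (the limit of an infinitesimal learning rate), a loss that is a sum over training examples, and the symbols w(t) for the parameter path, y_0 for the initial model output, and L'_i for the derivative of the loss on example i with respect to the model output.

```latex
% Gradient flow on an additive loss L = \sum_i L(y_i^*, y_i), with y_i = f(x_i; w):
\frac{dw}{dt} = -\sum_i L'_i(t)\, \nabla_w f(x_i; w(t))

% Chain rule for the output at an arbitrary query point x:
\frac{dy}{dt} = \nabla_w f(x; w(t)) \cdot \frac{dw}{dt}
             = -\sum_i L'_i(t)\, \nabla_w f(x; w(t)) \cdot \nabla_w f(x_i; w(t))

% Integrating along the learning path gives a kernel machine:
y = y_0 - \sum_i \int L'_i(t)\, \nabla_w f(x; w(t)) \cdot \nabla_w f(x_i; w(t))\, dt
  = \sum_i a_i\, K(x, x_i) + b

% with the path kernel, weights, and offset
K(x, x') = \int \nabla_w f(x; w(t)) \cdot \nabla_w f(x'; w(t))\, dt, \qquad
a_i = -\frac{\int L'_i(t)\, \nabla_w f(x; w(t)) \cdot \nabla_w f(x_i; w(t))\, dt}{K(x, x_i)}, \qquad
b = y_0
```

Note that in this form each weight a_i is a kernel-weighted average of the loss derivative along the path, so it depends on the query point x as well as on the training example x_i.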

Roughly speaking, two data points x and x’ are similar, i.e., have a large kernel value K(x,x’), if they have similar effects on the model parameters during gradient descent. With respect to the learning algorithm, x and x’ have similar information content. The learned model y = f(x) matches x to similar data points x_i: the resulting value y is simply a weighted (linear) sum of kernel values K(x,x_i).
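The weighted-sum claim can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's code: the tiny tanh network, the squared-error loss, the sin(3x) data, and the names `model`, `grad_model`, and `path_kernel_demo` are all assumptions made for illustration. It accumulates, at every gradient descent step, the inner product of model gradients at a query point and at each training point, weighted by the loss derivative, and compares the resulting sum with the trained network's actual output. Because the steps are discrete, the two agree only approximately; the match becomes exact in the limit of an infinitesimal learning rate.

```python
import numpy as np

def model(w, x):
    """Tiny net (illustrative): f(x) = sum_j v_j * tanh(u_j * x + c_j)."""
    u, c, v = np.split(w, 3)
    return np.tanh(np.outer(x, u) + c) @ v

def grad_model(w, x):
    """Gradient of f(x; w) w.r.t. all parameters, one row per input point."""
    u, c, v = np.split(w, 3)
    h = np.tanh(np.outer(x, u) + c)        # hidden activations, shape (n, H)
    dh = (1.0 - h**2) * v                  # v_j * tanh'(z_j), shape (n, H)
    # Column order matches the parameter layout [u, c, v].
    return np.hstack([dh * x[:, None], dh, h])

def path_kernel_demo(lr=2e-4, steps=5000, hidden=8, seed=0):
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, 10)
    y_train = np.sin(3.0 * x_train)        # assumed toy target
    x_query = np.array([-0.35, 0.6])       # points not in the training set

    w = rng.normal(size=3 * hidden)
    w[2 * hidden:] /= np.sqrt(hidden)      # scale output weights
    f0_query = model(w, x_query)           # initial model output (the offset b)

    # Running sum over steps of lr * L'_i * K_t(x, x_i), shape (queries, train).
    weighted_kernel_sum = np.zeros((len(x_query), len(x_train)))
    for _ in range(steps):
        g_train = grad_model(w, x_train)
        g_query = grad_model(w, x_query)
        residual = model(w, x_train) - y_train   # dL/df for 0.5 * (f - y)^2
        tangent_kernel = g_query @ g_train.T     # gradient inner products
        weighted_kernel_sum += lr * tangent_kernel * residual
        w = w - lr * (residual @ g_train)        # plain gradient descent step

    kernel_prediction = f0_query - weighted_kernel_sum.sum(axis=1)
    return kernel_prediction, model(w, x_query)

kernel_prediction, network_output = path_kernel_demo()
print("kernel expansion:", kernel_prediction)
print("trained network: ", network_output)
```

The nonlinearity lives entirely in `tangent_kernel`; the final prediction is a plain sum over training examples, which is the sense in which the trained model is a kernel machine.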

This result makes it very clear that, without regularity imposed by the ground truth mechanism which generates the actual data (e.g., some natural process), a neural net is unlikely to perform well on an example which deviates strongly (as defined by the kernel) from all training examples. See the note added at the bottom for more on this point (re: AGI, etc.). Given the complexity (e.g., dimensionality) of the ground truth model, one can place bounds on the amount of data required for successful training.

This formulation locates the nonlinearity of deep learning models in the kernel function. The superposition of kernels is entirely linear as long as the loss function is additive over training data.

