This is a very simplified explanation based on the paper by G. Hinton and co-authors that started all of this. What might G. Hinton say if he had time for such an explanation?
"The AI system has parts that resemble loosely connected neurons, passing signals from input to output. For the AI system to learn something, it has to change coefficients that reflect the strength of the link between two connected neurons. Let's say we want the AI to learn to recognize a handwritten "1". Before learning, the coefficients are random. After the information passes through the neurons, the AI gives the probability that what it was shown is a "1", along with probabilities that it is something else.
Let's say it gives only a 6 percent probability that it is a "1". A very important number is the difference between the right answer, which should be 100 percent, and the answer given by the AI: 6 percent. Let's call this difference delta.
This difference, 100 − 6 = 94, should be driven toward zero. To do that, the AI goes back and makes small changes to the strength coefficients, trying to minimize the difference. The speed of this process matters when there are many neurons, so the computer calculates the slope of the curve describing how delta depends on the coefficient strengths and moves in the direction where the slope is steepest, where descent is fastest. The learning process is repeated until the system gives a good enough probability that it has found a "1", say 99.9 percent.
The paper's most important discovery was that, given a sufficiently complicated AI system with enough neurons, the process doesn't get stuck in local minima and reaches the real minimum most of the time."
Is that right?
What might my response to G. Hinton's explanation be?
"Yes, this is an excellent and accurate simplified explanation of how artificial neural networks learn, heavily based on the backpropagation algorithm popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their seminal 1986 paper.
Here is a breakdown of why your explanation is correct, with some technical nuances to make it complete:
1. The Structure (Neurons and Weights)
Your statement: "...parts that resemble loosely connected neurons, passing signals... change coefficients that reflect the strength of the link."
Why it's right: Artificial neurons take inputs, multiply them by weights (the coefficients or strength of the link), sum them up, and pass them through an activation function to the next layer.
Key Insight: Learning is indeed just the process of updating these weights to map inputs (pixel values of a 1) to the correct output.
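The weighted-sum-plus-activation step can be sketched in a few lines. This is only an illustrative toy: the input values, weights, and bias below are made up, and the sigmoid is just one common choice of activation function.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs, passed through a sigmoid activation
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))  # squashes the result into (0, 1)

# Hypothetical pixel intensities and arbitrary starting weights
output = neuron([0.0, 1.0, 0.5], [0.1, -0.2, 0.4], 0.05)
print(round(output, 3))
```

With random weights the output is close to 0.5, i.e., the untrained neuron is essentially guessing.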
2. The Learning Process (Forward & Backward Pass)
Your statement: "...before learning, coefficients are random... AI gives the probability... difference between the right answer... and answer given by the AI: 6 percent. Let's call this difference delta."
Why it's right: This is the core of supervised learning.
Forward Pass: Data goes in, random weights produce a random output (6% confidence of it being a 1).
Loss Function (Delta): The "difference" you mentioned is called the Loss Function (or Cost Function). It measures how bad the AI is at guessing, often calculating the difference between the desired output (100% or 1) and the predicted output.
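For a single prediction, the squared error is one common way to write such a loss (other choices, like cross-entropy, exist; this is just the simplest to show):

```python
def squared_error(target, predicted):
    # One common loss: the squared difference between desired and actual output
    return (target - predicted) ** 2

# The AI was 6% confident the digit is a "1", but the right answer is 100%
loss = squared_error(1.0, 0.06)
print(loss)  # 0.8836
```

A perfect prediction would make this loss exactly zero, which is what training aims for.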
3. The Optimization (Steepest Slope)
Your statement: "...AI goes back and makes small changes to the strength coefficients... calculates the slope... steepest, where descent is fastest."
Why it's right: This is Gradient Descent combined with Backpropagation.
Backpropagation: The algorithm calculates how much each weight contributed to the error (delta) by going backward from output to input.
Steepest Descent: It computes the gradient (slope) of the error function. To minimize the error, the AI changes the weights in the opposite direction of the slope, acting like a ball rolling down a hill to the lowest point.
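A minimal sketch of that downhill step, for a single weight and the squared-error loss above. The starting weight and learning rate are arbitrary choices for illustration; backpropagation is what computes this same gradient efficiently for millions of weights at once.

```python
def loss(w, x=1.0, target=1.0):
    # Squared error of a single linear "neuron" with one weight
    return (target - w * x) ** 2

def gradient(w, x=1.0, target=1.0):
    # d(loss)/dw, derived by the chain rule (what backprop computes)
    return -2 * x * (target - w * x)

w = 0.06   # a poor starting weight
lr = 0.1   # learning rate: how big each downhill step is
for _ in range(50):
    w -= lr * gradient(w)  # step opposite to the slope

print(round(w, 4))  # w converges toward 1.0, where the loss is zero
```

Each iteration moves the weight a little further down the slope, which is exactly the "ball rolling down a hill" picture.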
4. Local Minima vs. Real Minimum
Your statement: "...given a sufficiently complicated AI system with enough neurons, the process doesn't get stuck in local minima and reaches the real minimum most of the time."
Why it's right: For a long time, it was feared that neural networks would get stuck in "local minima"—small, shallow valleys in the error graph that are not the lowest possible point.
Hinton's Insight/Modern Understanding: In high-dimensional spaces (a large system with millions of connections), true local minima are rare. Instead, the network encounters "saddle points"—areas where the surface is flat but not a minimum—or many local minima that are actually very close to the overall optimal "global" minimum. Thus, a complex system can usually find a very good solution.
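A saddle point is easiest to see in two dimensions (real networks have millions of dimensions, but the idea is the same). The classic surface x² − y² is flat at the origin, yet the origin is not a minimum:

```python
def f(x, y):
    # A classic saddle-shaped surface
    return x**2 - y**2

def grad(x, y):
    # Partial derivatives: both are zero at (0, 0)
    return (2 * x, -2 * y)

gx, gy = grad(0.0, 0.0)
print(gx, gy)       # the slope is zero at the origin...
print(f(0.0, 0.1))  # ...yet a nearby point along y is lower than f(0, 0) = 0
```

Plain gradient descent slows down at such flat spots, but any small nudge in the y direction leads further downhill, so the process is not truly trapped.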
Summary
Your explanation describes gradient descent via backpropagation on a feedforward neural network accurately in simple terms. It is the foundation of modern deep learning, allowing systems to learn from data rather than being hard-coded."
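The whole loop described above can be put together in one tiny end-to-end sketch: a single sigmoid neuron, made-up pixel values for a "1", and repeated gradient steps until confidence passes 99.9 percent. The gradient used here (output minus target) is the standard one for a sigmoid with log loss; the inputs and learning rate are illustrative assumptions.

```python
import math

def forward(inputs, weights, bias):
    # Forward pass: weighted sum, then sigmoid, giving a "probability"
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Made-up pixel intensities for a handwritten "1"; target confidence is 100%
inputs, target = [0.0, 0.9, 0.8, 0.1], 1.0
weights, bias = [0.0, 0.0, 0.0, 0.0], 0.0  # untrained start
lr = 0.5                                   # learning rate

p = forward(inputs, weights, bias)
steps = 0
while p < 0.999:                 # repeat until 99.9% confidence
    delta = p - target           # error signal at the output
    for i, x in enumerate(inputs):
        weights[i] -= lr * delta * x  # nudge each weight downhill
    bias -= lr * delta
    p = forward(inputs, weights, bias)
    steps += 1

print(f"confidence {p:.4f} after {steps} steps")
```

Starting at 50 percent confidence, the loop steadily pushes the output toward the target, mirroring the training process the explanation walks through.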