Why should weights of Neural Networks be initialized to random numbers?

Machine Learning · Neural Network · Artificial Intelligence · Mathematical Optimization · Gradient Descent

Machine Learning Problem Overview


I am trying to build a neural network from scratch. Across the AI literature there is a consensus that weights should be initialized to random numbers in order for the network to converge faster.

But why are a neural network's initial weights set to random numbers?

I had read somewhere that this is done to "break the symmetry" and this makes the neural network learn faster. How does breaking the symmetry make it learn faster?

Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (whether positive or negative) faster?

Is there some other underlying philosophy behind randomizing the weights apart from hoping that they would be near their optimum values when initialized?

Machine Learning Solutions


Solution 1 - Machine Learning

Breaking symmetry is essential here, and not for reasons of performance. Imagine the first 2 layers of a multilayer perceptron (the input and hidden layers):

[Figure: an input layer fully connected to a hidden layer]

During forward propagation, each unit in the hidden layer receives the signal:

    in_j = Σ_i w_ij · x_i

That is, each hidden unit gets the sum of the inputs multiplied by the corresponding weights.

Now imagine that you initialize all weights to the same value (e.g. zero or one). In this case, each hidden unit will receive exactly the same signal. E.g. if all weights are initialized to 1, each unit gets a signal equal to the sum of the inputs (and outputs sigmoid(sum(inputs))). If all weights are zeros, which is even worse, every hidden unit will get a zero signal. No matter what the input was - if all weights are the same, all units in the hidden layer will be the same too.

This is the main issue with symmetry, and the reason why you should initialize weights randomly (or at least with different values). Note that this issue affects all architectures that use fully connected (each-to-each) layers.
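A minimal NumPy sketch of this point (the layer sizes and inputs below are made up for illustration): with identical weights every hidden unit computes the same activation, while random weights let the units differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # an arbitrary input vector

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# All weights equal: every hidden unit receives the same signal.
W_same = np.ones((3, 4))            # 3 hidden units, 4 inputs
h_same = sigmoid(W_same @ x)        # all three activations are identical

# Random weights: the units receive different signals and can specialize.
W_rand = rng.normal(size=(3, 4))
h_rand = sigmoid(W_rand @ x)

print(h_same)   # three identical values
print(h_rand)   # three different values
```

Gradient updates preserve this symmetry, so the identically initialized units stay identical forever.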

Solution 2 - Machine Learning

Analogy:

Imagine that someone has dropped you from a helicopter onto an unknown mountain top and you're trapped there. Everything is covered in fog. The only thing you know is that you should get down to sea level somehow. Which direction should you take to get down to the lowest possible point?

If you couldn't find a way down to sea level, the helicopter would pick you up again and drop you at the same mountain-top position. You would take the same directions again, because you're "initializing" yourself to the same starting position.

However, each time the helicopter drops you somewhere random on the mountain, you take different directions and steps. So there is a better chance for you to reach the lowest possible point.

This is what is meant by breaking the symmetry. The initialization is asymmetric (i.e. different each time), so you can find different solutions to the same problem.

In this analogy, where you land corresponds to the weights. So, with different weights, there's a better chance of reaching the lowest (or a lower) point.

Also, randomness increases the variety of starting points explored, which helps the search find lower points (local or global minima).


Solution 3 - Machine Learning

The answer is pretty simple. The basic training algorithms are greedy in nature - they do not find the global optimum, but rather the "nearest" local solution. As a result, starting from any fixed initialization biases your solution towards one particular set of weights. If you initialize randomly (and possibly many times), it is much less probable that you will get stuck in some weird part of the error surface.

The same argument applies to other algorithms that are not able to find a global optimum (k-means, EM, etc.), but does not apply to global optimization techniques (like the SMO algorithm for SVMs).

Solution 4 - Machine Learning

As you mentioned, the key point is breaking the symmetry. If you initialize all weights to zero, then all of the hidden neurons (units) in your neural network will do the exact same calculations. This is not something we desire, because we want different hidden units to compute different functions. However, that is not possible if you initialize them all to the same value.

Solution 5 - Machine Learning

> 1. Wouldn't initializing the weights to 0 be a better idea? That way the weights would be able to find their values (whether positive or negative) faster?
>
> 2. How does breaking the symmetry make it learn faster?

If you initialize all the weights to zero, then all the neurons of all the layers perform the same calculation, giving the same output and thereby making the whole deep net useless. If the weights are zero, the complexity of the whole deep net is the same as that of a single neuron, and the predictions are nothing better than random.

Nodes that are side-by-side in a hidden layer connected to the same inputs must have different weights for the learning algorithm to update the weights.

By making the weights non-zero (but close to 0, like 0.1), the algorithm can update the weights in the next iterations and won't get stuck. This is how the symmetry is broken.

> 3. Is there some other underlying philosophy behind randomizing the weights apart from hoping that they would be near their optimum values when initialized?

Stochastic optimization algorithms such as stochastic gradient descent use randomness in selecting a starting point for the search and in the progression of the search.

The progression of the search, or the learning of a neural network, is known as convergence. Settling on a sub-optimal solution or local optimum results in premature convergence.

Instead of relying on one local optimum: if you run your algorithm multiple times with different random weights, there is a better chance of finding the global optimum without getting stuck at a local optimum.

Post-2015, thanks to advancements in machine learning research, He et al. initialization was introduced to replace naive random initialization:

```python
w = np.random.randn(layer_size[l], layer_size[l - 1]) * np.sqrt(2 / layer_size[l - 1])
```

The weights are still random but differ in range depending on the size of the previous layer of neurons.
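A short sketch of that dependence on the previous layer's size (the layer widths below are hypothetical): He initialization scales a standard-normal draw by sqrt(2 / fan_in), so wider previous layers yield smaller initial weights.

```python
import numpy as np

np.random.seed(0)
layer_size = [784, 256, 64]   # hypothetical layer widths

# He initialization for each layer l: scale a standard-normal draw by
# sqrt(2 / fan_in) so the activation variance stays roughly constant
# through ReLU layers.
weights = [
    np.random.randn(layer_size[l], layer_size[l - 1]) * np.sqrt(2 / layer_size[l - 1])
    for l in range(1, len(layer_size))
]

print([w.shape for w in weights])            # [(256, 784), (64, 256)]
print([round(w.std(), 3) for w in weights])  # std shrinks as fan_in grows
```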

In summary, non-zero random weights help us

  1. Escape local optima
  2. Break the symmetry
  3. Reach better optima in later iterations

Solution 6 - Machine Learning

Let's be more mathematical. In fact, the reason I'm answering is that I found this bit lacking in the other answers. Assume you have 2 layers. If we look at the back-propagation algorithm, the computation is:

dZ2 = A2 - Y

dW2 = (1/m) * dZ2 * A1.T

Let's ignore db2. (Sorry not sorry ;) )

dZ1 = W2.T * dZ2 .* g1'(Z1)

...

The problem is the W2 term: computing dZ1 (which is required to compute dW1) involves W2, which is 0. We never get a chance to change the weights to anything beyond 0, and we never will. So, essentially, the neural network does not learn anything. I'd argue it's worse than logistic regression (a single unit). In the case of logistic regression you still learn over more iterations, since you get different inputs through X. Here, the other layers always give the same output, so you don't learn at all.
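The equations above can be run numerically. In this sketch (the shapes and data are made up; the hidden activation is tanh, so g1'(Z1) = 1 - A1²), zero-initialized weights make both gradients vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))                    # 3 features, 5 examples
Y = (rng.random((1, 5)) > 0.5).astype(float)   # binary labels
m = X.shape[1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Zero initialization for both layers
W1 = np.zeros((4, 3))
W2 = np.zeros((1, 4))

# Forward pass: tanh hidden layer, sigmoid output
A1 = np.tanh(W1 @ X)          # all zeros, since W1 is zero
A2 = sigmoid(W2 @ A1)

# Backward pass, following the equations above
dZ2 = A2 - Y
dW2 = (1 / m) * dZ2 @ A1.T            # zero: A1 is all zeros
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # zero: W2 is all zeros
dW1 = (1 / m) * dZ1 @ X.T

print(dW1)   # all zeros: W1 can never move
```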

Solution 7 - Machine Learning

In addition to initializing with random values, the initial weights should not start with large values. This is because we often use the tanh and sigmoid functions in the hidden and output layers. If you look at the graphs of these two functions: when forward propagation in the first iteration produces large values, those values fall in the regions of the sigmoid and tanh curves where the derivative converges to zero. This leads to a cold start of the learning process and an increase in learning time. To avoid these problems, after drawing the weights at random, multiply them by a small factor such as 0.01 or 0.001.
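To illustrate that saturation effect with concrete numbers (the pre-activation values below are arbitrary examples): the sigmoid's derivative is sizeable near 0 but nearly vanishes for large inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

z_small = 0.5    # pre-activation produced by small initial weights
z_large = 20.0   # pre-activation produced by large initial weights

print(sigmoid_grad(z_small))   # ~0.235: a useful gradient
print(sigmoid_grad(z_large))   # ~2e-9: saturated, learning stalls
```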

Solution 8 - Machine Learning

I learned one thing: if you initialize the weights to zeros, it's obvious that the activation units in the same layer will be the same, meaning they'll have the same values. When you backprop, you will find that all the rows of the gradient dW are the same as well; hence all the rows of the weight matrix W are the same after gradient descent updates. In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, and you might as well be training a neural network with n[l] = 1 for every layer; the network is no more powerful than a linear classifier such as logistic regression. (From Andrew Ng's course.)

Solution 9 - Machine Learning

First of all, some algorithms converge even with zero initial weights. A simple example is a linear perceptron network. Of course, many learning networks require random initial weights (although this is not a guarantee of getting the fastest and best answer).

Neural networks use back-propagation to learn and to update weights, and the problem is that with this method, weights converge to a local optimum (a local minimum of the cost/loss), not the global optimum.

Random weighting helps the network explore different directions in the available space and gradually improve, arriving at a better answer rather than being limited to one direction or answer.

The image below shows a one-dimensional example of convergence. From the given initial location, a local optimum is reached but not the global optimum. In higher dimensions, random weighting increases the chance of starting in a better place, so that the weights converge to better values.

Image: https://i.stack.imgur.com/2dioT.png (Kalhor, A. (2020). Classification and Regression NNs. Lecture.)

In the simplest case, the new weight is as follows:

W_new = W_old - α * D_loss

Here the cost-function gradient, scaled by the learning rate α, is subtracted from the previous weight to get the new weight. If all the previous weights are equal, then after this step all the weights may still be equal. As a result, from a geometric point of view, the neural network leans in one direction and all weights stay the same. But if the weights differ, they can be updated by different amounts. (Each weight's impact on the result determines how it affects the cost and its own update, so even a somewhat unlucky initial random weighting can be corrected.)

This was a very simple example, but it shows the effect of random weight initialization on learning. It enables the neural network to explore different regions of the space instead of heading in only one direction. As a result, during learning, it can settle in the best of these regions.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
|---|---|---|
| Question | Shayan RC | View Question on Stackoverflow |
| Solution 1 - Machine Learning | ffriend | View Answer on Stackoverflow |
| Solution 2 - Machine Learning | Inanc Gumus | View Answer on Stackoverflow |
| Solution 3 - Machine Learning | lejlot | View Answer on Stackoverflow |
| Solution 4 - Machine Learning | Safak Ozdek | View Answer on Stackoverflow |
| Solution 5 - Machine Learning | Ravindra babu | View Answer on Stackoverflow |
| Solution 6 - Machine Learning | Muhammad Mubashirullah Durrani | View Answer on Stackoverflow |
| Solution 7 - Machine Learning | mustafamuratcoskun | View Answer on Stackoverflow |
| Solution 8 - Machine Learning | abdoulsn | View Answer on Stackoverflow |
| Solution 9 - Machine Learning | mohammad javad | View Answer on Stackoverflow |