In some cases, we need all the partial derivatives of a multi-variable function. If the function is scalar-valued (the usual case), the collection of its first partial derivatives is called the gradient. If it is a vector-valued multi-variable function, the collection of first partial derivatives is called the Jacobian matrix.
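To fix notation, these are the standard definitions I have in mind, for a scalar-valued f : ℝⁿ → ℝ and a vector-valued f : ℝⁿ → ℝᵐ:

```latex
% Gradient of a scalar-valued function f : R^n -> R
\nabla f(\mathbf{x}) =
\left[ \frac{\partial f}{\partial x_1}, \; \dots, \; \frac{\partial f}{\partial x_n} \right]^{\top}

% Jacobian of a vector-valued function f : R^n -> R^m
J_f(\mathbf{x}) =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
```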

In some other cases, we only need a single partial derivative, with respect to one specific variable.

Here is where my problem starts:

In neural networks, the gradient of the loss function with respect to an individual parameter (for example ∂L/∂w11, where w11 represents the first weight of the first layer) can, in my opinion, be computed directly using the chain rule, without explicitly relying on Jacobians. By tracing the dependencies of a single weight through the network, it is possible to compute its gradient step by step, because the functions inside the individual neurons are scalar functions, involving scalar relationships with individual parameters. There is no need to consider the full linear transformations across the layers.
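To make that dependency tracing concrete, here is how the chain looks for a first-layer weight in a two-layer network (my own notation, not from any particular source: superscripts index layers, a for activations, z for pre-activations, and the activation is assumed to act elementwise):

```latex
\frac{\partial L}{\partial w^{(1)}_{11}}
= \sum_{k}
\frac{\partial L}{\partial a^{(2)}_{k}} \cdot
\frac{\partial a^{(2)}_{k}}{\partial z^{(2)}_{k}} \cdot
\frac{\partial z^{(2)}_{k}}{\partial a^{(1)}_{1}} \cdot
\frac{\partial a^{(1)}_{1}}{\partial z^{(1)}_{1}} \cdot
\frac{\partial z^{(1)}_{1}}{\partial w^{(1)}_{11}}
```

The sum over k appears because a11 feeds into every unit of the second layer, so the weight influences the loss through all of those paths, but every factor in the product is still a scalar derivative.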

An example chain rule representation for a one-layer network:

∂L/∂w11 = ∂L/∂a11 · ∂a11/∂z11 · ∂z11/∂w11. The same approach can be applied to multi-layer networks.
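As a minimal runnable sketch of that single-weight computation (assuming, purely for illustration, a sigmoid activation, a squared-error loss, and NumPy; these specifics are my choices, not part of the question):

```python
import numpy as np

# Tiny one-layer network: z = w @ x + b, a = sigmoid(z), L = 0.5 * sum((a - y)^2)
# (sigmoid and squared error are assumptions made only for this illustration)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
w = rng.normal(size=(2, 3))     # weight matrix of the single layer
b = rng.normal(size=2)          # biases
y = rng.normal(size=2)          # target

# Forward pass
z = w @ x + b
a = sigmoid(z)
L = 0.5 * np.sum((a - y) ** 2)

# Scalar chain rule for the single weight w[0, 0] (i.e. w11):
# ∂L/∂w11 = ∂L/∂a11 * ∂a11/∂z11 * ∂z11/∂w11
dL_da1   = a[0] - y[0]           # derivative of the squared-error term
da1_dz1  = a[0] * (1.0 - a[0])   # sigmoid'(z1)
dz1_dw11 = x[0]                  # z1 = w11*x1 + w12*x2 + w13*x3 + b1
dL_dw11  = dL_da1 * da1_dz1 * dz1_dw11

# Finite-difference check of that one partial derivative
eps = 1e-6
w_pert = w.copy()
w_pert[0, 0] += eps
L_pert = 0.5 * np.sum((sigmoid(w_pert @ x + b) - y) ** 2)
print(dL_dw11, (L_pert - L) / eps)   # the two values should agree closely
```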

However, it is often noted that Jacobians are necessary when propagating gradients through entire layers or networks, because they compactly represent the relationship between the inputs and outputs of vector-valued functions. But this requires all of the partial derivatives, not just one.
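For comparison, here is the layer-level version of the same computation, under the same assumed sigmoid/squared-error setup as the sketch above. Because the activation acts elementwise, its Jacobian with respect to z is diagonal, and the backward pass applies the transposed weight matrix to the upstream gradient (a vector-Jacobian product) rather than materializing a full Jacobian:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w = rng.normal(size=(2, 3))
b = rng.normal(size=2)
y = rng.normal(size=2)

# Forward pass (same assumed setup as before)
z = w @ x + b
a = sigmoid(z)

# Upstream gradient ∂L/∂a, shared by every weight in the layer
dL_da = a - y

# The Jacobian of a with respect to z is diagonal (elementwise activation),
# so the Jacobian product collapses to an elementwise multiply:
delta = dL_da * a * (1.0 - a)    # ∂L/∂z, one value per output unit

# All partial derivatives of the layer at once:
dL_dw = np.outer(delta, x)       # ∂L/∂w[i, j] = delta[i] * x[j]
dL_dx = w.T @ delta              # gradient handed to the previous layer
                                 # (vector-Jacobian product with ∂z/∂x = w)

# dL_dw[0, 0] matches the hand-traced ∂L/∂w11 from the previous sketch;
# the layer-level view simply computes every entry in one shot.
print(dL_dw[0, 0])
```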

This raises my question: if it is possible to compute gradients directly for individual weights, why are Jacobians necessary in the chain rule used by backpropagation? Why do we need to compute all the partial derivatives at once?

I look forward to your responses. #DeepLearning #NeuralNetworks #MachineLearning #MachineLearningMathematics #DataScience #Mathematics
