# Why Mean Squared Error and L2 regularization? A probabilistic justification.

When you solve a regression problem with gradient descent, you’re minimizing some differentiable loss function. The most commonly used loss function is mean squared error (MSE, also called $\ell_2$ loss). Why? Here is a simple probabilistic justification, which also explains $\ell_1$ loss, as well as $\ell_1$ and $\ell_2$ regularization.
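To make the setup concrete, here is a minimal sketch of the situation described above: a linear model fit by plain gradient descent on the MSE. Everything in it (the function names `mse`, `mse_grad`, and `fit_gd`, the synthetic data, the learning rate, and the step count) is an illustrative assumption rather than part of the argument.

```python
import numpy as np

def mse(w, X, y):
    """Mean squared error of the linear model X @ w against targets y."""
    residual = X @ w - y
    return np.mean(residual ** 2)

def mse_grad(w, X, y):
    """Gradient of the MSE with respect to the weights w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def fit_gd(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the MSE, starting from zero weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * mse_grad(w, X, y)
    return w

# Synthetic data (an assumption for illustration): y = 0.5 + 2x + Gaussian noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(-1, 1, 200)])
true_w = np.array([0.5, 2.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w_hat = fit_gd(X, y)
print("estimated weights:", w_hat)
print("final MSE:", mse(w_hat, X, y))
```

On this toy problem the gradient-descent weights should agree with the closed-form least-squares solution, since the MSE of a linear model is convex in the weights.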