The non-normality of residuals can be justified through linearity and bias, as linearity also follows similar logic. Can you please shed some more light on this?
I am not certain I understand your question, but if your actual concern is about the usefulness of a regression model, you might use a "graphical residual analysis" to study model fit, and use some kind of cross-validation to determine if you have overfit to a particular sample.
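For illustration, here is a minimal sketch of both ideas on toy data (assuming statsmodels, scikit-learn, and matplotlib are available): a residuals-vs-fitted plot for the graphical residual analysis, and a 5-fold cross-validated R² as a check against overfitting:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Graphical residual analysis: residuals against fitted values
fit = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Cross-validation: out-of-sample R^2 over 5 folds
scores = cross_val_score(LinearRegression(), x.reshape(-1, 1), y,
                         cv=5, scoring="r2")
print(scores.mean())
```

A roughly structureless residual plot and out-of-sample R² close to the in-sample value are the kinds of patterns you would hope to see.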
There may be something in the following that you might use:
PS - For regressions of the form y = y* + e, which are most often of use in prediction for finite populations, the following might be of interest to you with regard to heteroscedasticity:
Because you still want to make prediction intervals and need the appropriate distribution to do that. See prediction intervals in least squares regression: https://www.google.com/search?q=prediction+intervals+linear+regression&oq=prediction+intervals&aqs=chrome.3.69i57j0l7.20176j1j8&sourceid=chrome&ie=UTF-8
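As a rough sketch of how such an interval is obtained in practice (toy data, statsmodels assumed), `get_prediction` returns both the interval for the mean and the wider interval for a new observation:

```python
import numpy as np
import statsmodels.api as sm

# Toy data for illustration only
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=50)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# 95% prediction interval for a *new observation* at x = 5
# (the row is [intercept, x])
x_new = np.array([[1.0, 5.0]])
pred = fit.get_prediction(x_new)
print(pred.summary_frame(alpha=0.05)[["mean", "obs_ci_lower", "obs_ci_upper"]])
```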
The regression model estimates the (conditional) mean. The mean is the expected value of the response variable. Look up what the "expected value" is and how it is defined.
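For reference, for a continuous response the definitions are

$$E[Y] = \int y\, f_Y(y)\,dy, \qquad E[Y \mid X = x] = \int y\, f_{Y\mid X}(y \mid x)\,dy,$$

and it is the second quantity, the conditional mean, that the regression model estimates.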
For many distributions it turns out that the maximum-likelihood estimate based on the observed data is an estimate of the expected value of the distribution, and that maximizing the likelihood leads to (almost) the same result as minimizing the squared error. If there is no predictor to condition on, this is simply the sample mean (and MLE and LSE coincide exactly). See here for an example using the Poisson distribution (note that lambda is the expected value):
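As a stand-in for that example (the sketch below assumes numpy and scipy and is not the linked code), you can check numerically that for Poisson data the MLE of lambda, the least-squares minimizer, and the sample mean all agree:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Simulated Poisson sample; lambda (here 3.5) is the expected value
rng = np.random.default_rng(2)
data = rng.poisson(lam=3.5, size=1000)

# Maximize the Poisson log-likelihood numerically
neg_loglik = lambda lam: -poisson.logpmf(data, lam).sum()
mle = minimize_scalar(neg_loglik, bounds=(0.01, 20), method="bounded").x

# Minimize the squared error numerically
sse = lambda lam: ((data - lam) ** 2).sum()
lse = minimize_scalar(sse, bounds=(0.01, 20), method="bounded").x

print(mle, lse, data.mean())  # all three are (essentially) equal
```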
The relevant difference is in the uncertainty associated with this estimate, and this is important if you want inferences like confidence or prediction intervals.
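Concretely, for ordinary least squares at a new point $x_0$, the confidence interval for the mean and the prediction interval for a new observation differ only by the extra 1 under the square root, which accounts for the variance of the single new error term:

$$\hat y_0 \pm t_{n-p,\,1-\alpha/2}\,\hat\sigma\sqrt{x_0^\top (X^\top X)^{-1} x_0} \qquad\text{vs.}\qquad \hat y_0 \pm t_{n-p,\,1-\alpha/2}\,\hat\sigma\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0},$$

where $\hat y_0 = x_0^\top \hat\beta$, $p$ is the number of estimated coefficients, and $\hat\sigma^2$ is the residual mean square.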