LS (actually MLS, Multiple Least Squares) finds a linear combination of all the original regressors (call them X) such that this linear combination is as close as possible to the given response variable (call it y). Hence, the only criterion, i.e. the function to be minimized, is the squared distance (the sum of squared residuals) between the predicted and the actual response. The problem with this approach lies in inverting the X'X matrix during LS: if the columns of X are strongly correlated, this matrix may be ill-conditioned, and/or the parameter estimates may have large variances. Hence MLR models are not robust when the number of regressors is high and these regressors are highly correlated.
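To make the ill-conditioning point concrete, here is a minimal sketch (the data and the 1e-3 noise level are invented purely for illustration) showing how two nearly collinear regressors blow up the condition number of X'X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # nearly a copy of x1: strongly correlated regressors
X = np.column_stack([x1, x2])

# The LS normal equations require inverting X'X; with near-collinear columns
# this matrix is ill-conditioned, so beta = (X'X)^{-1} X'y is numerically
# unstable and its entries have large variances.
print(np.linalg.cond(X.T @ X))        # prints a huge condition number
```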
PLS is one of the best alternatives to MLS for obtaining more robust linear models, particularly for prediction purposes. In PLS, minimizing the squared residuals is not the sole objective; the aim is also to find a SINGLE direction in the column space of X with the highest prediction capacity for y (for the time being, let us still assume that the response variable is one-dimensional). Actually, the NIPALS algorithm is the perfect way to understand what PLS does, but let me just try to summarize the logic behind the NIPALS algorithm here (a code sketch follows the list below):
1. In the column space of X, find a vector that is a linear combination of the regressor variables.
2. Let this vector (call it w1) explain as much of the variation in the regressor space as it can.
[This step is similar to PCA]
3. At the same time, let the linear correlation between the projection of X onto w1 (call this projection t1) and y be as high as it can be.
[This step is similar to MLR]
4. Once such a direction is found (the first latent component), compute the projections of X on this direction, remove these projected values from the X matrix, remove the corresponding predictions from the y vector, and go back to step one. Repeat the whole procedure. In this way you determine the second, third, etc. latent components.
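The steps above translate almost line by line into code. Below is a minimal sketch of the PLS1 (single y) NIPALS loop; the function name and the centering convention are my own choices, not part of any particular library:

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Minimal PLS1 (single y) via NIPALS-style deflation (a sketch, not a
    reference implementation). Returns scores T, weights W, loadings P and
    inner coefficients c, named as in the steps above."""
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float).copy()
    X -= X.mean(axis=0)                       # centre the regressors
    y -= y.mean()                             # centre the response
    n, m = X.shape
    T = np.zeros((n, n_components))
    W = np.zeros((m, n_components))
    P = np.zeros((m, n_components))
    c = np.zeros(n_components)
    for a in range(n_components):
        w = X.T @ y                           # steps 1-3: direction in X most covariant with y
        w /= np.linalg.norm(w)
        t = X @ w                             # scores t: projection of X onto w
        tt = t @ t
        p = X.T @ t / tt                      # X loadings: regress the columns of X on t
        c[a] = y @ t / tt                     # inner relation: regress y on t
        X -= np.outer(t, p)                   # step 4: deflate X
        y -= c[a] * t                         #         deflate y
        T[:, a], W[:, a], P[:, a] = t, w, p
    return T, W, P, c
```

To predict from new data one would additionally fold the components into a single regression vector (commonly written B = W (P'W)^(-1) c), but the loop above is enough to see how each latent component is extracted and removed.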
Note that MLR works only for a single-dimensional y. If y consists of multiple response variables (this time call it Y), PLS can also be used. In this case, similar to the direction representative of the bulk variation in X, a vector representative of the bulk variation in Y is found (this is called q1), and Y is projected onto q1 (the projections are called u1). Then a regression is done between t1 and u1 (see the sketch below).
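For completeness, here is how the same idea reads for multivariate Y (often called PLS2): each component alternates between the X side and the Y side until the scores stabilize, then u1 is regressed on t1 (the inner relation) and both blocks are deflated. This is only a sketch of one component, with t, u, w, q named as in the text above:

```python
import numpy as np

def pls2_first_component(X, Y, tol=1e-10, max_iter=500):
    """One NIPALS-style PLS2 component for multivariate Y (illustrative sketch)."""
    X = np.asarray(X, dtype=float).copy()
    Y = np.asarray(Y, dtype=float).copy()
    X -= X.mean(axis=0)
    Y -= Y.mean(axis=0)
    u = Y[:, np.argmax(Y.var(axis=0))].copy()    # start u from the most variable Y column
    for _ in range(max_iter):
        w = X.T @ u                              # X weights from the current u
        w /= np.linalg.norm(w)
        t = X @ w                                # X scores t1
        q = Y.T @ t                              # Y weights q1 (bulk direction in Y)
        q /= np.linalg.norm(q)
        u_new = Y @ q                            # Y scores u1
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    b = (u @ t) / (t @ t)                        # inner relation: regress u1 on t1
    p = X.T @ t / (t @ t)                        # X loadings used to deflate X
    X_defl = X - np.outer(t, p)                  # deflate X
    Y_defl = Y - b * np.outer(t, q)              # deflate Y through the inner relation
    return t, u, w, q, b, X_defl, Y_defl
```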
So, as a summary, PLS does not regress directly between X and Y, but between t1 and u1, which are the "scores", i.e. the projections along the most "important" directions in X and Y that also have the highest correlation among the possible directions. The advantage of PLS (besides the fact that it can handle multivariate response vectors simultaneously, which MLR cannot) is that the model obtained at the end of the PLS algorithm has a much smaller number of inner relations (the beta parameters in MLR), and hence their estimates have lower variance and the models are more robust compared to those obtained from MLR.
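If a ready-made implementation is preferred over rolling your own loop, scikit-learn's PLSRegression (in sklearn.cross_decomposition) handles a multivariate Y directly; the small dataset below is invented purely for illustration:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=50)    # deliberately collinear regressors
Y = np.column_stack([X @ rng.normal(size=6),      # two correlated responses
                     X @ rng.normal(size=6)]) + 0.1 * rng.normal(size=(50, 2))

pls = PLSRegression(n_components=2)               # number of latent components to keep
pls.fit(X, Y)
Y_hat = pls.predict(X)                            # predictions built from the latent components
print(pls.x_scores_.shape)                        # (50, 2): the t scores described above
```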