I have been pondering the relationship between these two important topics of our data-driven world for a while. I have bits and pieces, but I have been hoping to find a neat and systematic set of connections that would somehow (surprisingly) bind them and fill the empty spots I have drawn in my mind over the last few years.
Not so long ago, while I was dealing with a multi-class classification problem, I came to realize that running multiple binary classifications is a viable way to address it using error-correcting output coding (ECOC), a well-known coding technique in the literature whose construction requirements are a bit different from those of classical block or convolutional codes. I would like to remind you that grouping multiple classes into two superclasses (a.k.a. class binarization) can be done in various ways. You can group them totally at random, which does not depend on the problem at hand, or based on a set of problem-dependent constraints derived from the training data. The approach I like the most sits at the intersection of information theory and machine learning. To be more precise, class groupings can be chosen based on the resulting mutual information so as to maximise class separation, so that your binary classifiers are exposed to less noisy data and hopefully perform better. On the other hand, the ECOC framework calls for coding theory and efficient encoder/decoder architectures to handle the classification problem. The nature of the problem is not something we usually come across in communication theory and classical coding applications, though. Binarization of classes introduces noise and defect structures into the so-called "channel model" that are not common in classical communication scenarios. In other words, the solution itself changes the nature of the problem at hand. The choice of classifiers (margin-based, etc.) also affects the characterization of the noise that impacts the detection (classification) performance. I do not know if it is even possible to answer, but what is the capacity of such a channel? What is the best code structure that addresses these requirements?
Even more interestingly, can the recurrent issues of classification (such as overfitting) be solved with coding? Maybe we can maintain a trade-off between training and generalization errors with an appropriate coding strategy?
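To make the ECOC idea above concrete, here is a minimal sketch of the decoding side only. The 4-class, 7-bit code matrix is a hypothetical example I picked for illustration (it is not derived from any data or from a mutual-information criterion), and the "received" word stands in for the outputs of seven binary classifiers, one of which is assumed to be wrong.

```python
from itertools import combinations

# Hypothetical code matrix: one 7-bit codeword per class.
# Each column defines one binary problem (classes marked 1 vs. classes marked 0).
CODE_MATRIX = [
    [1, 1, 1, 1, 1, 1, 1],  # class 0
    [0, 0, 0, 0, 1, 1, 1],  # class 1
    [0, 0, 1, 1, 0, 0, 1],  # class 2
    [0, 1, 0, 1, 0, 1, 0],  # class 3
]


def hamming(a, b):
    """Number of positions in which two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(code):
    """Smallest pairwise Hamming distance between codewords."""
    return min(hamming(a, b) for a, b in combinations(code, 2))


def decode(bits, code=CODE_MATRIX):
    """Return the class whose codeword is closest to the classifier outputs."""
    distances = [hamming(bits, codeword) for codeword in code]
    return distances.index(min(distances))


# Simulate the seven binary classifiers voting for class 2, with one mistake.
received = list(CODE_MATRIX[2])
received[0] ^= 1  # the first binary classifier is wrong

print(min_distance(CODE_MATRIX))  # 4, so any single classifier error is corrected
print(decode(received))           # 2
```

Nearest-codeword decoding treats every binary classifier as equally reliable, which is exactly the kind of classical "channel" assumption that may not hold here; a soft-decision decoder that weighs classifier margins would be a natural variation.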
Similar trends can be observed in the estimation theory realm. Parameter estimation, or likewise "regression" (including model fitting, linear programming, density estimation, etc.), can be thought of as the problem of finding the "best parameters" or the "best fit", which is the ultimate target. The errors due to the methods used, the collected data, etc. are problem-specific and usually dependent. For instance, density estimation is a hard problem in itself, and kernel density estimation is one way to estimate probability density functions. Various kernels and data transformation techniques (such as Box-Cox) are used to normalize data, and new estimation methods are proposed to meet today's performance requirements. To measure how well we do, or how different two distributions are, we again resort to information theory tools (such as the Kullback–Leibler (KL) divergence and the Jensen-Shannon divergence) and use the concepts and techniques therein (entropy, etc.) from a machine learning perspective. Such an observation separates the typical problems posed in the communication theory arena from those in the machine learning arena, which require distinct and careful treatment.
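As a small illustration of the two divergences mentioned above, here is a sketch for discrete distributions given as lists of probabilities. The function names and the example distributions are my own choices for illustration. The KL function assumes that q is non-zero wherever p is non-zero; the Jensen-Shannon divergence avoids that restriction by comparing both distributions to their average.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and at most 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # average (mixture) distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example: an estimated distribution vs. a reference one.
estimated = [0.5, 0.5]
reference = [0.9, 0.1]

print(kl_divergence(estimated, reference))  # not symmetric in its arguments
print(kl_divergence(reference, estimated))
print(js_divergence(estimated, reference))  # same value if the arguments are swapped
```

The asymmetry of KL is worth keeping in mind when using it as a quality measure for a density estimate, since the result depends on which distribution is treated as the reference.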
Last but not least, I think there is a deep-rooted relationship between deep learning methods (and many machine learning methods per se) and the basic core concepts of information and coding theory. Since the deep learning hype began, I have seen many studies apply deep learning methods (autoencoders, etc.) to decoding specific codes (polar, turbo, LDPC, etc.), claiming efficiency, robustness, etc. thanks to the parallel implementation and model-deficit nature of neural networks. However, I am wondering about the other direction. I wonder if, say, back-propagation can be replaced with more reasonable and efficient techniques that are already well known in the information theory world. Perhaps rate-distortion theory has something to say about the optimal number of layers we ought to use in deep neural networks. Belief propagation, turbo equalization, list decoding, and many other known algorithms and models may well apply to known machine learning problems and may promise better and more efficient results in some cases. I know a few folks have already begun searching for neural-network-based encoder and decoder designs for feedback channels. In my opinion, there are many open problems about the explicit design of encoders and the use of the network without feedback. A few recent works have considered application areas such as molecular communications and coded computation, where a deep learning background can be applied to secure performance that cannot otherwise be achieved with classical methods.
In the end, I just wanted to toss out a few short notes here to instigate further discussion and thought. This interface will attract more attention as we see the connections more clearly and bring out new applications down the road...