Hello, everyone!
I have recently been taking an online course called "Learning from Data". The lecturer showed the learning diagram that you can find in these slides: http://work.caltech.edu/slides/slides02.pdf. I found it very interesting and impressive.
My questions are based on that diagram.
As the lecturer said, the ultimate goal of machine learning is to use data to find a hypothesis that can stand in for the target function, which is unknown and cannot be learned directly. For a regression problem, besides data quality issues such as outliers and missing values, imbalanced data is an important issue as well. Since data quality is one of the most important factors influencing the final hypothesis we find, I am wondering whether it would help to learn the probability distribution of the input data when applying machine learning methods such as ANNs to a problem with imbalanced data. Of course, this rests on the assumption that the probability distribution of the input data is learnable.
1. So the first question: is the probability distribution of the input data learnable?
2. The second question: can we obtain any insight from the probability distribution of the input data that helps to deal with imbalanced data?
Note: Suppose the input data x is generated by an unknown distribution P(x), written as x∼P(x).
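To make the second question more concrete, here is a minimal sketch of the kind of thing I have in mind. It assumes one-dimensional inputs, uses a Gaussian kernel density estimate as just one possible way to approximate P(x), and turns that estimate into per-sample weights proportional to 1/P(x), so that inputs from sparsely populated regions count more. The function names, the toy data, and the rule-of-thumb bandwidth are illustrative assumptions of mine, not something taken from the course.

```python
import numpy as np

def gaussian_kde_1d(samples, query, bandwidth):
    """Estimate the density P(x) at `query` points with a Gaussian KDE."""
    # Pairwise scaled differences, shape (len(query), len(samples)).
    diffs = (query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

def inverse_density_weights(samples, bandwidth, eps=1e-12):
    """Hypothetical helper: weights proportional to 1 / P_hat(x), mean 1."""
    density = gaussian_kde_1d(samples, samples, bandwidth)
    weights = 1.0 / (density + eps)
    return weights / weights.mean()

# Toy imbalanced inputs (assumed data): 950 points near 0, 50 points near 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(5.0, 1.0, 50)])

# Rule-of-thumb bandwidth; an assumption, other choices are possible.
bandwidth = 1.06 * x.std() * len(x) ** (-1 / 5)

weights = inverse_density_weights(x, bandwidth)
print("median weight, dense region:", np.median(weights[:950]))
print("median weight, rare region: ", np.median(weights[950:]))
```

Such weights could then be passed to a weighted loss when training a regression model, although whether that actually helps is exactly what I am asking.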
Thanks for your answers!