Which machine learning algorithm is most suitable for predicting interactions between two different genomic datasets, using their expression values as the determining factor?
I believe your main concern is whether to choose a generative or a discriminative model. In your case, since the two datasets do not come from a single joint distribution, a discriminative model should perform better. A discriminative model only models the target variable conditional on the observed variables, rather than the full joint distribution of inputs and target.
Generative models are far more flexible, yes. However, they are unlikely to be the optimal choice for your analysis.
An example of a discriminative algorithm is the Support Vector Machine (SVM). You can find many papers that have used SVMs to study protein-protein interactions.
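To make the generative versus discriminative comparison concrete, here is a minimal sketch, assuming Python with scikit-learn. The data is synthetic (make_classification) and only stands in for features built from the two expression datasets; Gaussian naive Bayes is my own choice of generative baseline and is not something named above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder data: 200 candidate pairs, 50 expression-derived features.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

models = {
    # Generative: models p(x | y) and p(y), then applies Bayes' rule.
    "generative (GaussianNB)": GaussianNB(),
    # Discriminative: learns the decision boundary directly.
    "discriminative (linear SVM)": make_pipeline(StandardScaler(),
                                                 SVC(kernel="linear")),
}

# Mean 5-fold cross-validated accuracy for each model.
gen_vs_disc = {name: float(cross_val_score(model, X, y, cv=5).mean())
               for name, model in models.items()}

for name, acc in gen_vs_disc.items():
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

Which of the two wins depends on the data, so the same comparison should be rerun on the real expression features before drawing any conclusion.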
There are many machine learning systems for data classification, but it is worth noting that they can all be grouped into two main categories: local and global classifiers. Depending on the dataset, its feature vectors, and its size, either a local or a global classifier may produce the better model. Hence it is logical to train and test one local and one global classifier on your data to find out which gives better results. KNN and SVM are well-established choices in this area.
KNN is an instance-based, “lazy” machine learning method that builds its classification function locally, whereas SVM is an “eager” machine learning method that is trained on the entire training set to build a global model of the data.
Therefore, I recommend building a model of your data with both of them to find out which one is better.
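The "train both and compare" advice can be sketched as follows, assuming Python with scikit-learn. The function name compare_local_global, the synthetic data, and the default of 5 neighbours are all illustrative assumptions, not values taken from this thread.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_local_global(X, y, n_neighbors=5, folds=5, seed=0):
    """Return mean cross-validated accuracy for a local (KNN) and a global (SVM) model."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    candidates = {
        # Local, lazy learner: decisions come from the nearest training samples.
        "knn": make_pipeline(StandardScaler(),
                             KNeighborsClassifier(n_neighbors=n_neighbors)),
        # Global, eager learner: one decision function fitted on all training samples.
        "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    return {name: float(cross_val_score(model, X, y, cv=cv).mean())
            for name, model in candidates.items()}


# Synthetic placeholder for expression-derived features.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)
local_vs_global = compare_local_global(X, y)
print(local_vs_global)
```

Both models share the same folds and the same scaling step, so the difference in scores reflects the classifiers rather than the preprocessing.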
Note: If you are using a Support Vector Machine, the choice of kernel is very important, so I recommend testing your model with different kernels: linear, polynomial, RBF (radial basis function), and others.
Linear and nonlinear classifiers based on the kernel trick are appropriate tools for your problem. I suggest these references as a good starting point:
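One way to try several kernels systematically is a cross-validated grid search. This is a minimal sketch assuming Python with scikit-learn; the grid values for C and the synthetic data are placeholders I chose for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder for expression-derived features.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svc__C": [0.1, 1, 10],  # assumed search range, adjust for real data
}

search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                      param_grid, cv=5)
search.fit(X, y)

print("best kernel/C:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

For a final performance estimate the search should be wrapped in an outer cross-validation loop, since best_score_ is optimistically biased by the selection itself.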
Lanckriet et al., "A statistical framework for genomic data fusion", 2004 - https://pdfs.semanticscholar.org/32e2/3cb933ed9b23d7da6da79645b8cd173ef68e.pdf
Ben-Hur et al., "Support Vector Machines and Kernels for Computational Biology", 2012 - http://www.mgene.org/lectures/MLSSKernelTutorial2012/text.pdf
Khondoker et al., "A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies", 2013 - http://journals.sagepub.com/doi/pdf/10.1177/0962280213502437
Neelima et al., "A comparative Study of Machine Learning Classifiers over Gene expressions towards Cardio Vascular Diseases Prediction", 2017 - https://www.ripublication.com/ijcir17/ijcirv13n3_07.pdf
Deep learning also has great potential for solving predictive problems. See:
Xie et al., "A Predictive Model of Gene Expression Using a Deep Learning Framework", 2016 - http://calla.rnet.missouri.edu/cheng/dn_gene_expression.pdf
Valuable references on computationally efficient deep learning based on the Extreme Learning Machine (ELM) paradigm:
Parkavi et al., "Recent Trends in ELM and MLELM: A review", 2017 - https://www.astesj.com/publications/ASTESJ_020108.pdf
Cambria et al., "Extreme Learning Machines", 2013 - https://pdfs.semanticscholar.org/19c3/7e976fd302e9cfc2de8a3adddb4ce48c4335.pdf
Huang et al., "Trends in Extreme Learning Machines: A Review", 2014 - https://www.researchgate.net/publication/267339744_Trends_in_Extreme_Learning_Machines_A_Review
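For readers unfamiliar with ELMs, the basic single-hidden-layer idea covered in the reviews above can be sketched in a few lines, assuming Python with NumPy: the hidden-layer weights are random and fixed, and only the output weights are solved by least squares. The function names, the tanh activation, the hidden size of 50, and the toy data are my own assumptions for illustration.

```python
import numpy as np


def elm_fit(X, y, n_hidden=50, seed=0):
    """Fit a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases, never trained
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights via pseudo-inverse
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Return real-valued scores; take the sign for a +1/-1 class label."""
    return np.tanh(X @ W + b) @ beta


# Toy data: the label is the sign of the first feature.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 5))
y_toy = np.where(X_toy[:, 0] > 0, 1.0, -1.0)

W, b, beta = elm_fit(X_toy, y_toy)
train_acc = float(np.mean(np.sign(elm_predict(X_toy, W, b, beta)) == y_toy))
print("training accuracy:", train_acc)
```

Because no gradient descent is involved, training reduces to a single linear solve, which is where the computational efficiency comes from. Training accuracy alone says little about generalization, so a held-out set is still needed.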
Many reviews can be found by searching for the keywords "microarray gene expression data". For example:
Pirooznia et al., "A comparative study of different machine learning methods on microarray gene expression data", 2008 - https://www.researchgate.net/profile/Mehdi_Pirooznia/publication/5485070_A_comparative_study_of_different_machine_learning_methods_on_microarray_gene_expression_data/links/09e41509c0eb53d07e000000.pdf
Ding et al., "Minimum redundancy feature selection from microarray gene expression data", 2005 - https://pdfs.semanticscholar.org/57e3/ddb142afd7dde5b4f39ee47e8e057d996bdb.pdf
Note: if needed, Thompson et al. developed a cross-platform normalization of microarray and RNA-seq data for machine learning applications. See:
Thompson et al., "Cross-platform normalization of microarray and RNA-seq data for machine learning applications", 2016 - https://peerj.com/articles/1621.pdf
Best Regards
Have you considered a form of generative meta-learning or hyper-heuristics? These techniques generate general algorithms from a set of specific operators.
Of course, a Bayesian approach may also be used to solve your problem. You may have a look at "Controlling for Confounding Effects in Single Cell RNA Sequencing Studies using Both Control and Target Genes" by Chen et al., 2016, and its related bibliography.
The machine learning approach you use will also depend on the quality of the data. By quality I mean, among other factors, any skew in the data, the number of instances, and the volume of the training dataset.
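A quick check of class skew and dataset size before choosing a method might look like this minimal sketch in plain Python. The function name summarize_labels and the example labels are hypothetical.

```python
from collections import Counter


def summarize_labels(labels):
    """Report sample count, per-class counts, and majority/minority ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return {
        "n_samples": len(labels),
        "class_counts": dict(counts),
        # 1.0 means perfectly balanced; larger values mean stronger skew.
        "imbalance_ratio": majority / minority,
    }


# Hypothetical labels: far fewer interacting pairs than non-interacting ones.
labels = ["interacting"] * 20 + ["non-interacting"] * 180
summary = summarize_labels(labels)
print(summary)
```

With a strong skew like this, plain accuracy becomes misleading, so stratified splits and class-aware metrics are worth considering.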
The approach will also depend on whether you want decision trees, whether you want a set of rules, or whether a black-box classifier that simply outputs the discrimination is acceptable.
Among others, random forest is a good contender, followed by the J48 decision tree. SVM is a bit trickier because of kernel selection and the parameter tuning it requires.
Have you tried WEKA for the implementation? It may come in handy with its strategy for detecting the best classifier.
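For anyone working outside WEKA, a rough equivalent of the random forest versus single-tree comparison can be sketched with scikit-learn in Python. Note the assumption: scikit-learn's DecisionTreeClassifier implements CART, not C4.5/J48, so the entropy criterion here only approximates J48's information-based splitting. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder for expression-derived features.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

tree_models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # CART with entropy splits; a stand-in for J48, not an exact equivalent.
    "single_tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
tree_scores = {name: float(cross_val_score(model, X, y, cv=5).mean())
               for name, model in tree_models.items()}
print(tree_scores)
```

Neither model needs feature scaling, and the single tree can be inspected as a set of readable rules, which matters if interpretability is a requirement.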
WEKA is fully documented in "Data Mining-Practical Machine Learning Tools and Techniques" by I.H. Witten and E. Frank, 2nd Edition, 2005 available from: