Hello, everyone. I obtained DEGs from an RNAseq analysis of normal and infected samples, then reduced their number with some downstream analysis. I now have 120 DEGs, and I want to select from them the best combination of biomarkers for distinguishing normal from infected samples (a biomarker panel). To do this, I want to use machine learning methods: first perform feature selection, then draw ROC curves and calculate MCC, specificity, sensitivity, etc. for the combined set of selected biomarkers using different algorithms such as neural networks and random forests. Because I have no experience in machine learning, I have some questions. Please also let me know if you think any of the steps I describe here are wrong!
1- What kind of RNAseq data should I feed into the machine learning software: raw counts, FPKM, TPM, or something else?
2- Should the data be normalized?
3- Should the input be log2-transformed?
4- Can the training and discovery datasets be the same?
5- Is the following a correct study design? Use one dataset to obtain the DEGs, then partition it into k subsets of equal size. Of the k subsets, a single subset is retained as the test set, and the remaining k - 1 subsets are used as the training set. The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the test data. The k results from the k iterations are averaged (or otherwise combined) to produce a single estimate. Finally, the model is tested on an external dataset to validate it.
6- Can the validation dataset come from a different technology, such as microarray? In that case, is any pre-processing needed to make the datasets comparable before applying the machine learning methods?
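To make question 5 concrete, the k-fold procedure described there could be sketched roughly as below. This is only an illustration under several assumptions that are not part of my actual analysis: it uses scikit-learn, a synthetic matrix from make_classification standing in for the 120-DEG expression table, stratified folds (so the folds are only approximately equal in size), SelectKBest with an arbitrary panel size of 10 for feature selection, and a random forest as the classifier. Feature selection is placed inside each fold here, which is one possible design choice, not necessarily the right one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in (assumption): 60 samples x 120 features, binary labels
# playing the role of normal (0) vs infected (1).
X, y = make_classification(
    n_samples=60, n_features=120, n_informative=10, random_state=0
)

k = 5  # number of folds (arbitrary choice for this sketch)
cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

aucs, mccs = [], []
for train_idx, test_idx in cv.split(X, y):
    # Feature selection and classifier are refit on the k - 1 training folds only.
    model = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),  # hypothetical panel size
        ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])
    model.fit(X[train_idx], y[train_idx])

    # Evaluate on the single held-out fold.
    prob = model.predict_proba(X[test_idx])[:, 1]
    pred = model.predict(X[test_idx])
    aucs.append(roc_auc_score(y[test_idx], prob))
    mccs.append(matthews_corrcoef(y[test_idx], pred))

# Combine the k per-fold results into a single estimate.
mean_auc = float(np.mean(aucs))
mean_mcc = float(np.mean(mccs))
print(f"mean AUC over {k} folds: {mean_auc:.3f}")
print(f"mean MCC over {k} folds: {mean_mcc:.3f}")
```

The external validation step from question 5 would then correspond to fitting the same pipeline once on the whole discovery dataset and scoring it on the independent dataset, which is not shown here.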
Thank you for answering my questions.