I have RNA-Seq samples from an experiment in which I infected epithelial cells with bacteria. The experiment was repeated identically three times and then once more with slightly altered parameters (higher bacterial inoculum and longer incubation time). Each experiment included the same treatment groups. My goal is to determine differentially expressed genes (DEGs) between those groups.

When comparing the gene expression of all samples within their groups (intragroup correlation), some of them are clearly different from the rest of the group. Sometimes they also form two distinct clusters (see examples attached, blue color in the scale bar indicates low difference/high correlation). However, these "outlier" samples do NOT necessarily stem from the experiment with altered parameters.
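
To make the comparison concrete, here is a minimal sketch of the kind of intragroup correlation described above. It assumes a genes-by-samples matrix of log-transformed, normalized expression values and one group label per sample. The function name `intragroup_mean_correlation`, the toy data, and the use of Pearson correlation are illustrative assumptions, not necessarily the procedure behind the attached plots.

```python
import numpy as np

def intragroup_mean_correlation(log_expr, groups):
    """Mean Pearson correlation of each sample to the other samples in its group.

    log_expr: genes x samples array (assumed log-transformed, normalized).
    groups:   one group label per sample (column).
    Returns a dict mapping sample index -> mean correlation to its group-mates.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    groups = np.asarray(groups)
    corr = np.corrcoef(log_expr, rowvar=False)  # samples x samples matrix
    result = {}
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if idx.size < 2:
            continue  # a single sample has no group-mates to compare against
        sub = corr[np.ix_(idx, idx)]
        # Drop the self-correlation (diagonal) before averaging.
        mean_r = (sub.sum(axis=1) - np.diag(sub)) / (idx.size - 1)
        for i, r in zip(idx, mean_r):
            result[int(i)] = float(r)
    return result

# Toy data (hypothetical): two groups of four samples, 200 genes each.
rng = np.random.default_rng(0)
base_a = rng.normal(5.0, 2.0, size=200)
base_b = rng.normal(5.0, 2.0, size=200)
samples = [base_a + rng.normal(0, 0.1, 200) for _ in range(3)]
samples.append(rng.normal(5.0, 2.0, size=200))  # sample 3: unrelated profile
samples += [base_b + rng.normal(0, 0.1, 200) for _ in range(4)]
log_expr = np.column_stack(samples)
groups = ["A"] * 4 + ["B"] * 4

scores = intragroup_mean_correlation(log_expr, groups)
for i in sorted(scores):
    print(f"sample {i} (group {groups[i]}): mean intragroup r = {scores[i]:.3f}")
```

In this toy setup the unrelated sample stands out with a much lower mean correlation than its group-mates. Whether any fixed cutoff on such a score is appropriate for real data is exactly the open question.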

How should I proceed with my data analysis?

  • Should I simply exclude all samples from the 4th experiment because it did not follow the exact same protocol?
  • Should I determine a minimum correlation below which I exclude samples regardless of which experiment they stem from? If so, is there a specific methodology for this?
  • Should I not exclude any samples?
  • My concern is that I will miss genes that are in fact significantly regulated because the samples within each group are so heterogeneous.
