Hello everyone,

In the near future I would like to perform machine learning and deep learning on clinical data coming from different sources and technologies (omics data, imaging data, and so on).

These data will have to be stored in a structured way in a data warehouse and will be accessed over a 10 Gbit network; I will also have some compute clusters available.

My question is the following: what would be the best architecture for running parallel computations/deep learning/machine learning on these data? I first thought of Spark, but is it really appropriate for wide data (often few observations and a very large number of variables, i.e. thousands to millions of biomarkers analyzed)? A rough sketch of what I have in mind is given below.
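To make the question a bit more concrete, here is a minimal PySpark sketch of the kind of pipeline I am imagining; the file path, the column names ("biomarker_*", "diagnosis") and the logistic-regression baseline are only placeholders for illustration, not a real implementation:

# Rough sketch of the intended pipeline (PySpark).
# Path, column names and the baseline model are placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("wide-omics-sketch").getOrCreate()

# One row per patient, one column per biomarker: few rows, very many columns.
df = spark.read.parquet("hdfs:///warehouse/omics/expression.parquet")

feature_cols = [c for c in df.columns if c.startswith("biomarker_")]

# Assemble the wide set of biomarker columns into a single feature vector.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Simple baseline classifier; my concern is how well Spark copes when
# the number of variables (p) is much larger than the number of samples (n).
lr = LogisticRegression(featuresCol="features", labelCol="diagnosis", maxIter=50)

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)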

Are there other frameworks better suited to this kind of problem?

What advice would you give me to follow the best "big data" practices?

If Spark seems to be the right answer to my problem, would you use the R language (e.g. via the sparklyr library) or Scala/Python instead to perform deep learning/machine learning?

Thanks in advance for all your advice,

Best regards
