Hello everyone,
In the near future, I would like to perform machine learning and deep learning on clinical data coming from different sources and technologies (omics data, imaging data, and so on).
These data will be stored in a structured way in a data warehouse and accessed over a 10 Gbit network, and I will have some compute clusters available.
My question is the following: what would be the best architecture for running parallel computations/deep learning/machine learning on these data? I first thought of Spark, but is it really appropriate for wide data (often few observations but many variables, with thousands to millions of biomarkers analyzed)?
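To make the "wide data" shape concrete, here is a rough sketch of the kind of job I have in mind with PySpark (the file path, column names, and model choice are only placeholders, not a real pipeline):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("wide-biomarkers").getOrCreate()

# Hypothetical table: few rows (patients), thousands to millions of biomarker columns.
df = spark.read.parquet("hdfs:///warehouse/omics/biomarkers.parquet")

# Assemble all biomarker columns into a single feature vector.
feature_cols = [c for c in df.columns if c != "outcome"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df).select("features", "outcome")

# Spark ML mostly parallelizes over rows, so a short-and-wide table
# may not benefit much from the cluster, which is exactly my concern.
lr = LogisticRegression(labelCol="outcome", featuresCol="features")
model = lr.fit(assembled)
print(model.summary.areaUnderROC)
```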
Are there other frameworks better suited to this kind of problem?
What advice would you give me to follow "big data" best practices?
If Spark seems like the right answer to my problem, would you use R (e.g., via the sparklyr library) or Scala/Python instead to perform deep learning/machine learning?
Thanks in advance for all your advice,
Best regards