What is the most appropriate data mining technique for "Big Data"?

Raphaël Feraud

You are right.

We can address the interpretation problem and the classification problem together.

However, we can also separate the two problems: use models to predict behaviors, and models to interpret these behaviors.

In any case, one of the claims of "Big Data" is that it improves the performance of predictors. So, to achieve this goal, is the right approach online learning on the data stream, or bagging in a Hadoop system?

It is an important question, because the IT tools are really different. On the one hand we have a Data Stream Management System, and on the other hand we have a Hadoop cluster.
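To make the two options concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the synthetic stream (`make_stream`), the `OnlineLogistic` class, and the learning rate are all hypothetical choices, and neither a real Data Stream Management System nor Hadoop is involved. The first part learns one instance at a time, predicting before updating; the second part mimics a batch job that fits several models on bootstrap resamples and lets them vote.

```python
import math
import random

def sigmoid(z):
    # Clamp the score to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogistic:
    """Logistic regression updated one instance at a time (SGD)."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * (n_features + 1)  # last weight is the bias
        self.lr = lr

    def predict_proba(self, x):
        # zip stops at len(x), so the bias is added separately.
        return sigmoid(self.w[-1] + sum(wi * xi for wi, xi in zip(self.w, x)))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        err = y - self.predict_proba(x)
        for i, xi in enumerate(x):
            self.w[i] += self.lr * err * xi
        self.w[-1] += self.lr * err

def make_stream(n, rng):
    """Hypothetical synthetic stream: the label is 1 when x0 + x1 > 0."""
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] + x[1] > 0)

def train_bagged(data, n_models, rng, epochs=5):
    """Bagging: fit each model on a bootstrap resample of the stored data."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        model = OnlineLogistic(n_features=2)
        for _ in range(epochs):
            for x, y in sample:
                model.learn_one(x, y)
        models.append(model)
    return models

def vote(models, x):
    # Majority vote over the ensemble.
    return 1 if sum(m.predict(x) for m in models) * 2 > len(models) else 0

rng = random.Random(0)
train = list(make_stream(2000, rng))
test = list(make_stream(500, rng))

# Streaming setting: predict first, then update (prequential evaluation).
online = OnlineLogistic(n_features=2)
hits = 0
for x, y in train:
    hits += online.predict(x) == y
    online.learn_one(x, y)
prequential_acc = hits / len(train)

# Batch setting: the data must be stored before the ensemble is trained.
models = train_bagged(train, n_models=5, rng=rng)
online_acc = sum(online.predict(x) == y for x, y in test) / len(test)
bagged_acc = sum(vote(models, x) == y for x, y in test) / len(test)
print(prequential_acc, online_acc, bagged_acc)
```

On a stationary toy problem like this one, both routes should reach similar accuracy. The practical difference is the one raised above: the online learner never needs the stored data, while the bagged ensemble does.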

Jonathan D. Fraine

Although the question seems direct, the answer reveals that it is complex. It strongly depends on the scale and complexity of the problem at hand.

In physics laboratories, linear regression can still be a functional machine learning technique -- "all other factors held constant". But the more correlated, complex, and high-dimensional the problem's input features become, the less likely it is that simple assumptions (such as linearity) will remain valid.

If you want a catch-all 'starter' ML algorithm to try against large numbers of different problems in different spaces, then I would use random forests -- it's my go-to for a baseline when comparing models for a new problem. The solution is always 'good enough' to start with, but sometimes not subtle or versatile enough. I mostly do regression analysis, where RF is "okay"; but RFs are great as a starter solution set for classification.

The real answer is to apply a broad spectrum of models within sub-spaces of the data to understand the implicit correlations between dimensions and the versatility of each ML method to tackle blocks of data or sets of features. Then increase complexity and feature space volume (n_features x n_samples) until the full solution is 'at scale'.

In a sense, you need to 'converge' to a solution over many iterations of both ML methods and feature sub-spaces that asymptotically approach the full solution.
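As a rough illustration of that loop, the sketch below tries two model families on every feature subset of growing size and records the held-out accuracy of each combination. Everything in it is an assumption made for the example: the toy data (`make_data`, where the label depends only on the interaction of x0 and x1), the two deliberately simple models (nearest class centroid and 5-nearest neighbours), and the sample sizes. It is not a recipe for a real "at scale" problem.

```python
import itertools
import random

def make_data(n, rng):
    """Hypothetical toy data: 4 features, label depends only on x0 and x1."""
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(4)]
        rows.append((x, int(x[0] * x[1] > 0)))  # XOR-like interaction
    return rows

def project(x, cols):
    # Restrict an instance to the chosen feature sub-space.
    return [x[c] for c in cols]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid_fit(train, cols):
    """Simple model: assign each instance to the nearest class mean."""
    means = {}
    for label in (0, 1):
        pts = [project(x, cols) for x, y in train if y == label]
        means[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return lambda x: min(means, key=lambda k: sq_dist(project(x, cols), means[k]))

def knn_fit(train, cols, k=5):
    """More flexible model: majority vote among the k nearest neighbours."""
    pts = [(project(x, cols), y) for x, y in train]
    def predict(x):
        p = project(x, cols)
        nearest = sorted(pts, key=lambda t: sq_dist(p, t[0]))[:k]
        return int(sum(y for _, y in nearest) * 2 > k)
    return predict

def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)

rng = random.Random(1)
train, test = make_data(300, rng), make_data(200, rng)
models = {"centroid": centroid_fit, "knn": knn_fit}

# Grow the feature sub-space step by step and score every model on it.
results = {}
for size in range(1, 5):
    for cols in itertools.combinations(range(4), size):
        for name, fit in models.items():
            results[(name, cols)] = accuracy(fit(train, cols), test)

best = max(results, key=results.get)
print(best, results[best])
```

The table in `results` is the useful output: it shows which features carry signal only in combination, and which model family is versatile enough to exploit them, before any expensive model is trained on the full feature space.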