I am working on a Data Mining project that will involve using a Neural Network to predict some atmospheric variables. The work involves Big Atmospheric Data, and I am trying to adopt a data structure that will be suitable.
Third, now you need to choose your features as inputs to the neural network. Based on your experience, or better, using correlation analysis, choose the features that have a high correlation with the target.
On the one hand, the variables chosen in your database must have a considerable effect on the studied phenomenon. On the other hand, these variables must be independent of one another; there should be no correlations between them.
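A minimal sketch of both checks, assuming the data sits in a pandas DataFrame; the column names (pressure, humidity, temp_2m, target) and the synthetic values are only placeholders for the real atmospheric table:

import itertools
import numpy as np
import pandas as pd

# Synthetic stand-in for the real atmospheric table; all column names are placeholders.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"pressure": rng.normal(size=n),
                   "humidity": rng.normal(size=n)})
df["temp_2m"] = 0.8 * df["pressure"] + rng.normal(scale=0.3, size=n)      # nearly redundant input
df["target"] = 2.0 * df["pressure"] - 1.5 * df["humidity"] + rng.normal(scale=0.5, size=n)

# Criterion 1: each input should correlate with the target.
print(df.corr()["target"].drop("target").abs().sort_values(ascending=False))

# Criterion 2: the inputs should not be strongly correlated with each other.
inputs = [c for c in df.columns if c != "target"]
for a, b in itertools.combinations(inputs, 2):
    r = df[a].corr(df[b])
    if abs(r) > 0.9:
        print(f"{a} and {b} are highly correlated (r = {r:.2f}); consider dropping one of them")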
As Yasmina Kellouche said, you should have reason to suspect that each single feature influences the output. The independence of the types of inputs is not a must (if you have a sufficient number of input datasets).
Sometimes the types of data that seem not to influence the output are important (in fact, their influence is considerable when they appear together with other types of data). This joint influence can be checked through association analysis.
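Instead of full association analysis, one quick way to probe such joint influence is to compare a model's cross-validated score with each variable alone and with the pair together. The sketch below uses a random forest on synthetic data where two inputs only matter in combination:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic case where two inputs only matter together: y depends on x1 * x2.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=(2, 1000))
y = x1 * x2 + rng.normal(scale=0.1, size=1000)

for name, X in [("x1 alone", x1[:, None]),
                ("x2 alone", x2[:, None]),
                ("x1 and x2 jointly", np.column_stack([x1, x2]))]:
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    r2 = cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    print(f"{name}: mean R^2 = {r2:.2f}")
# Each variable alone scores near zero; together they explain most of the variance in y.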
Which types of data should be chosen can also be checked through Principal Component Analysis.
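A short sketch of that check with scikit-learn's PCA on synthetic stand-in data; the explained-variance ratios and the loadings of the first component indicate which inputs carry most of the information:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 6 candidate inputs, the last one nearly duplicates the first.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
X[:, 5] = X[:, 0] + 0.05 * rng.normal(size=500)

X_std = StandardScaler().fit_transform(X)         # PCA is sensitive to scale
pca = PCA().fit(X_std)
print(pca.explained_variance_ratio_)              # variance carried by each component
print(np.cumsum(pca.explained_variance_ratio_))   # e.g. keep enough components for ~95 %
print(np.abs(pca.components_[0]))                 # loadings: which inputs dominate PC1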
There is an empirical way of choosing the types of input data: you take only the most important ones, create the ANN, and then add the other types of data one by one (creating new ANNs), observing the prediction errors. You stop when the errors start to rise. The inverse procedure (taking all types of input data and then reducing their number) is also applied.
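A rough sketch of this forward procedure with a small scikit-learn MLP on synthetic data; the candidate ordering, network size, and data are all placeholders:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 candidate inputs, only the first 3 actually drive the target.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=400)

candidate_order = [0, 1, 2, 3, 4]        # e.g. ranked by correlation with the target
chosen, best_err = [], np.inf
for idx in candidate_order:
    trial = chosen + [idx]
    ann = make_pipeline(StandardScaler(),
                        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
    err = -cross_val_score(ann, X[:, trial], y, cv=3,
                           scoring="neg_mean_squared_error").mean()
    if err < best_err:                   # the extra input still lowers the error: keep it
        chosen, best_err = trial, err
    else:                                # the error started to rise: stop adding inputs
        break
print("selected inputs:", chosen, "cv MSE:", round(best_err, 4))

The inverse (backward) procedure is the same loop started from the full set of inputs, removing one at a time while the error keeps falling.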
Yes, we are both right. We continue the process of choosing the set of data in order to lower the errors. When adding (or removing) any type of data causes the error to increase, we stop adding (or removing) types of input data and stay with those that produced the lowest errors.
Ibrahim Aishat Musa, you mentioned that you are going to predict variables (are there multiple variables you are predicting?). The simplest structure would always be a fixed set of features with well-defined labels. Even if they are not in the right structure, you can always do some preprocessing (with extra pain) and fix it; but the most important part is choosing data that makes sense (and not putting everything in and using a black box to predict).
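One possible "fixed features + well-defined label" layout is a flat table with one row per sample, one column per predictor, and the predicted variable(s) as label columns. Everything in the sketch below (station IDs, column names, values) is made up for illustration:

import pandas as pd

# All station IDs, column names, and values below are illustrative placeholders.
samples = pd.DataFrame({
    "station_id":     ["A01", "A01", "B07"],
    "timestamp":      pd.to_datetime(["2020-01-01 00:00", "2020-01-01 06:00",
                                      "2020-01-01 00:00"]),
    "pressure_hpa":   [1012.3, 1010.8, 1008.1],
    "humidity_pct":   [71.0, 76.5, 83.2],
    "wind_speed_ms":  [3.2, 4.1, 6.0],
    "temp_2m_next6h": [14.2, 15.1, 12.7],   # label: the variable to be predicted
})
X = samples[["pressure_hpa", "humidity_pct", "wind_speed_ms"]].to_numpy()
y = samples["temp_2m_next6h"].to_numpy()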
My experience with ANN algorithms is that the removal of variables does not necessarily improve performance, since the neural network identifies the importance of each variable and attributes a suitable weight. In fact, neural networks work best with a large number of variables.
I would advise three specifics. Firstly, ensure that your standardisation is accurate, especially for your target variable (logarithmic, etc.). Secondly, run the algorithm across all your variables first to see how the data is handled; this provides you with a basis to work from. Finally, run Principal Factor Analysis and correlation tests between the target variable and the input variables, as well as tests between the input variables themselves. For the latter you look for highly correlated variables and make a judgement as to which variable to remove; for the former you make a judgement as to which variables have a low correlation with the target variable and should be removed. You may need to attempt various combinations before settling on a final set.
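A small sketch of the first point (standardising the inputs and log-transforming a skewed target); the data here is synthetic, and the correlation tests from the third point can be run as in the earlier correlation sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a skewed, strictly positive target (precipitation-like).
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = rng.lognormal(mean=0.0, sigma=1.0, size=300)

X_std = StandardScaler().fit_transform(X)   # standardise the inputs
y_log = np.log1p(y)                         # log-transform the skewed target
# Train the network on (X_std, y_log); transform predictions back with np.expm1.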
P.S. Make sure you understand how the data (instances) is distributed by using visuals, as skewed data will produce poor test results and low accuracy.
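For example, a quick histogram plus a skewness value per variable already shows whether a distribution is badly skewed (synthetic data used as a stand-in):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

rng = np.random.default_rng(5)
values = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # stand-in for one variable

print("skewness:", round(skew(values), 2))   # far from 0 means a skewed distribution
plt.hist(values, bins=50)
plt.xlabel("value")
plt.ylabel("count")
plt.title("Distribution check for one variable")
plt.show()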