I want to preprocess a dataset to extract information from it, e.g. the number of occurrences of each value. Which artificial intelligence technique would you recommend?
I would also suggest normalizing the data (the attribute values), especially if you are using ML algorithms that are based on information gain.
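For illustration, here is a minimal sketch of two common ways to normalize a numeric attribute: min-max scaling and z-score standardization. The function names and the sample values are made up for this example, and which scaling suits a given algorithm is a separate question.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Center values on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values, for illustration only.
ages = [18, 25, 32, 47, 60]
scaled = min_max_scale(ages)
standardized = z_score_scale(ages)
print(scaled)
print(standardized)
```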
Nihad, your question is valid: you want to check old data to contrast it with a possible new interpretation. The goal, "the occurrence of each value", is similar to finding the CDF after you sort the data in ascending or descending order. If you can build a Lorenz curve as a data table, or a decile table plus the value of the mean, then post it here, without explaining units or research details, so we can reprocess it. The method I use is a non-linear, non-parametric one for univariate distributions, so it may help in these particular cases. In my opinion, doing this for the main variable (selected according to your criteria) is always the first step of any multivariate analysis. Thanks, emilio
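As a rough sketch of the Lorenz curve table mentioned above: sort the values in ascending order and pair the cumulative share of observations with the cumulative share of the total. This is only the textbook construction, not the non-parametric method the post refers to. It assumes non-negative values with a non-zero sum, and the sample data is hypothetical.

```python
from itertools import accumulate
from statistics import mean


def lorenz_curve(values):
    """Return (cumulative share of observations, cumulative share of total) pairs.

    Assumes non-negative values whose sum is greater than zero.
    """
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    for i, running in enumerate(accumulate(ordered), start=1):
        points.append((i / n, running / total))
    return points


# Hypothetical single-variable sample, for illustration only.
incomes = [10, 20, 30, 40, 100]
curve = lorenz_curve(incomes)

print("mean:", mean(incomes))
for population_share, value_share in curve:
    print(f"{population_share:.2f}  {value_share:.2f}")
```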
I don't see any Artificial Intelligence here when all you have to do is find the empirical Probability Density Function (PDF) or Cumulative Distribution Function (CDF), or both, for a single variable. Why the 'big words', then?
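To make that concrete, here is a small sketch that counts the occurrences of each value and derives the empirical relative frequency and cumulative distribution from those counts. It treats the variable as discrete; a continuous variable would need binning first. The function name and the sample are illustrative.

```python
from collections import Counter


def empirical_distribution(values):
    """Return rows of (value, count, relative frequency, cumulative frequency)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], counts[value] / n, cumulative / n))
    return rows


# Hypothetical sample of a single discrete variable.
sample = [3, 1, 2, 3, 3, 2]
table = empirical_distribution(sample)

for value, count, frequency, cdf in table:
    print(value, count, round(frequency, 3), round(cdf, 3))
```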
Well, usually the first step is to generate some statistical information about the data, starting from just the mean and variance and going up to whatever probability density function can model the data and give insight into the information in it.
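A minimal sketch of that first step, using only Python's standard `statistics` module. The `summarize` helper and the sample values are made up for illustration, and fitting a density function to the data is left out.

```python
from statistics import mean, median, pvariance, variance


def summarize(values):
    """Collect a few basic descriptive statistics for one numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "pvariance": pvariance(values),  # treats the values as the whole population
        "variance": variance(values),    # sample variance, divides by n - 1
    }


# Hypothetical measurements, for illustration only.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(measurements)

for name, value in summary.items():
    print(name, value)
```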
The next step, if you need to get more information using Artificial Intelligence, is to use clustering algorithms to group the data points into sets that have some latent relation to each other (for example, if your data is purchase details for thousands of customers, then clustering will give you market segments and different customer profiles).
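As one possible illustration of clustering, here is a small pure-Python k-means sketch. The post does not name a specific algorithm, so k-means is an assumption made for this example, as are the deterministic farthest-first initialization, the function names, and the toy customer data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Pick the first point, then repeatedly the point farthest from the chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        candidate = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(candidate)
    return centroids


def kmeans(points, k, max_iterations=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers as (purchases per month, average basket size).
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

On real data the number of clusters is itself a choice, and the attributes usually need to be on comparable scales before distances mean much.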
The next step might be to use anomaly detection algorithms to detect outliers in the data set, which also falls into the category of Artificial Intelligence.
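A very simple way to flag outliers in a single variable is the interquartile range rule (Tukey's fences), sketched below. It is only one of many possible detectors and is chosen here for brevity; the multiplier 1.5 is the conventional default, and the readings are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, multiplier=1.5):
    """Return the values lying outside [Q1 - m * IQR, Q3 + m * IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings with one obviously unusual value.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
print(outliers)
```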
All the previous steps assume that you don't have insight into the structure of the data. Of course, if there is known information about this data set, then instead of the techniques I recommended, which fall under unsupervised learning, you might consider supervised learning techniques.