Hi!
These are probably very basic questions, but my knowledge of statistics is basic too (I use it as a tool every day, but when new problems appear I don't know all the underlying issues involved in choosing one solution over another).
In an ecological study, I am interested in identifying which environmental variables best explain (or even predict) the abundance of a set of species. The aim is to classify sampling units into groups according to the chosen environmental parameters, with a certain degree of confidence that the units in a group will have similar species abundances. So I have 2 datasets: one with the species abundances per sampling unit, and the other with various environmental variables for the same sampling units.
My questions concern not so much the analysis itself, but the pre-treatment of data, since I am very skeptical about the conclusions I may infer about the "real" values if I use, for instance, the fourth root of a value and then normalize everything...
I will divide this into 2 questions:
1) In order to analyse the environmental data in multivariate space, I will need to calculate a distance measure (suggestions? Euclidean distance is a common choice for this type of data after transformations...), but the environmental variables have very different ranges. Some go from 0.001 to 0.05 (such as a density), others from 1 to 200 (such as a distance or an area). So what transformations should I apply? Is normalizing sufficient? If I analyse each one of my 50 variables one by one and apply individual transformations, I think the interpretation of results will be very confusing!
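To make question 1 concrete, here is a minimal sketch in Python of what I mean by the range problem. The numbers and the variable names `density` and `area` are invented for illustration, and z-scoring (subtract the mean, divide by the standard deviation) is only one possible normalization, not necessarily the right one for my data:

```python
import math

def standardize(columns):
    """Z-score each variable: subtract its mean, divide by its sample SD."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        out.append([(x - mean) / sd for x in col])
    return out

def euclidean(columns, i, j):
    """Euclidean distance between sampling units i and j over all variables."""
    return math.sqrt(sum((col[i] - col[j]) ** 2 for col in columns))

# Invented values for four sampling units (assumption, not real data).
density = [0.001, 0.010, 0.050, 0.020]   # range roughly 0.001 to 0.05
area = [1.0, 200.0, 50.0, 120.0]         # range roughly 1 to 200

raw = [density, area]
scaled = standardize(raw)

raw_d = euclidean(raw, 0, 1)        # dominated almost entirely by area
scaled_d = euclidean(scaled, 0, 1)  # both variables now contribute

# Share of the squared raw distance that comes from density.
density_share = (density[0] - density[1]) ** 2 / raw_d ** 2
```

On the raw values the distance between units 0 and 1 is essentially the difference in `area` alone, so `density` has almost no say in the result. After standardizing, both variables are on a comparable scale.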
I understand that I will have to reduce the number of variables prior to transformations, by eliminating the ones that are highly correlated, but even so, I may still end up with half of them or so...
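For the variable reduction, this is roughly what I have in mind, as a sketch only. The 0.9 threshold, the greedy "keep the first one seen" rule, and the variable names `depth`, `pressure` and `cover` are all my own assumptions for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # fails on a constant variable

def drop_correlated(variables, threshold=0.9):
    """Greedy filter: keep a variable only if its |r| with every
    variable already kept is at or below the threshold."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Invented values for five sampling units (assumption, not real data).
variables = {
    "depth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pressure": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to depth
    "cover": [5.0, 1.0, 4.0, 2.0, 3.0],
}

kept = drop_correlated(variables)
```

Here `pressure` is dropped because it is almost perfectly correlated with `depth`. Which variable of a correlated pair survives depends on the order in which they are listed, which is one of the things I am unsure about.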
2) If I do transform and normalize the data, so that every value is in the range 0-1, what are the implications for prediction? Because normalization is relative to the data I have. If a multivariate regression model tells me that variables A and B are important as predictors, when I introduce a new sampling unit for classification into the existing groups, how will I have to transform the values for variables A and B?
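To illustrate question 2, here is how I assume a new sampling unit would have to be treated: apply the same transformation, then rescale it with the minimum and maximum stored from the original data rather than recomputing them. This is only my guess at the procedure, sketched in Python with invented numbers; the function names and the fourth root are just examples:

```python
def transform(x):
    """Example transformation only: fourth root."""
    return x ** 0.25

def fit_minmax(values):
    """Store the min and max of the (transformed) original data."""
    return min(values), max(values)

def apply_minmax(value, params):
    """Rescale one value using the stored min and max."""
    lo, hi = params
    return (value - lo) / (hi - lo)

# Invented values of variable A in the existing sampling units.
train_a = [1.0, 16.0, 81.0, 256.0]
train_t = [transform(v) for v in train_a]   # about 1, 2, 3, 4

params_a = fit_minmax(train_t)              # fitted once, then reused
scaled_train = [apply_minmax(v, params_a) for v in train_t]

# New sampling units use the SAME transformation and the SAME params.
new_inside = apply_minmax(transform(16.0), params_a)    # about 1/3
new_outside = apply_minmax(transform(625.0), params_a)  # about 4/3
```

If this is right, a new unit whose value lies beyond the range of the original data ends up outside 0-1 (as `new_outside` does here), and I do not know whether that is a problem for the classification.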
I have read a great many similar studies, but they all apply different analyses and pre-treatments, and there doesn't seem to be a "magical recipe" or even a consensus, so I wonder if anyone here could give me some tips.
Thank you very much for any help you can give and I am sorry for the long text...