I have morphological data containing both continuous and categorical variables. Could someone please help me choose an appropriate algorithm for calculating genetic distance?
Some common distance functions are the Euclidean distance for continuous variables and the Hamming distance for categorical variables. You have a mixture of categorical and continuous variables, though.
One solution is to use a distance function designed for both types of variables, for instance:
- Mahalanobis distances (e.g. see McCane, Brendan, and Michael Albert. "Distance functions for categorical and mixed variables." Pattern Recognition Letters 29, no. 7 (2008): 986-993.)
- Gower's (dis)similarity coefficient
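To make the Gower idea concrete, here is a minimal sketch in Python. It assumes no missing values and equal weights for all traits. The trait names and the three example plants are made up for illustration. Each continuous trait contributes its absolute difference divided by the observed range, each categorical trait contributes 0 for a match and 1 for a mismatch, and the result is the average over traits. In R, the `daisy` function in the `cluster` package with `metric = "gower"` is the usual way to get this.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    Keys listed in numeric_ranges are treated as continuous traits and
    scaled by their observed range; all other keys are treated as
    unordered categories (0 if equal, 1 if different).
    Assumes no missing values and equal trait weights.
    """
    total = 0.0
    for trait in a:
        if trait in numeric_ranges:
            trait_range = numeric_ranges[trait]
            # A constant trait carries no information, so it contributes 0.
            if trait_range > 0:
                total += abs(a[trait] - b[trait]) / trait_range
        else:
            total += 0.0 if a[trait] == b[trait] else 1.0
    return total / len(a)


# Hypothetical example data: two continuous traits and one categorical trait.
plants = [
    {"leaf_length_cm": 10.0, "plant_height_cm": 50.0, "leaf_color": "green"},
    {"leaf_length_cm": 11.0, "plant_height_cm": 60.0, "leaf_color": "light green"},
    {"leaf_length_cm": 14.0, "plant_height_cm": 90.0, "leaf_color": "green"},
]

# Observed range (max - min) of each continuous trait across all plants.
ranges = {
    trait: max(p[trait] for p in plants) - min(p[trait] for p in plants)
    for trait in ("leaf_length_cm", "plant_height_cm")
}

for i in range(len(plants)):
    for j in range(i + 1, len(plants)):
        d = gower_distance(plants[i], plants[j], ranges)
        print(f"plant {i} vs plant {j}: {d:.3f}")
```

Every value lies between 0 and 1, so continuous and categorical traits end up on a comparable scale.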
Alternatively, you could look at PCA-like approaches which, though they don't directly compute distances, are used to cluster observations and can be converted (with a little algebra) to distances. For mixed categorical/continuous variables, you may want to look at Factor Analysis.
Thanks Filippo for answering my question. Mahalanobis distances and Gower's (dis)similarity coefficient are new to me for distance calculation, so I will look into these algorithms. I am using R for the analysis. Are these algorithms available in R?
As I was not aware of the algorithms suggested above, I am currently converting the categorical observations into numerical values for the calculation. For example, the leaf color trait was recorded in three categories (green, light green, pale green), so I replace each category with a numeric code: green = 1, light green = 2, pale green = 3, and so on. I convert all categorical trait data into numerical values this way and then use the Manhattan distance in R or the Darwin software.
Is this a correct way to handle a mix of continuous and categorical data, or should we use only specially designed algorithms like the ones you suggest above?
Treating unordered categorical data as numeric is dangerous, since you are introducing an order (0, 1, 2, etc.) where there is none. Even if the categories are ordered, the intervals between them may be uneven, which equally spaced integers do not reflect.
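A small illustration of the problem, as a sketch in Python using the leaf color codes from the question. The second coding (`alt_codes`) is a hypothetical relabeling added only to show that the result depends on an arbitrary choice.

```python
# Integer codes as described in the question.
codes = {"green": 1, "light green": 2, "pale green": 3}

# Hypothetical alternative coding of the same three categories.
alt_codes = {"pale green": 1, "green": 2, "light green": 3}


def coded_distance(x, y, coding):
    """Manhattan distance on a single integer-coded trait."""
    return abs(coding[x] - coding[y])


def mismatch_distance(x, y):
    """Simple matching for an unordered category: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


pairs = [("green", "light green"), ("green", "pale green")]
for x, y in pairs:
    print(
        f"{x} vs {y}: "
        f"coded={coded_distance(x, y, codes)}, "
        f"recoded={coded_distance(x, y, alt_codes)}, "
        f"mismatch={mismatch_distance(x, y)}"
    )
```

With the original codes, green is twice as far from pale green as from light green. With the relabeled codes, the two distances come out equal. The mismatch distance does not depend on how the categories are labeled.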
I would suggest going for methods appropriate for mixed continuous and categorical variables, since this is the nature of your problem. Then again, this ultimately depends on your objective: if you need something quick and dirty, you can go with the "numerical" conversion. Otherwise it may be advisable to invest some time in doing things properly.
Thanks Filippo, you have clarified this topic for me. As I want my analysis to be precise, I will use the algorithms you suggested.
Is it OK to use the Jaccard coefficient to calculate genetic distance from molecular marker data in binary format (alleles converted to 1/0)? Is there a more appropriate algorithm for marker data?
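For reference, this is what the Jaccard calculation does on 1/0 marker profiles, as a minimal sketch in Python. The two profiles are made-up example data, and the sketch says nothing about whether Jaccard is the best choice for a given marker type. Positions where both individuals score 0 (shared absences) are ignored.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 marker profiles.

    distance = 1 - (bands present in both) / (bands present in either).
    Positions where both profiles are 0 do not enter the calculation.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of markers")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        # No band present in either profile: treat them as identical.
        return 0.0
    return 1.0 - both / either


# Hypothetical marker profiles for two individuals (1 = allele present).
ind_a = [1, 1, 0, 0, 1, 0]
ind_b = [1, 0, 0, 1, 1, 0]

print(jaccard_distance(ind_a, ind_b))
```

Here two alleles are shared and four are present in at least one individual, so the distance is 1 - 2/4 = 0.5.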