Hello, which algorithm is suitable for calculating genetic distance from phenotypic data? I have a mix of continuous and categorical trait data.

Filippo Biscarini

A quick initial answer:

Some common distance functions are the Euclidean distance for continuous variables and the Hamming distance for categorical variables. You have a mixture of categorical and continuous variables, though.

One solution is to use distance functions designed for both types of variables, like for instance:

- Mahalanobis distances (e.g. see McCane, Brendan, and Michael Albert. "Distance functions for categorical and mixed variables." Pattern Recognition Letters 29, no. 7 (2008): 986-993.)

- Gower's (dis)similarity coefficient

Alternatively, you could look at PCA-like approaches which, although they do not compute distances directly, are used to cluster observations and whose output can be converted (with a little algebra) into distances. For mixed categorical/continuous variables, you may look at Factor Analysis.
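To make Gower's coefficient concrete, here is a minimal sketch in plain Python. It assumes complete data (no missing values), equal weights for all traits, range scaling for continuous traits, and simple matching for categorical traits. The accessions, trait names, and values are hypothetical and only serve as an illustration. For R users, the `daisy` function in the `cluster` package offers `metric = "gower"` for the same purpose.

```python
def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity between two records a and b.

    kinds[i] is "num" (continuous) or "cat" (categorical).
    ranges[i] is max - min of trait i over the whole data set for
    continuous traits; it is ignored for categorical traits.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range-scaled absolute difference, always between 0 and 1.
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Simple matching: 0 if same category, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical accessions: (plant height in cm, leaf colour, seed weight in g)
accessions = {
    "A": (100.0, "green", 2.0),
    "B": (120.0, "light green", 3.0),
    "C": (140.0, "green", 2.5),
}
kinds = ("num", "cat", "num")

# Observed range of each continuous trait (None for categorical traits).
columns = list(zip(*accessions.values()))
ranges = [max(c) - min(c) if k == "num" else None
          for c, k in zip(columns, kinds)]

# Full pairwise dissimilarity table.
gower = {(p, q): gower_distance(accessions[p], accessions[q], kinds, ranges)
         for p in accessions for q in accessions}

for (p, q), d in sorted(gower.items()):
    if p < q:
        print(f"{p} vs {q}: {d:.3f}")
```

Because every trait contributes a value between 0 and 1, no single trait dominates because of its measurement unit, and the categorical trait is never given an artificial order.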

Anshuman Tiwari

Thanks, Filippo, for answering my question. Mahalanobis distances and Gower's (dis)similarity coefficient are new to me for distance calculation, so I will look into these algorithms. I am using R for analysis. Are these algorithms available in R?

As I was not aware of the algorithms suggested above, I am currently converting the categorical observations into numerical values for the calculation. For example, the trait leaf color was recorded in three categories: green, light green, and pale green. I replace these categories with numerical codes, for example green = 1, light green = 2, and pale green = 3. In this way I convert all categorical trait data into numerical values and then use the Manhattan distance for the calculation in R or in the Darwin software.

Is this a correct way to handle a mix of continuous and categorical data, or should we use only specially designed algorithms like the ones you suggest above?

Thanks

Treating unordered categorical data as numeric is dangerous, since you are introducing an order (0, 1, 2, etc.) where there is none. Even if the categories are ordered, the intervals between them may be uneven, which is not reflected in equally spaced integers.

I would suggest going for methods appropriate for mixed continuous and categorical variables, since this is the nature of your problem. Then again, this ultimately depends on your objective: if you need something quick and dirty, you can go with the "numerical" conversion. Otherwise it may be advisable to invest some time in doing things properly.
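A small illustration of the ordering problem, using the leaf color codes from the question (green = 1, light green = 2, pale green = 3). This is only a sketch in plain Python; the function names are hypothetical. With integer codes, green and pale green end up twice as far apart as green and light green, whereas simple matching treats every pair of different categories as equally distant.

```python
# Integer codes as described in the question.
codes = {"green": 1, "light green": 2, "pale green": 3}


def coded_distance(x, y):
    """Manhattan distance on the integer codes (single trait)."""
    return abs(codes[x] - codes[y])


def matching_distance(x, y):
    """Simple matching: 0 if the categories are equal, 1 otherwise."""
    return 0 if x == y else 1


pairs = [
    ("green", "light green"),
    ("light green", "pale green"),
    ("green", "pale green"),
]
for x, y in pairs:
    print(f"{x} vs {y}: coded={coded_distance(x, y)}, "
          f"matching={matching_distance(x, y)}")
```

The coded distance depends entirely on which integer each category happens to receive, so relabelling the categories would change the resulting distance matrix.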

Thanks, Filippo, you have clarified this topic for me, and since I want my analysis to be precise, I will use the algorithms you suggested.

Is it OK to use the Jaccard coefficient for genetic distance calculation with molecular marker data in binary format (alleles converted into 1/0)? Is there a more appropriate algorithm for marker data?

This depends on the type of markers you have. Jaccard's distance is suited for dominant markers, like AFLPs, which are indeed often coded as 0/1.
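For 0/1 coded dominant markers, the Jaccard distance can be sketched as follows in plain Python. The two marker profiles are hypothetical, and returning 0 when neither genotype shows any band is a convention assumed here, not something stated in the thread. Shared absences (0/0) are ignored, which is the usual reason for preferring Jaccard with dominant markers.

```python
def jaccard_distance(u, v):
    """Jaccard distance between two 0/1 marker profiles of equal length.

    Only positions where at least one genotype has a band (1) count;
    shared absences (0/0) are ignored.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same number of markers")
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    if either == 0:
        # Assumed convention: two profiles without any band are identical.
        return 0.0
    return 1.0 - both / either


# Hypothetical band presence/absence for two genotypes at six loci.
genotype_1 = [1, 1, 0, 0, 1, 0]
genotype_2 = [1, 0, 0, 1, 1, 0]

print(jaccard_distance(genotype_1, genotype_2))
```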

Have a nice day!

Arthur Tavares de Oliveira Melo

I absolutely agree with Filippo. Gower's coefficient seems to be better.