Usually, genotype codes like AA, AG, AA, AA, AT, AA, GG, GG, AA, GG, AA, AA are recoded to 0, 1, 0, 0, 1, 0, 2, 2, 0, 2, 0, 0 because A is the major allele and G is the minor allele. What if I have genotype codes like T, T, T, W, N, A, T, T, W?
I have 18,000 SNPs from 555 samples. I have only checked the proportion of A/T/G/C versus non-A/T/G/C calls for the first and second SNPs. Almost half of the calls are not A/T/G/C.
They look like IUPAC ambiguity codes (Wikipedia defines them all quite nicely: https://en.wikipedia.org/wiki/Nucleic_acid_notation). They're not the correct notation for genotypes, so whoever made the file needs a talking to.
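To make the IUPAC reading concrete, here is a minimal Python sketch that expands single-letter codes into two-letter genotypes. It rests on an assumption that should be confirmed with whoever produced the file: that each letter encodes one diploid call, so a plain base such as T means TT and a two-base ambiguity code such as W means a heterozygote (A/T). The names `IUPAC_TO_GENOTYPE` and `expand_iupac` are made up for this illustration, and N and the three-base codes are treated as missing.

```python
# Standard IUPAC two-base ambiguity codes. Reading a single base as a
# homozygous call is an ASSUMPTION about how this particular file was written.
IUPAC_TO_GENOTYPE = {
    "A": "AA", "C": "CC", "G": "GG", "T": "TT",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

def expand_iupac(code):
    """Return a two-letter genotype, or None for N and other unresolved codes."""
    return IUPAC_TO_GENOTYPE.get(code.upper())

calls = ["T", "T", "T", "W", "N", "A", "T", "T", "W"]
genotypes = [expand_iupac(c) for c in calls]
print(genotypes)
```

If that assumption holds, the example SNP from the question would be a T/A site with one missing call.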
Also worth noting that 0, 1, and 2 don't refer to minor vs. major, but to reference vs. alternative. 0 is the reference allele, and since the reference is whatever was in the sequence of the individual used as the reference, it can be the minor allele. 1 is the first alternative, usually the most frequent of all the alternatives, and 2 is the second alternative.
Yes, my data consist of IUPAC nucleotide symbols. I think I need to contact whoever made the file.
Thank you very much for your explanation of homozygous reference, heterozygous, and homozygous variant, emily. I do not really understand them. Usually I work with data that have no reference, and I use the recodeSNPs function from the scrime package in R to recode the data. The homozygous reference is determined from the most frequent homozygous genotype. Unfortunately, this function does not work for my data.
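The rule described here (the most frequent homozygous genotype becomes 0) can be sketched in a few lines of Python. This is not the scrime implementation, only an illustration of the idea. The function name `recode_by_major_homozygote` is hypothetical, and it assumes the genotypes are already two-letter strings, with `None` for missing calls. Any allele other than the major one counts toward the code.

```python
from collections import Counter

def recode_by_major_homozygote(genotypes):
    """Recode two-letter genotypes to 0/1/2; None marks missing calls."""
    # Count homozygous genotypes only (both letters identical).
    homozygous = Counter(
        g for g in genotypes if g and len(g) == 2 and g[0] == g[1]
    )
    if not homozygous:
        return [None] * len(genotypes)
    # Allele carried by the most frequent homozygous genotype.
    major = homozygous.most_common(1)[0][0][0]
    coded = []
    for g in genotypes:
        if not g or len(g) != 2:
            coded.append(None)
        else:
            # Number of alleles in this call that differ from the major allele.
            coded.append(sum(allele != major for allele in g))
    return coded

example = ["AA", "AG", "AA", "AA", "AT", "AA", "GG", "GG", "AA", "GG", "AA", "AA"]
recoded = recode_by_major_homozygote(example)
print(recoded)

expanded = ["TT", "TT", "TT", "AT", None, "AA", "TT", "TT", "AT"]
recoded_expanded = recode_by_major_homozygote(expanded)
print(recoded_expanded)
```

The first call reproduces the coding from the original question. The second shows what the T, T, T, W, N, A, T, T, W example would give if W is read as an A/T heterozygote and N as missing.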