I have biallelic SNP data for a gene and would like to analyse the LD using Haploview, but I have absolutely no idea about the input: the ped/linkage formats and the block or legend files. Can anyone help?
Hello Raunaq. I suggest SNPstats. It is a very easy online tool for analysing SNPs (LD and association tests). Here is the link: http://bioinfo.iconcologia.net/snpstats/start.htm?
For this kind of analysis PLINK works very well, and it is also very simple to use (http://pngu.mgh.harvard.edu/~purcell/plink/anal.shtml#adjust). Look at the section on LD-based SNP pruning:
"Sometimes it is useful to generate a pruned subset of SNPs that are in approximate linkage equilibrium with each other. This can be achieved via two commands: --indep which prunes based on the variance inflation factor (VIF), which recursively removes SNPs within a sliding window; second, --indep-pairwise which is similar, except it is based only on pairwise genotypic correlation.
Hint: The output of either of these commands is two lists of SNPs: those that are pruned out and those that are not. A separate command using the --extract or --exclude option is necessary to actually perform the pruning.
The VIF pruning routine is performed:
plink --file data --indep 50 5 2
will create files
plink.prune.in
plink.prune.out
Each is a simple list of SNP IDs; both these files can subsequently be specified as the argument for a --extract or --exclude command.
The parameters for --indep are: the window size in SNPs (e.g. 50), the number of SNPs to shift the window at each step (e.g. 5), and the VIF threshold (e.g. 2). The VIF is 1/(1-R^2), where R^2 is the multiple correlation coefficient for a SNP being regressed on all other SNPs simultaneously. That is, this considers the correlations between SNPs but also between linear combinations of SNPs. A VIF of 10 is often taken to represent near collinearity problems in standard multiple regression analyses (i.e. it implies an R^2 of 0.9). A VIF of 1 would imply that the SNP is completely independent of all other SNPs. Practically, values between 1.5 and 2 should probably be used; particularly in small samples, if this threshold is too low and/or the window size is too large, too many SNPs may be removed.
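To get a feel for what a given VIF threshold means, the relation VIF = 1/(1-R^2) can be inverted. The snippet below is only a small illustration of that arithmetic; the function names are my own and are not part of PLINK.

```python
def r2_from_vif(vif: float) -> float:
    """Multiple R^2 implied by a VIF, from VIF = 1 / (1 - R^2)."""
    if vif < 1:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


def vif_from_r2(r2: float) -> float:
    """VIF implied by a multiple R^2 (R^2 must be in [0, 1))."""
    if not 0 <= r2 < 1:
        raise ValueError("R^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


# Thresholds mentioned in the quoted text above.
for vif in (1, 1.5, 2, 10):
    print(f"VIF {vif:>4} -> R^2 {r2_from_vif(vif):.3f}")
```

So the suggested thresholds of 1.5 to 2 correspond to a multiple R^2 of roughly 0.33 to 0.5, which is far stricter than the VIF of 10 (R^2 of 0.9) used as a rule of thumb in regression.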
The second procedure is performed:
plink --file data --indep-pairwise 50 5 0.5
This generates the same output files as the first version; the only difference is that a simple pairwise threshold is used. The first two parameters (50 and 5) are the same as above (window size and step); the third parameter represents the r^2 threshold. Note: this represents the pairwise SNP-SNP metric now, not the multiple correlation coefficient; also note, this is based on the genotypic correlation, i.e. it does not involve phasing.
To give a concrete example: the command above that specifies 50 5 0.5 would a) consider a window of 50 SNPs, b) calculate LD between each pair of SNPs in the window, c) remove one of a pair of SNPs if the LD is greater than 0.5, d) shift the window 5 SNPs forward and repeat the procedure.
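If it helps, the window procedure described above can be sketched in a few lines of Python. This is my own simplified illustration, not PLINK's actual implementation: it assumes genotypes are coded as 0/1/2 allele counts with no missing values, and it arbitrarily drops the later SNP of a correlated pair (PLINK's own rule for which SNP to drop may differ). The names `r_squared` and `prune_pairwise` are made up for this example.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: the correlation is undefined, so treat it as 0 here.
        return 0.0
    return sxy * sxy / (sxx * syy)


def prune_pairwise(genotypes, window=50, step=5, threshold=0.5):
    """Return indices of SNPs kept after sliding-window pairwise pruning.

    genotypes: one list of allele counts per SNP, in map order.
    """
    removed = set()
    n = len(genotypes)
    start = 0
    while start < n:
        # SNPs in the current window that have not been pruned yet.
        idx = [i for i in range(start, min(start + window, n)) if i not in removed]
        for i, j in combinations(idx, 2):
            if i in removed or j in removed:
                continue
            if r_squared(genotypes[i], genotypes[j]) > threshold:
                removed.add(j)  # illustrative choice: drop the later SNP
        start += step  # shift the window forward and repeat
    return [i for i in range(n) if i not in removed]


# Toy data: SNP 1 is a copy of SNP 0 (r^2 = 1); SNP 2 has r^2 = 0.25 with SNP 0.
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 0, 2, 1],
]
kept = prune_pairwise(snps, window=50, step=5, threshold=0.5)
print(kept)  # [0, 2]
```

The kept indices play the role of plink.prune.in and the dropped ones the role of plink.prune.out.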
To make a new, pruned file, then use something like (in this example, we also convert the standard PED fileset to a binary one):
plink --file data --extract plink.prune.in --make-bed --out pruneddata
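Coming back to the original question about input formats: with --file data, PLINK looks for data.ped and data.map. Each line of the PED file is one individual: family ID, individual ID, paternal ID, maternal ID, sex (1 = male, 2 = female), phenotype (1 = unaffected, 2 = affected, 0 or -9 = missing), followed by two alleles per SNP (0 for a missing allele). Each line of the MAP file is one SNP: chromosome, SNP ID, genetic distance, base-pair position. Below is a minimal sketch that writes such a pair of files. The helper `write_ped_map`, the SNP IDs, positions, and genotypes are all made up for illustration. As far as I know, Haploview's linkage format uses the same six leading columns plus genotypes, together with a separate marker info file (marker name and position), but please check the Haploview manual for the details.

```python
import os
import tempfile


def write_ped_map(prefix, markers, samples):
    """Write <prefix>.map and <prefix>.ped in the standard PLINK text layout.

    markers: list of (chromosome, snp_id, genetic_distance, bp_position)
    samples: list of (family_id, individual_id, father_id, mother_id, sex,
             phenotype, genotypes); genotypes holds one (allele1, allele2)
             pair per marker, in marker order, with "0" for a missing allele.
    """
    with open(prefix + ".map", "w") as fh:
        for chrom, snp_id, cm, bp in markers:
            fh.write(f"{chrom} {snp_id} {cm} {bp}\n")
    with open(prefix + ".ped", "w") as fh:
        for fam, ind, father, mother, sex, pheno, genotypes in samples:
            if len(genotypes) != len(markers):
                raise ValueError(f"{ind}: expected {len(markers)} genotypes")
            alleles = " ".join(f"{a} {b}" for a, b in genotypes)
            fh.write(f"{fam} {ind} {father} {mother} {sex} {pheno} {alleles}\n")


# Hypothetical example data: three SNPs in one gene, three unrelated individuals.
markers = [
    (1, "snp_a", 0, 1000),
    (1, "snp_b", 0, 1500),
    (1, "snp_c", 0, 2200),
]
samples = [
    ("F1", "I1", 0, 0, 1, 2, [("A", "A"), ("G", "T"), ("C", "C")]),
    ("F2", "I2", 0, 0, 2, 1, [("A", "G"), ("T", "T"), ("C", "T")]),
    ("F3", "I3", 0, 0, 1, 1, [("G", "G"), ("G", "T"), ("0", "0")]),
]

# Written to a temporary directory so the example does not overwrite real data.
out_dir = tempfile.mkdtemp()
prefix = os.path.join(out_dir, "data")
write_ped_map(prefix, markers, samples)
with open(prefix + ".ped") as fh:
    print(fh.read())
```

Running the plink commands above from the directory that holds these two files would then pick them up through --file data.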