This post is going to be a little long, so I apologize in advance for bothering you. I learned that quality control is super important so I want to make sure that I do this right.
I have several questions for my QC pipeline but hopefully nothing complex. I am just having some hard time to find the right commands. and when I do I am not sure of the inputs. Note that I am using plink v1 (because all the examples I found use plink v1) so if you can give me the v2 version that would be helpful but it is not my main concern. Note also that I am using data from UK biobank so every chromosome is in separate files (genotyped: .bed .bim . fam / imputed: .bgen .mfi .sample)
My pipeline is based on 2 parts :
1- per individual filtering
a. Removal of individuals with excess missing genotypes b. Removal of individuals with outlying homozygosity values c. Removal of samples showing a discordant sex d. Removal of related or duplicate samples e. Removal of ancestry outliers
2 - per sample filtering
a. Removal of SNPs with excess missing genotypes b. Removal of SNPs that deviate from Hardy-Weinberg equilibrium 3. Removal of SNPs with low minor allele frequency c. Comparing minor allele frequency to known values
My code:
1.a: ./plink --bfile ukb_cal_chr{}_v2_reduced --missing --out output1 comment: use R to remove individuals with missingness >.05
1.b: ./plink --bfile ukb_cal_chr{]_v2_reduced --het --out output2 comment: use R to remove individuals with the absolute value of F >.05
1.c: ./plink --bfile ukb_cal_chr{}_v2_reduced --check-sex --out output3 comment: not sure what the input is ? then in the output remove individuals with status =problem
1.d: ./plink --bfile ukb_cal_chr{}_v2_reduced --genome --min 0.05 --out output4 comment: using R in the output output4.genome, for every pair remove the one with the lowest genotyping rate (unless there is a command for that in plink ) (is that right?)
!!! However, I found that --genome takes too much time, is there another way?
1.e: ..... comment: I found this command :
plink --file data --cluster --neighbour 1 5
comment: but I am not sure what it did and how to use the output to filter the individuals and what the input file is (file or bfile)
2 - a,b,c : ./plink --bfile input --maf 0.01 --hwe 1e-6 --mind .1 --geno .1 --make-bed --out output
That's it for my pipeline. my main questions are related to the red parts, so just 3 questions. Also, if you found errors in my pipeline can you please correct me?
In conclusion here are my 3 questions:
- since I have one file for each chromosome, is the input of the command 1.c , the chromosome X?
- the command -- genome takes a lot of time, is there a way to speed it up or to estimate the relatedness of individuals in another way?
- I am still not sure how to filter ancestry outlier using pca?
Can you please help me? thank you