This post is going to be a little long, so I apologize in advance for bothering you. I learned that quality control is super important so I want to make sure that I do this right.

I have several questions for my QC pipeline but hopefully nothing complex. I am just having some hard time to find the right commands. and when I do I am not sure of the inputs.  Note that I am using plink v1 (because all the examples I found use plink v1) so if you can give me the v2 version that would be helpful but it is not my main concern. Note also that I am using data from UK biobank so every chromosome is in  separate files (genotyped: .bed .bim . fam  / imputed: .bgen .mfi .sample)

My pipeline is based on 2 parts :

1- per individual filtering

      a. Removal of individuals with excess missing genotypes       b. Removal of individuals with outlying homozygosity values       c. Removal of samples showing a discordant sex       d. Removal of related or duplicate samples       e. Removal of ancestry outliers

2 - per sample filtering

      a. Removal of SNPs with excess missing genotypes       b. Removal of SNPs that deviate from Hardy-Weinberg equilibrium 3. Removal of SNPs with low minor allele frequency       c. Comparing minor allele frequency to known values

My code:

1.a: ./plink --bfile ukb_cal_chr{}_v2_reduced --missing     --out  output1 comment: use R to remove individuals with missingness >.05

1.b: ./plink --bfile ukb_cal_chr{]_v2_reduced --het             --out output2   comment: use R to remove individuals with the absolute value of F >.05

1.c: ./plink --bfile ukb_cal_chr{}_v2_reduced --check-sex  --out output3   comment: not sure what the input is ? then in the output remove individuals with status =problem

1.d: ./plink --bfile ukb_cal_chr{}_v2_reduced --genome --min 0.05 --out output4 comment: using R in the output  output4.genome, for every pair remove the one with the lowest genotyping rate (unless there is a command for that  in plink ) (is that right?)

!!! However, I found that --genome takes too much time, is there another way?

1.e: ..... comment:   I found this command : 

plink --file data --cluster --neighbour 1 5

comment: but I am not sure what it did and how to use the output to filter the individuals and what  the input file is (file or bfile)

2 - a,b,c : ./plink  --bfile input  --maf 0.01 --hwe 1e-6 --mind .1 --geno .1 --make-bed --out output

That's it for my pipeline. my main questions are related to the red parts, so just 3 questions. Also, if you found errors in my pipeline can you please correct me?

In conclusion here are my 3 questions:

- since I have one file for each chromosome, is the input of the command 1.c , the chromosome X?

- the command -- genome takes a lot of time, is there a way to speed it up or to estimate the relatedness of individuals in another way?

- I am still not sure how to filter ancestry outlier using pca?

Can you please help me? thank you

More Emile Dimas's questions See All
Similar questions and discussions