Question: Filtering Imputation output: how to filter on a VCF INFO field
2
gravatar for mmagoo6
2.5 years ago by
mmagoo650
mmagoo650 wrote:

I used the Michigan Imputation Server to impute data that I have, and got three files per chromosome as output: .dose.vcf.gz, .dose.vcf.gz.tbi, and .info.gz. I only want to keep genotypes that are imputed with an R2 that is greater than 0.3. The .dose.vcf.gz has everything (I think) that I need to create a plink output file with only the high quality genotypes, but I am having a hard time getting vcftools to understand what I am asking it to do.

vcftools --vcf chr1.dose.vcf.gz --filter MinR2=.30 --plink --out plink_chr1
Error: Unknown option: --filter

Can someone please help me figure out how to filter on the value of an INFO field?

I have looked at the vcftools documentation and examples, and haven't yet figured out how to filter on INFO. The info filtering mentioned here only filters if the field exists at all, not for particular values of it. Any suggestions would be appreciated. is VCFtools not the correct tool for this?

Here is the header of the .dose.vcf.gz:

##fileformat=VCFv4.1
##filedate=2016.8.6
##source=Minimac3
##contig=<ID=1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">
##INFO=<ID=MAF,Number=1,Type=Float,Description="Estimated Minor Allele Frequency">
##INFO=<ID=R2,Number=1,Type=Float,Description="Estimated Imputation Accuracy">
##INFO=<ID=ER2,Number=1,Type=Float,Description="Empirical (Leave-One-Out) R-square (available only for genotyped variants)">
ADD COMMENTlink modified 6 months ago by natalia.han0 • written 2.5 years ago by mmagoo650

Filtering Vcf File

ADD REPLYlink written 2.5 years ago by microfuge980
3
gravatar for mmagoo6
2.5 years ago by
mmagoo650
mmagoo650 wrote:

OK, so I figured it out with plink. I removed all multi-allelic variants, and variants with the same name (which might not be ideal for everyone), but in short, I went from the michigan server output to a single plink file of everything with R2>0.3 with the help of this hack from Pontikos' github in this way:

#sign in to the michigan server on chrome, go to the "job" section and pick the job of interest
#Go to the "results" tab
#In the chrome development "console" type  copy(document.body.innerHTML);
#in nano paste that information into download_page.html
#paste the one-time-use password into password.txt
python extract.py >urls.txt
for x in `cat urls.txt`; do wget $x; done;
password=`cat password.txt`
for x in *.zip
do
    echo unzip -P $password -o $x
    unzip -P $password -o $x
done
for chnum in {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22};
  do
  gunzip chr$chnum.info.gz
  plink --vcf chr$chnum.dose.vcf.gz --make-bed --out s1
  plink --bfile s1 --bmerge s1 --merge-mode 6
  plink --bfile s1 --exclude plink.missnp --make-bed --out s2
  plink --bfile s2 --list-duplicate-vars
  plink --bfile s2 --exclude plink.dupvar --make-bed --out s3
  plink --bfile s3 --qual-scores chr$chnum.info 7 1 1 --qual-threshold 0.3 --make-bed --out ../plinkout/chr$chnum
done
#create file merge.list that contains the directory of each chromosome
plink --bfile ../plinkout/1 merge-list merge.list --make-bed --out imputepostqc
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by mmagoo650

its also possible to filter the vcf.gz file directly using bcftools filter -e

ADD REPLYlink written 2.5 years ago by vakul.mohanty240

Bcftools' filter removes sites, not individual genotypes.

ADD REPLYlink written 2.2 years ago by luisrmacias0

Do you have any recommendations for the filtering value of GP

##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1).

I use no 0.85 however not sure how scientific that value is... Any ideas or suggestions?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Floris Brenk870

There is an error in the --qual-threshold step. Error: Duplicate variant '1:871334' in --qual-scores file

ADD REPLYlink written 9 months ago by kvshamsudheen0

Hi, thank you for your post. It helps me a lot with filtering dose.vcf.gz and converting them to plink format. However, I found the filtered files have far less number of rows than the files filtered by:

bcftools view -i 'R2>.8 & MAF>.05' -Oz chrnum.dose.vcf.gz > chrnum.filtered.vcf.gz

And I think converting the dose.vcf.gz to plink in this way leads to the lost the dosage information. Maybe need to use software like DosageConvertor to convert dose.vcf.gz to plink dosage files.

ADD REPLYlink modified 6 months ago • written 6 months ago by Sunshine n Rain10
1
gravatar for fengjunjiao54
2.4 years ago by
fengjunjiao5410 wrote:

Thank you for your sharing. I have try Pontikos' github you pasted. however, I wander what is the extract.py ,is there another python script? for I get the error: python extract.py >urls.txt syntaxEror : invalid syntax. any suggestions will be helpful for me .

Thank you !

ADD COMMENTlink written 2.4 years ago by fengjunjiao5410

The extract.py seems like a script to extract the download urls. You can ignore that line and take what else you need.

ADD REPLYlink written 2.2 years ago by Tao270
0
gravatar for Tao
2.2 years ago by
Tao270
Tao270 wrote:

Hi, This post is very helpful for me as I am also looking for protocols for QC imputation results. Besides the QC steps you mentioned above, I wonder if you add other QC steps? If yes, please kindly tell me. Thanks!

Tao

ADD COMMENTlink written 2.2 years ago by Tao270
0
gravatar for shraddha.pai
14 months ago by
shraddha.pai0 wrote:

Pontiko's code snippet is very useful. In the gist above, what is this line achieving? Why would you want to merge a file with itself? plink --bfile s1 --bmerge s1 --merge-mode 6

ADD COMMENTlink written 14 months ago by shraddha.pai0

That is for detecting non-biallelic variants. For example, one SNP might has 3 alleles: A -> C, A -> G. And the this step will report such SNPs as mis-match SNPs.

ADD REPLYlink written 14 months ago by Tao270
0
gravatar for natalia.han
6 months ago by
natalia.han0 wrote:

I believe you can also do this in order to only get biallelic variants:

  plink --vcf chr$chnum.dose.vcf.gz --make-bed --double-id --biallelic-only  --out s1
ADD COMMENTlink written 6 months ago by natalia.han0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 691 users visited in the last hour