Hi everyone, I have vcf file in dosage format like 0/0, 0/1 and etc. After filtering for quality, I need to impute the missing value using BEAGLE software so that my variants can be in float numeric format in order to combine with phenotypic data for further analysis. Apart from BEAGLE, if there is any other software that is much easier to use, I would not mind. Could someone help me with script and guide me through imputation, please.
Which imputation program are you going to use?
Beagle can accept VCF format. Take a look here:
I would recommend 'cleaning' your VCF first, such as:
- splitting multi-allelic calls
- left-aligning indels
- removing 'junk' variants that have a high statistical probability of being false-positives via the QUAL score, which is negative log 10 Phred-scaled quality.
Thanks Kevin. I had succeeded in using beagle for imputation. The command goes like
java -Xmx24g -jar /programs/beagle41/beagle41.jar gt=/workdir/moshood/snps/snp_filtered.vcf.gz.recode.vcf nthreads=8 niterations=0 out=imputed_file
Hope I am doing the right thing without using ref option. Having imputed, I intend to have my return outfile as dosage format for further analysis. How do I go about this?
After imputation using beagle 4.1, I have output with genotypes phased like 0|0, 0|1, etc as follows:
##fileformat=VCFv4.2
##filedate=20181226
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TMS14F1302P0010:250300956 TMS14F1304P0015:250301032 TMS14F1295P0012:250300648 TMS14F1303P0016:250300982 TMS14F1308P0010:250301113 TMS14F1311P0006:250301205 TMS14F1309P0016:250301170 TMS14F1298P0016:250300718 TMS14F1297P0007:250300688 TMS14F1296P0009:250300665 TMS14F1307P0010:250301093
10 9135 S10_9135 A G . PASS . GT 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10 17928 S10_17928 A G . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10 20405 S10_20405 T C . PASS . GT 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10 20411 S10_20411 G T . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 20435 S10_20435 A T . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 22620 S10_22620 T C . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 22621 S10_22621 A C . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 27736 S10_27736 G A . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 27741 S10_27741 G C . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 33495 S10_33495 G T . PASS . GT 1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10 33512 S10_33512 G A . PASS . GT 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10 46535 S10_46535 G A . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 57494 S10_57494 C T . PASS . GT 0|0 1|1 1|0 0|0 0|0 0|1 1|0 0|0 0|1 0|0 0|1
10 57542 S10_57542 T A . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 57587 S10_57587 A G . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 57608 S10_57608 T C . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 100860 S10_100860 A T . PASS . GT 1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10 100881 S10_100881 C T . PASS . GT 0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10 117841 S10_117841 T G . PASS . GT 1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
My question is how do I have the genotypes recoded as 0, 1, 2 for further analysis in R. I am planning to merge it with phenotype data for genomic prediction.
Can you try this: A: Convert VCF dosage file into dosage only file
Thanks Kevin. Your response has been very helpful! Prior to imputation, I cleaned my VCF file using
vcftools
--gzvcf snps.vcf.gz --maf 0.01 --max-missing 0.8 --min-meanDP 10 --recode --out snp_filtered.vcf.gz
Please how do I filter 'junk' variants that have a high statistical probability of being false-positives via QUAL score. kindly give me an example of how to modify my above vcftools command to remove such 'junk' variants. Thanks for the support.
Could you tell me your email. I want to ask some questions about Beagle. Thank you.
Perhaps it is better to ask the questions here (?). You can open a new question, if you want (?)