Imputation of vcf file
1
2
Entering edit mode
3.4 years ago
mab658 ▴ 90

Hi everyone, I have vcf file in dosage format like 0/0, 0/1 and etc. After filtering for quality, I need to impute the missing value using BEAGLE software so that my variants can be in float numeric format in order to combine with phenotypic data for further analysis. Apart from BEAGLE, if there is any other software that is much easier to use, I would not mind. Could someone help me with script and guide me through imputation, please.

SNP • 4.6k views
0
Entering edit mode

0
Entering edit mode

Perhaps it is better to ask the questions here (?). You can open a new question, if you want (?)

1
Entering edit mode
3.4 years ago

Which imputation program are you going to use?

0
Entering edit mode

I intend to use BEAGLE for imputation. Thanks

0
Entering edit mode

Beagle can accept VCF format. Take a look here:

I would recommend 'cleaning' your VCF first, such as:

• splitting multi-allelic calls
• left-aligning indels
• removing 'junk' variants that have a high statistical probability of being false-positives via the QUAL score, which is negative log 10 Phred-scaled quality.
0
Entering edit mode

Thanks Kevin. I had succeeded in using beagle for imputation. The command goes like

java -Xmx24g -jar /programs/beagle41/beagle41.jar gt=/workdir/moshood/snps/snp_filtered.vcf.gz.recode.vcf nthreads=8 niterations=0  out=imputed_file


Hope I am doing the right thing without using ref option. Having imputed, I intend to have my return outfile as dosage format for further analysis. How do I go about this?

0
Entering edit mode

How does the output look currently?

0
Entering edit mode

After imputation using beagle 4.1, I have output with genotypes phased like 0|0, 0|1, etc as follows:

##fileformat=VCFv4.2
##filedate=20181226
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TMS14F1302P0010:250300956   TMS14F1304P0015:250301032   TMS14F1295P0012:250300648   TMS14F1303P0016:250300982   TMS14F1308P0010:250301113   TMS14F1311P0006:250301205   TMS14F1309P0016:250301170   TMS14F1298P0016:250300718   TMS14F1297P0007:250300688   TMS14F1296P0009:250300665   TMS14F1307P0010:250301093
10  9135    S10_9135    A   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  17928   S10_17928   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20405   S10_20405   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20411   S10_20411   G   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  20435   S10_20435   A   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22620   S10_22620   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22621   S10_22621   A   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27736   S10_27736   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27741   S10_27741   G   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  33495   S10_33495   G   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  33512   S10_33512   G   A   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  46535   S10_46535   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57494   S10_57494   C   T   .   PASS    .   GT  0|0 1|1 1|0 0|0 0|0 0|1 1|0 0|0 0|1 0|0 0|1
10  57542   S10_57542   T   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57587   S10_57587   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57608   S10_57608   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  100860  S10_100860  A   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  100881  S10_100881  C   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  117841  S10_117841  T   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0


My question is how do I have the genotypes recoded as 0, 1, 2 for further analysis in R. I am planning to merge it with phenotype data for genomic prediction.

0
Entering edit mode
0
Entering edit mode

plink --vcf file.vcf --recode A --out file_matrix

0
Entering edit mode

Thanks Kevin. Your response has been very helpful! Prior to imputation, I cleaned my VCF file using

vcftools --gzvcf snps.vcf.gz --maf 0.01 --max-missing 0.8 --min-meanDP 10 --recode --out snp_filtered.vcf.gz

Please how do I filter 'junk' variants that have a high statistical probability of being false-positives via QUAL score. kindly give me an example of how to modify my above vcftools command to remove such 'junk' variants. Thanks for the support.

0
Entering edit mode

You could just filter out those with QUAL < 30 (equating to 1 in 1000 chance of being incorrect). Depending on the pipeline that you used to produce the input VCF, these may already be marked as low quality in the FILTER column. Variants called at low read-depths are also questionable.