Question

Imputation of vcf file

4

Entering edit mode

6.6 years ago

mab658 ▴ 120

Hi everyone, I have vcf file in dosage format like 0/0, 0/1 and etc. After filtering for quality, I need to impute the missing value using BEAGLE software so that my variants can be in float numeric format in order to combine with phenotypic data for further analysis. Apart from BEAGLE, if there is any other software that is much easier to use, I would not mind. Could someone help me with script and guide me through imputation, please.

SNP • 9.2k views

ADD COMMENT • link updated 3.1 years ago by elielsonveloso • 0 • written 6.6 years ago by mab658 ▴ 120

0

Entering edit mode

Could you tell me your email. I want to ask some questions about Beagle. Thank you.

ADD REPLY • link 6.0 years ago by zhangying • 0

0

Entering edit mode

Perhaps it is better to ask the questions here (?). You can open a new question, if you want (?)

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Do you knoe if it is possible to run Beagle on a merged vcf file containing genotypic information from 40 patients and about ~500 SNPs?

ADD REPLY • link 3.1 years ago by elielsonveloso • 0

Kevin Blighe · Accepted Answer · 2018-12-21

1

Entering edit mode

6.6 years ago

Kevin Blighe 89k

Which imputation program are you going to use?

ADD COMMENT • link 6.6 years ago by Kevin Blighe 89k

0

Entering edit mode

I intend to use BEAGLE for imputation. Thanks

ADD REPLY • link 6.6 years ago by mab658 ▴ 120

0

Entering edit mode

Beagle can accept VCF format. Take a look here:

I would recommend 'cleaning' your VCF first, such as:

splitting multi-allelic calls
left-aligning indels
removing 'junk' variants that have a high statistical probability of being false-positives via the QUAL score, which is negative log 10 Phred-scaled quality.

ADD REPLY • link 6.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks Kevin. I had succeeded in using beagle for imputation. The command goes like

java -Xmx24g -jar /programs/beagle41/beagle41.jar gt=/workdir/moshood/snps/snp_filtered.vcf.gz.recode.vcf nthreads=8 niterations=0  out=imputed_file

Hope I am doing the right thing without using ref option. Having imputed, I intend to have my return outfile as dosage format for further analysis. How do I go about this?

ADD REPLY • link updated 6.5 years ago by Kevin Blighe 89k • written 6.5 years ago by mab658 ▴ 120

0

Entering edit mode

How does the output look currently?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

After imputation using beagle 4.1, I have output with genotypes phased like 0|0, 0|1, etc as follows:

##fileformat=VCFv4.2
##filedate=20181226
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TMS14F1302P0010:250300956   TMS14F1304P0015:250301032   TMS14F1295P0012:250300648   TMS14F1303P0016:250300982   TMS14F1308P0010:250301113   TMS14F1311P0006:250301205   TMS14F1309P0016:250301170   TMS14F1298P0016:250300718   TMS14F1297P0007:250300688   TMS14F1296P0009:250300665   TMS14F1307P0010:250301093
10  9135    S10_9135    A   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  17928   S10_17928   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20405   S10_20405   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20411   S10_20411   G   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  20435   S10_20435   A   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22620   S10_22620   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22621   S10_22621   A   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27736   S10_27736   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27741   S10_27741   G   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  33495   S10_33495   G   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  33512   S10_33512   G   A   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  46535   S10_46535   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57494   S10_57494   C   T   .   PASS    .   GT  0|0 1|1 1|0 0|0 0|0 0|1 1|0 0|0 0|1 0|0 0|1
10  57542   S10_57542   T   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57587   S10_57587   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57608   S10_57608   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  100860  S10_100860  A   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  100881  S10_100881  C   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  117841  S10_117841  T   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0

My question is how do I have the genotypes recoded as 0, 1, 2 for further analysis in R. I am planning to merge it with phenotype data for genomic prediction.

ADD REPLY • link updated 6.5 years ago by Kevin Blighe 89k • written 6.5 years ago by mab658 ▴ 120

0

Entering edit mode

Can you try this: A: Convert VCF dosage file into dosage only file

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

plink --vcf file.vcf --recode A --out file_matrix

ADD REPLY • link 5.8 years ago by Wan Shi Tong ▴ 70

0

Entering edit mode

Thanks Kevin. Your response has been very helpful! Prior to imputation, I cleaned my VCF file using

vcftools --gzvcf snps.vcf.gz --maf 0.01 --max-missing 0.8 --min-meanDP 10 --recode --out snp_filtered.vcf.gz

Please how do I filter 'junk' variants that have a high statistical probability of being false-positives via QUAL score. kindly give me an example of how to modify my above vcftools command to remove such 'junk' variants. Thanks for the support.

ADD REPLY • link 6.5 years ago by mab658 ▴ 120

0

Entering edit mode

You could just filter out those with QUAL < 30 (equating to 1 in 1000 chance of being incorrect). Depending on the pipeline that you used to produce the input VCF, these may already be marked as low quality in the FILTER column. Variants called at low read-depths are also questionable.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k