Question: Imputation of vcf file
0
gravatar for mab658
8 months ago by
mab65830
mab65830 wrote:

Hi everyone, I have vcf file in dosage format like 0/0, 0/1 and etc. After filtering for quality, I need to impute the missing value using BEAGLE software so that my variants can be in float numeric format in order to combine with phenotypic data for further analysis. Apart from BEAGLE, if there is any other software that is much easier to use, I would not mind. Could someone help me with script and guide me through imputation, please.

snp • 817 views
ADD COMMENTlink modified 5 weeks ago by zhangying0 • written 8 months ago by mab65830

Could you tell me your email. I want to ask some questions about Beagle. Thank you.

ADD REPLYlink written 5 weeks ago by zhangying0

Perhaps it is better to ask the questions here (?). You can open a new question, if you want (?)

ADD REPLYlink written 5 weeks ago by Kevin Blighe47k
1
gravatar for Kevin Blighe
8 months ago by
Kevin Blighe47k
Kevin Blighe47k wrote:

Which imputation program are you going to use?

ADD COMMENTlink written 8 months ago by Kevin Blighe47k

I intend to use BEAGLE for imputation. Thanks

ADD REPLYlink written 8 months ago by mab65830

Beagle can accept VCF format. Take a look here:

I would recommend 'cleaning' your VCF first, such as:

  • splitting multi-allelic calls
  • left-aligning indels
  • removing 'junk' variants that have a high statistical probability of being false-positives via the QUAL score, which is negative log 10 Phred-scaled quality.
ADD REPLYlink written 8 months ago by Kevin Blighe47k

Thanks Kevin. I had succeeded in using beagle for imputation. The command goes like

java -Xmx24g -jar /programs/beagle41/beagle41.jar gt=/workdir/moshood/snps/snp_filtered.vcf.gz.recode.vcf nthreads=8 niterations=0  out=imputed_file

Hope I am doing the right thing without using ref option. Having imputed, I intend to have my return outfile as dosage format for further analysis. How do I go about this?

ADD REPLYlink modified 7 months ago by Kevin Blighe47k • written 7 months ago by mab65830

How does the output look currently?

ADD REPLYlink written 7 months ago by Kevin Blighe47k

After imputation using beagle 4.1, I have output with genotypes phased like 0|0, 0|1, etc as follows:

##fileformat=VCFv4.2
##filedate=20181226
##source="beagle.27Jul16.86a.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TMS14F1302P0010:250300956   TMS14F1304P0015:250301032   TMS14F1295P0012:250300648   TMS14F1303P0016:250300982   TMS14F1308P0010:250301113   TMS14F1311P0006:250301205   TMS14F1309P0016:250301170   TMS14F1298P0016:250300718   TMS14F1297P0007:250300688   TMS14F1296P0009:250300665   TMS14F1307P0010:250301093
10  9135    S10_9135    A   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  17928   S10_17928   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20405   S10_20405   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0 0|0
10  20411   S10_20411   G   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  20435   S10_20435   A   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22620   S10_22620   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  22621   S10_22621   A   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27736   S10_27736   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  27741   S10_27741   G   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  33495   S10_33495   G   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  33512   S10_33512   G   A   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0
10  46535   S10_46535   G   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57494   S10_57494   C   T   .   PASS    .   GT  0|0 1|1 1|0 0|0 0|0 0|1 1|0 0|0 0|1 0|0 0|1
10  57542   S10_57542   T   A   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57587   S10_57587   A   G   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  57608   S10_57608   T   C   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  100860  S10_100860  A   T   .   PASS    .   GT  1|0 1|1 1|0 0|1 0|0 1|1 1|0 0|1 0|1 1|0 0|1
10  100881  S10_100881  C   T   .   PASS    .   GT  0|0 0|0 0|0 0|1 0|0 1|0 0|0 0|0 0|0 0|0 0|0
10  117841  S10_117841  T   G   .   PASS    .   GT  1|0 0|0 0|0 0|0 0|0 0|0 0|0 0|1 0|0 0|0 0|0

My question is how do I have the genotypes recoded as 0, 1, 2 for further analysis in R. I am planning to merge it with phenotype data for genomic prediction.

ADD REPLYlink modified 7 months ago by Kevin Blighe47k • written 7 months ago by mab65830

Can you try this: A: Convert VCF dosage file into dosage only file

ADD REPLYlink written 7 months ago by Kevin Blighe47k

Thanks Kevin. Your response has been very helpful! Prior to imputation, I cleaned my VCF file using

vcftools --gzvcf snps.vcf.gz --maf 0.01 --max-missing 0.8 --min-meanDP 10 --recode --out snp_filtered.vcf.gz

Please how do I filter 'junk' variants that have a high statistical probability of being false-positives via QUAL score. kindly give me an example of how to modify my above vcftools command to remove such 'junk' variants. Thanks for the support.

ADD REPLYlink modified 7 months ago • written 7 months ago by mab65830

You could just filter out those with QUAL < 30 (equating to 1 in 1000 chance of being incorrect). Depending on the pipeline that you used to produce the input VCF, these may already be marked as low quality in the FILTER column. Variants called at low read-depths are also questionable.

ADD REPLYlink written 7 months ago by Kevin Blighe47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 542 users visited in the last hour