Imputing missing genotypes using Beagle
3
0
Entering edit mode
3.3 years ago

Hi! I want to use Beagle version 5.1. to exclusively impute my missing genotypes (no ref. panel needed). I used .ped and .map files to create my .vcf file using plink1.09b. My file 394.RMV-UNKNOW-SEXUAL-Filtered.vcf looks like:

contig=<ID=29,length=51502869>
INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real refere$
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1_V_UK561500501863_plate1_A01   2_V$
1       16947   BovineHD0100000005      C       T       .       .       PR      GT      0/1     0/0     0/0
1       135098  Hapmap43437-BTA-101   T       C       .       .       PR      GT      0/1     0/1     0/0
1       149772  BovineHD0100000042      G       A       .       .       PR      GT      0/0     0/0     0/1
1       158820  BovineHD0100000048      G       T       .       .       PR      GT      0/1     1/1     0/0
1       163995  BovineHD0100000051      T       C       .       .       PR      GT      0/0     0/0     0/0
1       183040  BovineHD0100000057      T       G       .       .       PR      GT      0/1     0/1     ./.
1       267940  ARS-BFGL-NGS-16466    T       C       .       .       PR      GT      0/1     0/0     1/1
1       290690  BovineHD0100000082      A       C       .       .       PR      GT      0/0     0/1    0/1

When I run the command:

java -Xss5m -Xmx4g -jar /exports/eddie/scratch/v1mmart8/Test/beagle.18May20.d20.jar gt=394.RMV-UNKNOW-SEXUAL-Filtered.vcf out=394_imputed impute=false

The output created 394_imputed.vcf.gz is empty. The error file says_ERROR: REF field is not a sequence of A, C, T, G, or N characters at 1:78986953 [D] I tried to include missing=./. in the command but beagle does not recognize it.

Any idea why is it not running? Many thanks!

Beagle missing • 3.1k views
ADD COMMENT
0
Entering edit mode

I might be wrong but I did not see the option missing. Also should your REF field contain missing genotypes? I think you have a malformed VCF file, or the ID field contains some special character that is causing that error. Can you find 1:78986953 in the vcf file and post it? The other issue I noticed is that your missing data is . when I think it should be ./..

ADD REPLY
0
Entering edit mode

Are you trying to impute 3 samples without a reference panel? I do not think this will work well at all. I would think more like 100-1000 might be okay. The more the better without a reference panel.

ADD REPLY
0
Entering edit mode
3.3 years ago

Hi! Many thanks for your response!

As you said, many issues here: 1)With egrep '78986953' I get 1 78986953 ARS-BFGL-NGS-119431 D I . . PR GT 0/0 I think the D is the problem here.. I made a quick search and this is the only case where this happens so I think I can solve it by removing this line. 2)Also, about the missings, I have them coded like "." for single characters in the columns of quality and filter, which is "./." when occurs in a SNP. However, when I include the command missing=. in Beagle I get the error: Error: unrecognized parameter: missing=. I wonder how Beagle wants the missing values to be coded? 3) I also detected I have some repeated SNPs (by position) in my .vcf file, any idea how to remove them?

To jean.elbers: no, I just want to impute my missing SNPs in my 50k file of 386 animals

Many thanks!

ADD COMMENT
0
Entering edit mode
3.3 years ago

I have removed the line with "D", and now beagle runs but does not finishes the work. I have aprox. 36 000 SNPs but it stops at line 568 and the error: WARNING: We recommend that you use a minimum of 4 GB of virtual memory when running Java 1.8.0_74 on Eddie. Please see the following for details: https://www.wiki.ed.ac.uk/display/ResearchServices/Java Exception in thread "main" java.lang.IllegalArgumentException: Duplicate marker: 1 59409838 ARS-USMARC-Parent-DQ404150-rs29012530_dup T C at vcf.Markers.markerSet(Markers.java:131) at vcf.Markers.<init>(Markers.java:85) at vcf.Markers.create(Markers.java:64) at vcf.BasicGT.markers(BasicGT.java:105) at vcf.BasicGT.<init>(BasicGT.java:86) at vcf.TargetData.targGT(TargetData.java:92) at vcf.TargetData.advanceWindowCm(TargetData.java:120) at main.Main.phaseData(Main.java:158) at main.Main.main(Main.java:113)

Some help to interpret it would be really useful Many thanks! Marina

ADD COMMENT
0
Entering edit mode

change -Xmx4g to -Xmx100g or however much ram you have available (use free -h)

ADD REPLY
0
Entering edit mode
3.3 years ago

Hi, many thanks for the reply! I tried several but I am restricted to 4g.. (-Xmx4g). However, I think my issue is that I have duplicated SNPs in my .vcf file.. Any ideas how to delete them? Many thanks!

ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Dear Jean,

Thanks, I follow it and it worked!. Just a last question so I make sure I did not miss anything Do we use the same commands (call java, gt= and out=) to phase a reference database with Beagle and to impute missing values in our case study population without a reference? (of course with different entry and ouput files) Is it normal that I do not get any imputation accuracy if I am only imputing missing positions in the genotyped SNPs? Many thanks for your time!

ADD REPLY
0
Entering edit mode

Please see http://faculty.washington.edu/browning/beagle/beagle_5.1_08Nov19.pdf especially section 5 about output

ADD REPLY
0
Entering edit mode

@marinamartinezalvaro Please can you keep any additional comments or questions to your original post. Answers are reserved for answers, rather then comments or questions.

ADD REPLY

Login before adding your answer.

Traffic: 1295 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6