Beagle 4.1 error : Possible data conversion issue
5.4 years ago
aritra90 ▴ 60

Hi, I have PLINK format data (PED/MAP) and I want to convert it to VCF so that I can phase it with BEAGLE 4.1, which only accepts VCF input. I am looking for a simple one-line solution rather than a pipeline using PSEQ, MEGA2, etc.

I saw that in PLINK 1.9 one can just use --recode vcf to achieve this. However, when I did this and ran Beagle (gt) on the output, it gave me Java exceptions. It's not a problem with the Beagle jar file, as it runs fine with sample VCF data downloaded from 1000 Genomes; it only fails when the input is the VCF that PLINK produced. It'd be great if anyone could help with this, for example with a workaround or another simple way to convert PLINK to VCF for Beagle input.

Error snippet:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: nSamples==0
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at h.G.c(Unknown Source)
at h.G.a(Unknown Source)
at main.Main.main(Unknown Source)
Caused by: java.lang.IllegalArgumentException: nSamples==0
at h.I.<init>(Unknown Source)
at h.e.<init>(Unknown Source)
at h.G.a(Unknown Source)
at h.G.a(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)

To describe what I am doing to convert PLINK to VCF:

1) Converting the PLINK files to .bgl (Beagle 3 format) with PLINK

2) Converting the .bgl file to VCF using beagle2vcf.jar

3) Post-processing the VCF to make it tab-separated

4) Running Beagle 4.1, only to get the aforementioned error

Thanks,
Aritra

SNP PLINK BEAGLE VCF • 5.0k views

Can you post the errors?


Hi Zev,

Thanks.


Have you tried running --list-duplicate-vars, and then using --exclude on the listed variant IDs before exporting a VCF?

Also, what do the first few non-header lines of the VCF look like?
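For anyone who wants to pull out those first few records without opening a huge file, here is a minimal Python sketch. The helper names (`open_vcf`, `first_records`) are made up for illustration, and it assumes a tab-delimited VCF that is either plain text or gzip-compressed.

```python
import gzip
from itertools import islice


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def first_records(path, n=3, max_fields=12):
    """Return the first n non-header records, each cut to max_fields columns."""
    with open_vcf(path) as handle:
        # Header and meta lines all start with '#'; everything else is a record.
        records = (
            line.rstrip("\n").split("\t")
            for line in handle
            if not line.startswith("#")
        )
        return [fields[:max_fields] for fields in islice(records, n)]


# Example call (the file name is taken from the command quoted in this thread):
# for fields in first_records("inputfile.vcf.gz"):
#     print("\t".join(fields))
```

Cutting each record to a handful of columns keeps the output readable when there are thousands of samples.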


Thanks for the input, Christopher. I ran --list-duplicate-vars, but for this particular dataset it didn't return any duplicate variants (I do get duplicates for other datasets that I am not currently using). The non-header lines of the VCF file (after --recode vcf) look like this:

1       752566  rs3094315       T       C       .       .       PR      GT      0/0     0/0     1/1     0/0     0/1     0/1     0/0     1/1     0/0     0/0     1/1     ......


the header lines look like this:

##fileformat=VCFv4.2
##fileDate=20151203
##contig=<ID=1,length=249198165>
##contig=<ID=2,length=242996590>
##contig=<ID=3,length=197793906>
##contig=<ID=4,length=190723161>
##contig=<ID=5,length=180666277>
##contig=<ID=6,length=170823380>
##contig=<ID=7,length=158928570>
##contig=<ID=8,length=146239141>
##contig=<ID=9,length=141010458>
##contig=<ID=10,length=134966155>
##contig=<ID=11,length=134905782>
##contig=<ID=12,length=133734114>
##contig=<ID=13,length=115074879>
##contig=<ID=14,length=107285438>
##contig=<ID=15,length=102388693>
##contig=<ID=16,length=90141356>
##contig=<ID=17,length=81004771>
##contig=<ID=18,length=77984346>
##contig=<ID=19,length=58949580>
##contig=<ID=20,length=62906515>
##contig=<ID=21,length=48050389>
##contig=<ID=22,length=51024838>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT........


The error that I get in Beagle 4 when I use the file produced by --recode vcf is this:

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: 1 (followed by a bunch of java exceptions)


It'd be great if you can help me with this.

Thanks.


I'm running into the same issue. Have you resolved this or found another work-around?


Have run into the same problem. Did you ever find a solution to it?

5.3 years ago

I think I found the issue. The current version of PLINK (as of 1/7/2016) has a subtle bug that discards the alternate allele code if there is actually no genetic variation present in the data (due to missing calls, etc.). Until this is fixed, a workaround is to simply remove any variant that has MAF = 0 beforehand. Hopefully there aren't too many.
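If re-exporting from PLINK is inconvenient, a similar effect can be had by filtering the VCF itself. The sketch below is not the approach described above (which removes the variants before export). It drops records after export, and it assumes that a monomorphic site shows up with `.` in the ALT column. The function name `drop_monomorphic` is hypothetical.

```python
def drop_monomorphic(lines):
    """Yield VCF lines, skipping records whose ALT column is '.'.

    Assumption: sites with no alternate allele are written with ALT == '.'.
    Header and meta lines (starting with '#') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        # Columns are CHROM, POS, ID, REF, ALT, ...; ALT is the fifth field.
        fields = line.split("\t", 5)
        if len(fields) > 4 and fields[4] == ".":
            continue
        yield line


# Example use on an uncompressed file (file names are placeholders):
# with open("input.vcf") as src, open("filtered.vcf", "w") as dst:
#     dst.writelines(drop_monomorphic(src))
```

It is worth counting how many records get dropped before relying on the filtered file.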


This is required by the VCF specification; it is not a PLINK bug. (That's why TASSEL has the same "bug".) With that said, it sounds like you found the best workaround.


This sounds like a good alternative. Thanks guys :)


Just to say this solved my

Exception in thread "main" java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: 1


issue as well :) Thanks!

5.4 years ago

It looks like you ran into a difference between Beagle 3 and Beagle 4.

This is your main error message:

Caused by: java.lang.IllegalArgumentException: nSamples==0


First of all, that looks like a comparison instead of an assignment (shouldn't it be nSamples=0?). Second, looking at the Beagle v4.1 manual, there is no nSamples argument (only nthreads, niterations); that argument does, however, exist in the Beagle 3 manual.

If you look at the release notes, nsamples was dropped in the Beagle 4.1 (03Oct15.284) release.

Solution: Use Beagle 3, or drop the nsamples parameter if you don't need it.


Hi Philipp,

I have Beagle 3 working, but I want to use Beagle 4 as it has some improvements that we will need. I know that Beagle 4.1 doesn't have an nsamples argument, and I don't include it when I run the jar file either.

I run this: java -jar beagle.09Nov15.d2a.jar gt=inputfile.vcf.gz out=outfile.gt and get that Java exception, which baffles me. I need to use Beagle as part of a script, so I need to keep it as generic as I can, without dependencies like Mega2, PLINKSEQ, etc.

Aritra


Thank you for the command!

Tricky! I've now decompiled Beagle 4.1 and it does use nSamples internally a few times, so I wouldn't be surprised if your VCF accidentally triggers stuff like this:

if (paramF.c() == 0) {
    throw new IllegalArgumentException("nSamples==0");
}


So internally it keeps using nSamples, but you as the user can't touch it. Beagle 4 seems to work out the value of nSamples automatically.

Since it's decompiled Java code, which leads to weird variable and method names, it's hard to check exactly what's going on. All I can see is that the method c() checks the length of something (number of alleles? SNPs? individuals?).

I can't check the original source code, since Beagle's page says it will only be available once the paper is out.

So you're indeed correct in your first post: there's something weird or missing in the PLINK-converted output that Beagle makes assumptions about. Can you find any differences between the file you now have (inputfile.vcf.gz) and the 1000 Genomes file that originally worked? For example, / instead of | as the genotype separator?
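One way to make that comparison less manual is to summarize a few structural features of each file and compare the results. The Python sketch below is only an illustration with made-up names (`summarize_vcf`). It reports how many sample columns follow the nine fixed columns of the `#CHROM` line, whether that line is tab-delimited, and which genotype separators (`/` or `|`) occur. Whether any of these is what actually trips Beagle is an open question here, not something the sketch establishes.

```python
def summarize_vcf(lines, max_records=1000):
    """Summarize structural features of a VCF given as an iterable of lines."""
    summary = {
        "n_samples": None,             # sample columns after the 9 fixed ones
        "header_tab_delimited": None,  # does the #CHROM line contain tabs?
        "separators": set(),           # '/' (unphased) and/or '|' (phased)
    }
    seen = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#CHROM"):
            summary["header_tab_delimited"] = "\t" in line
            # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT = 9 fixed columns.
            summary["n_samples"] = max(len(line.split("\t")) - 9, 0)
            continue
        if line.startswith("#") or not line:
            continue
        for genotype in line.split("\t")[9:]:
            call = genotype.split(":", 1)[0]  # GT is the first FORMAT subfield
            if "|" in call:
                summary["separators"].add("|")
            elif "/" in call:
                summary["separators"].add("/")
        seen += 1
        if seen >= max_records:
            break
    return summary


# Example use (file name is a placeholder):
# with open("converted.vcf") as handle:
#     print(summarize_vcf(handle))
```

Note that a space-delimited `#CHROM` line yields `n_samples == 0` in this sketch, because splitting on tabs finds a single column. Running it on both the working and the failing file gives two small dictionaries that are easy to compare.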

You could try converting the PLINK files using TASSEL's graphical interface: Data -> Load -> Load PLINK, followed by Data -> Export -> Write VCF. Maybe that output file will have whatever Beagle is missing in PLINK 1.9's conversion output.

Edit: TASSEL is here: http://www.maizegenetics.net/#!tassel/c17q9


Hi Philipp,

Thanks for looking into it so much. TASSEL and PLINK --recode vcf give the same file as output, and I get the following error when I run Beagle on them:

Caused by: java.lang.IllegalArgumentException: 1
at h.d.b(Unknown Source)
at h.d.<init>(Unknown Source)
at h.I.<init>(Unknown Source)
at h.e.<init>(Unknown Source)
at h.G.a(Unknown Source)
at h.G.a(Unknown Source)


But when I convert PLINK to .bgl using --recode bgl -nomap and then convert it to VCF using beagle2vcf.jar from the Beagle 4.1 utilities, I get the nSamples==0 error. As .bgl is the BEAGLE 3 format, I think it has something to do with that.

I am kind of hitting a roadblock here, so any help would be appreciated, as I don't want to go back to Beagle 3.

Thanks,

Aritra


Have you contacted the Beagle authors? There's obviously something wonky in the nSamples approximation.

At this point I'd go through both of your files (the 1000 Genomes files that work and your converted ones that don't) and check for any difference that may trip Beagle up. The Beagle people may be better placed to help you with that. After all, they're interested in having their software work with files generated by software as common as PLINK or TASSEL.