Haplotype phasing with Beagle on microarray output VCF, tuning suggestions?
3 months ago
jpablo • 0

I'm phasing SNP data from an Illumina microarray with Beagle 5.2.

I start from an unphased VCF with around 600,000 SNPs.

I also trio-phased the same VCF, so I have a phased control VCF.

I run a simple pipeline:

java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./chr1.1kg.phase3.v5a.b37.bref3 chrom=1

The genetic map is from HapMap, and the reference panel is from 1000 Genomes.

The resulting "phased" VCF from Beagle differs greatly from the one I got from trio phasing. Does anyone know of any parameter tuning that would give a properly phased output? I tried a larger window (up to 100 cM), more iterations (up to 120), and a larger overlap (up to 5 cM), with no improvement.

I also tried reducing the reference panel by extracting only the positions that are present in the source VCF, using bedtools.

First, I uncompressed the bref3 file:

java -jar unbref3.28Jun21.220.jar chr1.1kg.phase3.v5a.b37.bref3 > chr1.1kg.phase3.v5a.b37.vcf

Then I extracted the intersection between this VCF and the source file:

bedtools intersect -b source.vcf -a chr1.1kg.phase3.v5a.b37.vcf > reduced.chr1.1kg.phase3.v5a.b37.vcf
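
For readers who want to see what this filtering step is meant to achieve, here is a minimal Python sketch, not the bedtools implementation. It assumes whitespace-separated VCF lines and matches on chromosome and position only (alleles are not compared). The function name `restrict_to_positions` and the toy reference rows are hypothetical; header lines are passed through so the result still starts with a VCF header.

```python
def restrict_to_positions(ref_lines, target_lines):
    """Keep reference records whose (CHROM, POS) occurs in the target."""
    wanted = set()
    for line in target_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split()[:2]
        wanted.add((chrom, pos))

    kept = []
    for line in ref_lines:
        if line.startswith("#"):
            kept.append(line)  # pass header lines through unchanged
            continue
        if not line.strip():
            continue
        chrom, pos = line.split()[:2]
        if (chrom, pos) in wanted:
            kept.append(line)
    return kept


# Target positions taken from the source VCF extract in this post.
target = [
    "1 798959 rs11240777 G A . . . GT 1/0",
    "1 846808 rs4475691 C T . . . GT 0/1",
]
# Toy reference rows: the sample name and genotypes are made up.
reference = [
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT toy_sample",
    "1 798959 rs11240777 G A . PASS . GT 0|1",
    "1 800000 toy_variant C T . PASS . GT 0|0",
    "1 846808 rs4475691 C T . PASS . GT 1|0",
]
kept = restrict_to_positions(reference, target)
print(len(kept))
```

Matching on position alone keeps multi-allelic or differently coded records at the same site, so a stricter version would also compare REF and ALT.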

As a last step, I ran Beagle again:

java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./reduced.chr1.1kg.phase3.v5a.b37.bref3 chrom=1

But the "phased" output still differs from the phased data confirmed by the trio.

Any clues or suggestions? Thank you in advance.


PS: This is an extract from the source VCF:

##FILTER=<ID=PASS,Description="All filters passed">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  arivcf
1       82154   rs4477212       A       .       .       .       .       GT      0/0
1       752566  rs3094315       G       .       .       .       .       GT      ./.
1       752721  rs3131972       A       .       .       .       .       GT      0/0
1       768448  rs12562034      G       .       .       .       .       GT      ./.
1       776546  rs12124819      A       .       .       .       .       GT      ./.
1       798959  rs11240777      G       A       .       .       .       GT      1/0
1       800007  rs6681049       T       .       .       .       .       GT      ./.
1       838555  rs4970383       C       .       .       .       .       GT      0/0
1       846808  rs4475691       C       T       .       .       .       GT      0/1
1       854250  rs7537756       A       .       .       .       .       GT      0/0
1       861808  rs13302982      A       G       .       .       .       GT      0/1
1       873558  rs1110052       G       T       .       .       .       GT      0/1
1       882033  rs2272756       G       A       .       .       .       GT      1/0
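
Only heterozygous calls can differ in phase, so it may help to tally how many sites of each kind the input contains. The following is a minimal Python sketch over the extract above, assuming whitespace-separated fields and a single sample in the tenth column; the helper names `classify` and `tally_genotypes` are illustrative, not part of any tool.

```python
from collections import Counter


def classify(gt):
    """Label a GT string as 'missing', 'het' or 'hom'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    return "het" if len(set(alleles)) > 1 else "hom"


def tally_genotypes(lines, sample_index=0):
    """Count genotype classes for one sample across VCF data lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        gt = fields[9 + sample_index].split(":")[0]
        counts[classify(gt)] += 1
    return counts


extract = """\
1 82154 rs4477212 A . . . . GT 0/0
1 752566 rs3094315 G . . . . GT ./.
1 752721 rs3131972 A . . . . GT 0/0
1 768448 rs12562034 G . . . . GT ./.
1 776546 rs12124819 A . . . . GT ./.
1 798959 rs11240777 G A . . . GT 1/0
1 800007 rs6681049 T . . . . GT ./.
1 838555 rs4970383 C . . . . GT 0/0
1 846808 rs4475691 C T . . . GT 0/1
1 854250 rs7537756 A . . . . GT 0/0
1 861808 rs13302982 A G . . . GT 0/1
1 873558 rs1110052 G T . . . GT 0/1
1 882033 rs2272756 G A . . . GT 1/0
"""
counts = tally_genotypes(extract.splitlines())
print(dict(counts))
```

On these 13 records the tally is 5 heterozygous, 4 homozygous and 4 missing calls.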
beagle phasing microarray • 288 views

When you say 'The resulting "phased" VCF from beagle differs greatly from the one I got from the trio phasing', how different are we talking? What kind of differences are there?

One option would be to use shapeit4 and integrate trio phasing with reference based phasing in one step.


The differences are phase flips every few consecutive heterozygous positions (every 2 to 10 positions). The genotypes are correct, but the phase flips relative to my trio-phased data.

I can use shapeit4 for this particular case because I have the trio, but I need to tune the pipeline for standalone samples with no pedigree information, so that I can later run IBD detection with Refined IBD.

I also tried processing the sample with the Michigan Imputation Server (against the 1000G and HRC reference panels), and the output was even worse (they use Eagle 2.4).
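
One way to put a number on these flips is to count switches between consecutive heterozygous sites. This is a minimal Python sketch, assuming both call sets list the same sites in the same order as phased GT strings; the function name `switch_errors` and the toy genotype lists are illustrative, not real data from this sample.

```python
def switch_errors(truth, test):
    """Count phase switches between two phased genotype lists.

    Returns (switches, compared), where `compared` is the number of
    consecutive pairs of sites that are heterozygous in both lists.
    """
    switches = 0
    compared = 0
    prev_same = None  # orientation at the previous informative site
    for t, b in zip(truth, test):
        if "|" not in t or "|" not in b:
            continue  # unphased or missing call
        t1, t2 = t.split("|")
        b1, b2 = b.split("|")
        if t1 == t2 or b1 == b2:
            continue  # homozygous sites carry no phase
        if {t1, t2} != {b1, b2}:
            continue  # genotypes disagree, not a phasing difference
        same = t1 == b1  # True if both lists put the same allele first
        if prev_same is not None:
            compared += 1
            if same != prev_same:
                switches += 1
        prev_same = same
    return switches, compared


# Toy example: the test phasing flips at the 4th site and flips back
# at the 6th site.
trio = ["0|1", "1|0", "0|0", "0|1", "1|0", "0|1"]
beagle = ["0|1", "1|0", "0|0", "1|0", "0|1", "0|1"]
switches, compared = switch_errors(trio, beagle)
print(switches, compared)  # prints: 2 4
```

Dividing `switches` by `compared` gives a switch error rate, which is easier to compare across parameter settings than eyeballing the two VCFs.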

