Haplotype phasing with Beagle on microarray output VCF, tuning suggestions?
3 months ago
jpablo • 0

I'm phasing SNP data from an Illumina microarray with Beagle 5.2.

I start from an unphased VCF with around 600,000 SNPs.

I also trio-phased the same VCF, so I have a phased control VCF.

I run a simple pipeline:

java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./chr1.1kg.phase3.v5a.b37.bref3 chrom=1


The genetic map is from HapMap and the reference panel is from 1000 Genomes.

The resulting "phased" VCF from Beagle differs greatly from the one I got from the trio phasing. Does anyone know of any parameter tuning that would give the proper phased output? I tried a larger window (up to 100 cM), more iterations (up to 120), and a larger overlap (up to 5 cM), with no good results.

I also tried to reduce the reference panel, extracting only the positions that are present in the source VCF, using bedtools.

First I uncompressed the bref3:

java -jar unbref3.28Jun21.220.jar chr1.1kg.phase3.v5a.b37.bref3 > chr1.1kg.phase3.v5a.b37.vcf


Then I extracted the intersection between this VCF and the source file:

bedtools intersect -b source.vcf -a chr1.1kg.phase3.v5a.b37.vcf > reduced.chr1.1kg.phase3.v5a.b37.vcf
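
For comparison, here is a minimal Python sketch of the same position-based subsetting. It is an illustration only, not part of the pipeline above: the function name `subset_vcf_lines` is hypothetical, it matches records on CHROM and POS only (alleles are not checked), and it works on uncompressed VCF lines already read into memory. Unlike `bedtools intersect` without `-header`, it keeps the header lines of the reference VCF.

```python
def subset_vcf_lines(ref_lines, source_lines):
    """Keep header lines plus the reference records whose (CHROM, POS)
    also occur in the source VCF. Hypothetical helper for illustration."""
    # Collect the (CHROM, POS) pairs present in the source VCF.
    wanted = set()
    for line in source_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split()[:2]
        wanted.add((chrom, pos))

    kept = []
    for line in ref_lines:
        if line.startswith("#"):
            kept.append(line)  # keep every header line
            continue
        fields = line.split(None, 2)
        if len(fields) >= 2 and (fields[0], fields[1]) in wanted:
            kept.append(line)
    return kept


if __name__ == "__main__":
    source = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tarivcf",
        "1\t798959\trs11240777\tG\tA\t.\t.\t.\tGT\t1/0",
    ]
    reference = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tREF_SAMPLE",
        "1\t798959\trs11240777\tG\tA\t.\tPASS\t.\tGT\t0|1",
        "1\t799000\t.\tC\tT\t.\tPASS\t.\tGT\t0|0",
    ]
    for record in subset_vcf_lines(reference, source):
        print(record)
```

The sample name `REF_SAMPLE` and the record at position 799000 are made up for the example.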


As a last step I ran Beagle again:

java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./reduced.chr1.1kg.phase3.v5a.b37.bref3 chrom=1


But the "phased" output still differs from the phased data confirmed by the trio.

Any clues or suggestions? Thank you in advance.

jp

PS: This is an extract from the source VCF:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  arivcf
1       82154   rs4477212       A       .       .       .       .       GT      0/0
1       752566  rs3094315       G       .       .       .       .       GT      ./.
1       752721  rs3131972       A       .       .       .       .       GT      0/0
1       768448  rs12562034      G       .       .       .       .       GT      ./.
1       776546  rs12124819      A       .       .       .       .       GT      ./.
1       798959  rs11240777      G       A       .       .       .       GT      1/0
1       800007  rs6681049       T       .       .       .       .       GT      ./.
1       838555  rs4970383       C       .       .       .       .       GT      0/0
1       846808  rs4475691       C       T       .       .       .       GT      0/1
1       854250  rs7537756       A       .       .       .       .       GT      0/0
1       861808  rs13302982      A       G       .       .       .       GT      0/1
1       873558  rs1110052       G       T       .       .       .       GT      0/1
1       882033  rs2272756       G       A       .       .       .       GT      1/0

beagle phasing microarray • 288 views

When you say 'The resulting "phased" VCF from beagle differs greatly from the one I got from the trio phasing' - how different are we talking? What kind of differences are there?

One option would be to use shapeit4 and integrate trio phasing with reference based phasing in one step.


The differences are phase flips every few consecutive heterozygous positions (between 2 and 10 positions). The genotypes are OK, but the phase flips relative to my phased information from the trio.
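
To put a number on this, one option is to count switch errors between the two phasings. The following is a minimal sketch, not something I have validated on the real data: the function name `switch_errors` is hypothetical, and it assumes the two inputs are lists of phased GT strings (such as `0|1`) for the same sample at the same sites in the same order. Sites that are unphased, missing, homozygous, or discordant between the two files are skipped.

```python
def switch_errors(truth, test):
    """Return (switches, n_het): the number of phase switches between two
    phasings and the number of heterozygous sites that were compared."""
    orientations = []
    for t, c in zip(truth, test):
        # Skip calls that are unphased or contain missing alleles.
        if "|" not in t or "|" not in c or "." in t or "." in c:
            continue
        ta, tb = t.split("|")
        ca, cb = c.split("|")
        # Skip homozygous sites and sites where the genotypes disagree.
        if ta == tb or {ta, tb} != {ca, cb}:
            continue
        # True if both phasings put the same allele on the first haplotype.
        orientations.append((ta, tb) == (ca, cb))

    # A switch is a change of relative orientation between consecutive het sites.
    switches = sum(
        1 for prev, cur in zip(orientations, orientations[1:]) if prev != cur
    )
    return switches, len(orientations)


if __name__ == "__main__":
    trio = ["0|1", "0|1", "1|0", "0|0", "0|1", "1|0"]
    beagle = ["0|1", "1|0", "0|1", "0|0", "0|1", "1|0"]
    switches, n_het = switch_errors(trio, beagle)
    print(f"{switches} switches over {n_het} heterozygous sites")
```

The switch error rate is then `switches / (n_het - 1)` when more than one heterozygous site was compared. The genotype lists in the example are made up.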

I can use shapeit4 for this particular case because I have the trio, but I need to tune the pipeline for standalone samples with no pedigree info, so that I can later run IBD detection with Refined IBD.

I also tried to process the sample with the Michigan Imputation server (against 1000G and HRC reference panels), and the output was even worse (they use Eagle 2.4).