Is there a tool or method for phasing a VCF based on an already phased BAM?
2
0
Entering edit mode
2.4 years ago
olavur ▴ 130

I have some BAM files where the reads are phased, such that each read belongs to haplotype 1 or 2 (or NA if the read isn't phased). Next I call variants in these BAMs, and obtain an unphased VCF. The question is: is there a method, algorithm, software, anything, that allows me to phase the variants in the VCF based on the phasing information in the BAM?

Edit:

My explanation above probably wasn't clear enough. I don't want to phase the BAM, the BAM is already phased. I basically want to use the phasing in the BAM to infer the phasing of the VCF.

The BAM is phased using a linked-read technology from 10x Genomics. Basically, DNA fragments are barcoded such that two reads that have the same barcode are close together, and this gives the information to, among other things, phase the data.

Edit 2:

@Vitis answer actually solves my problem, although it isn't exactly what I was looking for initially.

With WhatsHap, you can phase a VCF by providing a BAM file and optionally a phased VCF file. WhatsHap will phase the input VCF with read-backed phasing and use haplotypes defined in the phased VCF as well. So what I will do is actually phase my VCF based on a different already phased VCF of the same sample.

If you do have the problem I posed originally, and don't have a phased VCF, I think there is a potential work-around. If you can extract haplotypes, or phase-blocks, from your phased BAM, you could write them to a BAM as fictitious reads, and have WhatsHap phase your input VCF based on that.

vcf bam phasing haplotype • 1.9k views
ADD COMMENT
2
Entering edit mode
2.4 years ago
Vitis ★ 2.4k

Whatshap is designed for this. It works a bit differently, requiring a solid, accurate VCF with variants to phase and a BAM (not phased), best from long-read technologies.

ADD COMMENT
0
Entering edit mode

This tool does not do the task I need. I edited the question with a clarification.

ADD REPLY
0
Entering edit mode

10X blocks only phases within the limits of 10X linked reads, if you need to go beyond the limits, you still need tools to bridge the 10X phased blocks. In Whatshap paper

https://www.biorxiv.org/content/biorxiv/early/2016/11/14/085050.full.pdf

It explicitly stats:

Instead of BAM, phasing information can be provided to WhatsHap also as phased VCF files, such as those generated by 10X Genomics’ phasing pipeline, and already phased blocks are then treated as “reads”......Input reads – even those from a single sample – can be distributed over several BAM (or phased VCF) files, and may use different sequencing technologies. For example, PacBio, Illumina mate-pair reads, and 10X Genomics phased blocks can be used simultaneously for phasing.

If you only need to phase within the scope of the blocks generated by 10X phasing pipeline, a simple pysam query would go through the phased BAM and report phased genotypes within the phased read/block.

ADD REPLY
0
Entering edit mode

I'm looking into the paper and the docs. It sounds really promising, thank you.

ADD REPLY
0
Entering edit mode

Hey @olavur, any updates on the WhatsHap phasing on the 10X data? I used it and was able to phase around 1/8th of the variants that were left unphased by the 10X phasing pipeline. The variants that remain still have significant amount of reads supporting them (according to IGV) so I am wondering if I did something wrong or if WhatsHap will not phase all of the "10X PHASING INCONSISTENT" variants. Thanks.

ADD REPLY
0
Entering edit mode

I'm trying it just now, but don't have any results yet.

I doubt WhatsHap looks at the "10X PHASING INCONSISTENT" tag, it is specific to 10x and WhatsHap accepts any VCF phased by any method.

You could try to visualize the haplotype/phase blocks in IGV. The fact that the variant has read coverage doesn't help, the variant needs to be associated with a haplotype that connects it to other variants via phase blocks.

Did you supply a BAM and phased VCF to WhatsHap? I supplied only the VCF, and it's not working; it detects no "reads", and isn't able to phase any variants. I wonder, if you supplied a BAM and a phased VCF, that maybe WhatsHap is only using the reads in the BAM to phase the input VCF, that is, via traditional read-based phasing.

How many variants are in the input VCF and in the phased VCF? If there are a lot more variants called in the input VCF, that might explain part of the discrepancy. Although if the variants are in the same regions I would expect the phase blocks to overlap anyway.

Is it whole-genome or whole-exome data? I'm working with whole-exome.

ADD REPLY
0
Entering edit mode

If you want to look at the phase blocks in IGV, you can use the whatshap stats [VCF] --gtf [GTF] command (see here).

ADD REPLY
0
Entering edit mode

Good point about the 10x tag - I guess I was wondering if these specific variants are problematic in general. I did supply a 10x linked reads bam and phased VCF processed by longranger. I assumed it would use the phased variants in the vcf due to the PS tags already present.

There is a section in the docs titled "Using a phased VCF instead of a BAM/CRAM file". Is this what you tried? I could try providing an unphased vcf along with phased vcf instead of bam; as you say it might be using the reads in just the bam.

Same amount of variants in the input (phased vcf) variants and the phased vcf output from WhatsHap (not sure what you meant by input/phased vcfs because I used bam+phased vcf).

The data I am testing with is the whole genome from the 10x website (NA12878 Germline Genome v2).

ADD REPLY
0
Entering edit mode

Thanks this is useful! Here's what I got:

Pre-WhatsHap :

---------------- Chromosome chr21 ---------------- Variants in VCF: 70664 Heterozygous: 48405 ( 48405 SNVs) Phased: 40761 ( 40761 SNVs) Unphased: 7641 (not considered below) Singletons: 3 (not considered below) Blocks: 17

Post-WhatsHap:

---------------- Chromosome chr21 ---------------- Variants in VCF: 70664 Heterozygous: 48405 ( 48405 SNVs) Phased: 41684 ( 41684 SNVs) Unphased: 6719 (not considered below) Singletons: 2 (not considered below) Blocks: 6498

My concern is with the 6719 unphased variants. Did you have any success with the phased vcf+normal vcf method?

ADD REPLY
0
Entering edit mode

Can you email me at olavur(at)fargen.fo? Discussing this in the comments is becoming cumbersome.

ADD REPLY
1
Entering edit mode
2.4 years ago

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php

This tool identifies haplotypes based on the overlap between reads and uses this information to generate physical phasing information for variants within these haplotypes.

ADD COMMENT
0
Entering edit mode

This tool does not do the task I need. I edited the question with a clarification.

ADD REPLY
0
Entering edit mode

I basically want to use the phasing in the BAM to infer the phasing of the VCF.

and that's what the gatk tool does.

 java -jar GenomeAnalysisTK.jar \
-T ReadBackedPhasing \
-R reference.fasta \
-I reads.bam \ # INPUT BAM <--------------------
--variant SNPs.vcf \ # INPUT VCF <--------------------
-L SNPs.vcf \
-o phased_SNPs.vcf \ # OUTPUT VCF <--------------------
--phaseQualityThresh 20.0*

ADD REPLY
0
Entering edit mode

Will this take advantage of the fact that my BAM is already phased (each read is assigned to a haplotype)?

ADD REPLY

Login before adding your answer.

Traffic: 1212 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6