Question: Is there a tool or method for phasing a VCF based on an already phased BAM?
0
gravatar for olavur
5 months ago by
olavur70
Tórshavn, Faroe Islands
olavur70 wrote:

I have some BAM files where the reads are phased, such that each read belongs to haplotype 1 or 2 (or NA if the read isn't phased). Next I call variants in these BAMs, and obtain an unphased VCF. The question is: is there a method, algorithm, software, anything, that allows me to phase the variants in the VCF based on the phasing information in the BAM?

Edit:

My explanation above probably wasn't clear enough. I don't want to phase the BAM, the BAM is already phased. I basically want to use the phasing in the BAM to infer the phasing of the VCF.

The BAM is phased using a linked-read technology from 10x Genomics. Basically, DNA fragments are barcoded such that two reads that have the same barcode are close together, and this gives the information to, among other things, phase the data.

Edit 2:

@Vitis answer actually solves my problem, although it isn't exactly what I was looking for initially.

With WhatsHap, you can phase a VCF by providing a BAM file and optionally a phased VCF file. WhatsHap will phase the input VCF with read-backed phasing and use haplotypes defined in the phased VCF as well. So what I will do is actually phase my VCF based on a different already phased VCF of the same sample.

If you do have the problem I posed originally, and don't have a phased VCF, I think there is a potential work-around. If you can extract haplotypes, or phase-blocks, from your phased BAM, you could write them to a BAM as fictitious reads, and have WhatsHap phase your input VCF based on that.

bam phasing haplotype vcf • 470 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by olavur70
2
gravatar for Vitis
5 months ago by
Vitis2.1k
New York
Vitis2.1k wrote:

Whatshap is designed for this. It works a bit differently, requiring a solid, accurate VCF with variants to phase and a BAM (not phased), best from long-read technologies.

https://whatshap.readthedocs.io/en/latest/

ADD COMMENTlink written 5 months ago by Vitis2.1k

This tool does not do the task I need. I edited the question with a clarification.

ADD REPLYlink written 5 months ago by olavur70

10X blocks only phases within the limits of 10X linked reads, if you need to go beyond the limits, you still need tools to bridge the 10X phased blocks. In Whatshap paper

https://www.biorxiv.org/content/biorxiv/early/2016/11/14/085050.full.pdf

It explicitly stats:

Instead of BAM, phasing information can be provided to WhatsHap also as phased VCF files, such as those generated by 10X Genomics’ phasing pipeline, and already phased blocks are then treated as “reads”......Input reads – even those from a single sample – can be distributed over several BAM (or phased VCF) files, and may use different sequencing technologies. For example, PacBio, Illumina mate-pair reads, and 10X Genomics phased blocks can be used simultaneously for phasing.

If you only need to phase within the scope of the blocks generated by 10X phasing pipeline, a simple pysam query would go through the phased BAM and report phased genotypes within the phased read/block.

ADD REPLYlink modified 5 months ago • written 5 months ago by Vitis2.1k

I'm looking into the paper and the docs. It sounds really promising, thank you.

ADD REPLYlink written 5 months ago by olavur70

Hey @olavur, any updates on the WhatsHap phasing on the 10X data? I used it and was able to phase around 1/8th of the variants that were left unphased by the 10X phasing pipeline. The variants that remain still have significant amount of reads supporting them (according to IGV) so I am wondering if I did something wrong or if WhatsHap will not phase all of the "10X PHASING INCONSISTENT" variants. Thanks.

ADD REPLYlink written 3 months ago by suluxan0

I'm trying it just now, but don't have any results yet.

I doubt WhatsHap looks at the "10X PHASING INCONSISTENT" tag, it is specific to 10x and WhatsHap accepts any VCF phased by any method.

You could try to visualize the haplotype/phase blocks in IGV. The fact that the variant has read coverage doesn't help, the variant needs to be associated with a haplotype that connects it to other variants via phase blocks.

Did you supply a BAM and phased VCF to WhatsHap? I supplied only the VCF, and it's not working; it detects no "reads", and isn't able to phase any variants. I wonder, if you supplied a BAM and a phased VCF, that maybe WhatsHap is only using the reads in the BAM to phase the input VCF, that is, via traditional read-based phasing.

How many variants are in the input VCF and in the phased VCF? If there are a lot more variants called in the input VCF, that might explain part of the discrepancy. Although if the variants are in the same regions I would expect the phase blocks to overlap anyway.

Is it whole-genome or whole-exome data? I'm working with whole-exome.

ADD REPLYlink written 3 months ago by olavur70

If you want to look at the phase blocks in IGV, you can use the whatshap stats [VCF] --gtf [GTF] command (see here).

ADD REPLYlink modified 3 months ago • written 3 months ago by olavur70

Good point about the 10x tag - I guess I was wondering if these specific variants are problematic in general. I did supply a 10x linked reads bam and phased VCF processed by longranger. I assumed it would use the phased variants in the vcf due to the PS tags already present.

There is a section in the docs titled "Using a phased VCF instead of a BAM/CRAM file". Is this what you tried? I could try providing an unphased vcf along with phased vcf instead of bam; as you say it might be using the reads in just the bam.

Same amount of variants in the input (phased vcf) variants and the phased vcf output from WhatsHap (not sure what you meant by input/phased vcfs because I used bam+phased vcf).

The data I am testing with is the whole genome from the 10x website (NA12878 Germline Genome v2).

ADD REPLYlink written 3 months ago by suluxan0

Thanks this is useful! Here's what I got:

Pre-WhatsHap :

---------------- Chromosome chr21 ---------------- Variants in VCF: 70664 Heterozygous: 48405 ( 48405 SNVs) Phased: 40761 ( 40761 SNVs) Unphased: 7641 (not considered below) Singletons: 3 (not considered below) Blocks: 17

Post-WhatsHap:

---------------- Chromosome chr21 ---------------- Variants in VCF: 70664 Heterozygous: 48405 ( 48405 SNVs) Phased: 41684 ( 41684 SNVs) Unphased: 6719 (not considered below) Singletons: 2 (not considered below) Blocks: 6498

My concern is with the 6719 unphased variants. Did you have any success with the phased vcf+normal vcf method?

ADD REPLYlink modified 3 months ago • written 3 months ago by suluxan0

Can you email me at olavur(at)fargen.fo? Discussing this in the comments is becoming cumbersome.

ADD REPLYlink written 3 months ago by olavur70
1
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_phasing_ReadBackedPhasing.php

This tool identifies haplotypes based on the overlap between reads and uses this information to generate physical phasing information for variants within these haplotypes.

ADD COMMENTlink written 5 months ago by Pierre Lindenbaum119k

This tool does not do the task I need. I edited the question with a clarification.

ADD REPLYlink modified 5 months ago • written 5 months ago by olavur70

I basically want to use the phasing in the BAM to infer the phasing of the VCF.

and that's what the gatk tool does.

 java -jar GenomeAnalysisTK.jar \
      -T ReadBackedPhasing \
      -R reference.fasta \
      -I reads.bam \ # INPUT BAM <--------------------
      --variant SNPs.vcf \ # INPUT VCF <--------------------
      -L SNPs.vcf \
      -o phased_SNPs.vcf \ # OUTPUT VCF <--------------------
      --phaseQualityThresh 20.0*
ADD REPLYlink written 5 months ago by Pierre Lindenbaum119k

Will this take advantage of the fact that my BAM is already phased (each read is assigned to a haplotype)?

ADD REPLYlink written 5 months ago by olavur70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 880 users visited in the last hour