Question: Discrepancy between VEP and IGV
gravatar for ijk8qd
2.5 years ago by
ijk8qd0 wrote:

I am in the process of analyzing the results of a GWAS with Drosophila Melanogaster and I've been finding discrepancies between variants identified by Ensembl's Variant Effect Predictor and what I find when I look at the same location in a genome browser (specifically Integrated Genome Viewer). Some of the variants do not show up at all whereas others do not have the effect predicted by VEP. For example, at one location VEP states that there will be a frameshift variant caused by the insertion of 2 bases (CAA/CAGCA). However, when looking into this location in IGV I find that there isn't a frameshift but rather a point mutation that changes CAA to CAG (the next two bases in the genome are CA). At other locations, it seems that VEP is using the wrong sequence entirely. For example, the VEP output claims that a point mutation will change ATT to GTT but IGV shows that instead there is a change from AAT to AAC; Neither the starting sequence nor the mutation match. My reads are aligned with the most recent Drosophila Melanogaster BDGP6 build genome. I have made sure that both VEP and IGV are running on the same genome. If anyone has any ideas as to why I may be encountering these issues I'd be grateful for the help.

sequencing vep igv gwas • 814 views
ADD COMMENTlink written 2.5 years ago by ijk8qd0

good description of issue. A little bit of example original data, VEP annotation of the same, genome version and/or IGV images would help. @ ijk8qd

ADD REPLYlink written 2.5 years ago by cpad011214k

screenshot + vcf line wanted please.

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum133k
gravatar for finswimmer
2.5 years ago by
finswimmer14k wrote:

Hello and welcome ijk8qd,

maybe I miss something. But the input for VEP is a variant list/file like vcf, isn't it? So you've done variant calling on your bam file before. The discrepancy than have nothing to do with VEP but with your variant caller.

The interesting question are now:

  • How did you do the variant calling
  • How does the vcf entry looks like for these position
  • A screenshot of IGV would be useful as well
ADD COMMENTlink written 2.5 years ago by finswimmer14k

Thanks for the response, the variant calling was done using GATK. The idea about the discrepancy being caused by this step is something I hadn't considered though it could be a possibility. I'm following a protocol worked out by someone else in my lab a few years ago so maybe I'm taking something for granted and therefore need to change some step along the way. Here's an example of the script I used for this step:


#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --time=20:00:00
#SBATCH -p parallel
#SBATCH --account=hirsh
#SBATCH --mem=64000
java -Xmx64g -jar /scratch/ijk8qd/GenomeAnalysisTK.jar/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-nct 20 \
-R /scratch/ijk8qd/ref_genomes/B6_92/WholeGenome/WholeGenome.fa \
-I /scratch/ijk8qd/92_Analysis/Dedup/$current_sample.bam \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o /scratch/ijk8qd/92_Analysis/VCF/$current_sample.vcf \

I'm not sure how to look at the vcf entry, it's a large file and I'm not entirely sure how to read it so if you have any tips for that I'd appreciate it. Here's the link for the IGV screenshot at the location of the first anomaly I mentioned (the CAGCA): Sorry the text is so small, the program gets formatted weirdly on my laptop. I'm looking at two bam files from that come from flies with differing activity levels and the bottom track is the reference genome.

ADD REPLYlink written 2.5 years ago by ijk8qd0

I'm not sure how to look at the vcf entry

The easiest way is to bgzip the vcf file, index it with tabix and use tabix to query for given region:

$ bgzip -c input.vcf > input.vcf.gz
$ tabix -p vcf input.vcf.gz
$ tabix input.vcf.gz <chr>:<from>-<to>

Replace <chr>, <from>, <to> with appropriate values.

Your script uses HaplotypeCaller for variant caller. This caller discards alignment informations in a region where it suspects a variant and is doing denovo assembly with this reads. So the resulting alignment can be different to this in the bam file. This is very likely in case of insertion/deletion. In the screenshot you have linked there is an insertion to see.

Due to this denovo assembly it is not always possible to compare the bam file directly with the vcf file. To do this you can follow this instruction. I would recommend to do it for just a given region and not for the whole alignment file.

fin swimmer

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by finswimmer14k

This seems to have worked! The bamout shows the correct insertion rather than a base switch: There are a lot of regions I would like to do this for, something that I think would become tedious doing it this way. Is there a different variant caller that would avoid this issue?

Also here's the region of the VCF file that matches those shown in the previous screenshot (X:5,352,662-5,352,702)

ADD REPLYlink written 2.4 years ago by ijk8qd0

Hello ijk8qd,

Is there a different variant caller that would avoid this issue?

This is not in issue but a big advantage of variant callers that do denovo reassembly during variant calling like HapoTypeCaller and freebayes. You can add the -bamout option to your normal variant calling command above and you will receive this "corrected" bam file additional to your vcf file for all variants.

Of course there are other variant callers that rely on the mapping information, e.g. bcftools mpileup/call

fin swimmer

ADD REPLYlink written 2.4 years ago by finswimmer14k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2531 users visited in the last hour