Question

Ways to improve SnpEff prediction

5.7 years ago

Mohammed ▴ 10

Hi everyone,

I've used a variant-calling pipeline to produce a VCF file from sequencing data for a non-model organism. The VCF file contains about as many variants as I expected; however, the SnpEff annotation seems to give an inaccurate number of effects in the ann.vcf file. I followed the manual's instructions to build the SnpEff database in two different ways:

sequences.fa + genes.gff (with no intron or intergenic regions).
sequences.fa + genes.gtf, converted from the previous gff file using the gffread tool.
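
One thing that may be worth checking before building either database is which feature types the annotation file actually contains, since, as far as I understand, coding effects can only be predicted for transcripts that carry coding (CDS) annotation. The following is a minimal Python sketch, not part of SnpEff: the function name `count_feature_types` and the tiny example annotation are made up for illustration, and it only assumes the standard tab-separated, nine-column GFF/GTF layout.

```python
import io
from collections import Counter


def count_feature_types(handle):
    """Count the values in column 3 (feature type) of a GFF/GTF stream."""
    counts = Counter()
    for line in handle:
        # Skip blank lines and comment/header lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # ignore malformed rows
        counts[fields[2]] += 1
    return counts


def make_row(feature, start, end):
    """Build one made-up GTF row (hypothetical gene g1 / transcript t1)."""
    attrs = 'gene_id "g1"; transcript_id "t1";'
    return "\t".join(
        ["chr1", "example", feature, str(start), str(end), ".", "+", ".", attrs]
    )


# Made-up annotation, for illustration only
example_gtf = io.StringIO(
    "\n".join(
        [
            "# made-up annotation, for illustration only",
            make_row("exon", 100, 200),
            make_row("exon", 300, 400),
            make_row("CDS", 150, 200),
        ]
    )
    + "\n"
)

feature_counts = count_feature_types(example_gtf)
for feature, n in sorted(feature_counts.items()):
    print(feature, n)
```

On a real file the same function could be called as `count_feature_types(open("genes.gtf"))`; comparing the counts for the gff and the converted gtf would show whether the conversion dropped or renamed any feature types.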

Both ways produced an inaccurate number of effects in ann.vcf, but the second gave fewer warnings and much better results. I've also read a previous post about producing a gtf file with only the longest transcript, which didn't solve my problem.

Can anyone help me?

Thank you all

Mohammed

SNP SnpEff • 2.5k views

ADD COMMENT • link updated 5.7 years ago by zx8754 12k • written 5.7 years ago by Mohammed ▴ 10

the prediction seems to give an inaccurate number of effects in the ann.vcf file.

Can you elaborate in more detail?

ADD REPLY • link 5.7 years ago by grey ▴ 40

The number of modifiers is higher than the number of variants, and the gtf file doesn't contain intron or intergenic regions. Check the following output:

Genome    GCA_000767585.1
Date    2019-10-24 10:58
SnpEff version    SnpEff 4.3t (build 2017-11-24 10:18), by Pablo Cingolani
Command line arguments    SnpEff -classic -formatEff GCA_000767585.1 bedtools_out/overlapped_variants.vcf
Warnings    277,445
Errors    0
Number of lines (input file)    3,285,995
Number of variants (before filter)    3,307,740
Number of not variants (i.e. reference equals alternative)    0
Number of variants processed (i.e. after filter and non-variants)    3,307,740
Number of known variants (i.e. non-empty ID)    0 ( 0% )
Number of multi-allelic VCF entries (i.e. more than two alleles)    21,745
Number of effects    3,713,809
Genome total length    2,004,047,047
Genome effective length    1,993,779,170
Variant rate    1 variant every 602 bases

Impact            Count   Percent
HIGH              2,086    0.056%
LOW              19,376    0.522%
MODERATE         17,167    0.462%
MODIFIER      3,675,180    98.96%

Effect type                            Count  Percent
CODON_CHANGE_PLUS_CODON_DELETION         197   0.005%
CODON_CHANGE_PLUS_CODON_INSERTION        230   0.006%
CODON_DELETION                           182   0.005%
CODON_INSERTION                          234   0.006%
DOWNSTREAM                           198,619   5.341%
EXON                                   8,618   0.232%
EXON_DELETED                               4       0%
FRAME_SHIFT                            1,235   0.033%
GENE_FUSION                                1       0%
INTERGENIC                         2,152,231  57.871%
INTRAGENIC                            17,704   0.476%
INTRON                             1,092,449  29.375%
NON_SYNONYMOUS_CODING                 16,386   0.441%
NON_SYNONYMOUS_START                       5       0%
SPLICE_SITE_ACCEPTOR                     389    0.01%
SPLICE_SITE_DONOR                        452   0.012%
SPLICE_SITE_REGION                     4,215   0.113%
START_GAINED                             380    0.01%
START_LOST                                33   0.001%
STOP_GAINED                              247   0.007%
STOP_LOST                                116   0.003%
SYNONYMOUS_CODING                     16,075   0.432%
SYNONYMOUS_STOP                           17       0%
TRANSCRIPT                                59   0.002%
UPSTREAM                             194,068   5.218%
UTR_3_DELETED                              1       0%
UTR_3_PRIME                           11,901    0.32%
UTR_5_DELETED                              2       0%
UTR_5_PRIME                            2,965    0.08%

ADD REPLY • link updated 5.7 years ago by Pierre Lindenbaum 166k • written 5.7 years ago by Mohammed ▴ 10

The number of modifiers is higher than the number of variants,

Of course: there is more than one prediction per variant (alternative transcripts, etc.).
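
To see this on individual records, the annotations attached to each VCF line can be counted. The sketch below is only an illustration in Python: `count_effects` is a made-up helper, the INFO string is invented, and it assumes the `EFF=` key written with `-classic -formatEff`, with effects separated by commas and no commas inside an individual effect entry. The last lines redo the arithmetic from the summary above.

```python
def count_effects(info_field, key="EFF"):
    """Return how many comma-separated effects the INFO column holds under `key`."""
    for entry in info_field.split(";"):
        if entry.startswith(key + "="):
            value = entry[len(key) + 1:]
            return len(value.split(","))
    return 0  # record carries no annotation under this key


# Invented INFO column: one variant annotated against two transcripts
example_info = (
    "DP=30;"
    "EFF=INTRON(MODIFIER||||||||t1|1|1),UPSTREAM(MODIFIER||||||||t2||1)"
)
n_effects = count_effects(example_info)
print(n_effects)

# Numbers taken from the SnpEff summary posted above
total_effects = 3_713_809
total_variants = 3_307_740
effects_per_variant = total_effects / total_variants
print(f"{effects_per_variant:.2f} effects per variant on average")
```

An average above 1 is therefore expected whenever some variants fall within reach of more than one transcript or gene.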

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 166k

the gtf file doesn't contain intron or intergenic regions. Check the following output:

They are inferred from the exons and the genes.
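
As a rough illustration of that inference (a Python sketch of the general idea, not SnpEff's actual implementation), introns can be derived as the gaps between consecutive exons of one transcript. The function name `infer_introns` and the coordinates are made up, and 1-based inclusive coordinates are assumed, as in GFF/GTF.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons of a single transcript.

    `exons` is a list of (start, end) pairs, 1-based and inclusive.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only a real gap between neighbouring exons counts as an intron
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Made-up transcript with three exons
example_exons = [(100, 200), (300, 400), (550, 600)]
introns = infer_introns(example_exons)
print(introns)
```

Intergenic regions follow the same logic one level up: they are whatever lies between annotated genes, so neither feature type needs to appear in the gtf file.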

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 166k

How can I resolve this and get a better prediction?

ADD REPLY • link 5.7 years ago by Mohammed ▴ 10

What do you want to resolve?

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 166k

I'm trying to reproduce a published work to verify the variant-calling pipeline. I downloaded the raw sequencing data and followed the same steps they did: 1) map to the reference genome (I got the same results); 2) create realignment targets and realign around indels; 3) apply Base Quality Score Recalibration; 4) call variants (HaplotypeCaller).

I got a VCF file whose variant counts (SNPs and indels) are close to those in the publication. However, after following the manual's instructions for building a SnpEff database, I got variant effect predictions that differ from the publication.

ADD REPLY • link 5.7 years ago by Mohammed ▴ 10

Did you filter your vcf?

ADD REPLY • link 5.7 years ago by Nicolas Rosewick 11k

Yes. Here is an example of the difference: I got STOP_GAINED: 247, whereas the published work reports STOP_GAINED: 1343. There are differences in several other annotations as well.

ADD REPLY • link 5.7 years ago by Mohammed ▴ 10

How did you filter your vcf? Did you use the same filters as the original paper?

ADD REPLY • link 5.7 years ago by Nicolas Rosewick 11k

Yes, using bcftools. But even before filtering, I tested SnpEff on the raw variants, and I still got lower numbers for STOP_GAINED and many other effects, and higher numbers for others (MODIFIER).

ADD REPLY • link 5.7 years ago by Mohammed ▴ 10

Please add comments via Add comment. The answer box is intended for answers. That will keep the thread logically organized.

ADD REPLY • link 5.7 years ago by ATpoint 88k