Question: (Closed) Number of SnpEff variants before filter is larger than the actual number of variants
modi2020 (United States), 3.7 years ago, wrote:

Hi everyone,

I used SnpEff to annotate horse SNPs and indels and found a strange phenomenon in the HTML output.
The number of variants (before filter) is larger than the number of lines in the VCF file (which I understand to be the number of SNPs or indels). That is, SnpEff analyzes more variants than the VCF file actually contains. I don't know where SnpEff gets the extra variants. I tried to work this out myself but couldn't wrap my head around it. I would have attached an example HTML report, but I couldn't find an option to upload one. I would really appreciate any thoughts or ideas as to why this is the case.

Thank you

snpeff annotation • 1.6k views

Hello modi2020!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

Written 3.7 years ago by Daniel Swan
Ram (Houston, TX), 3.7 years ago, wrote:

Are you sure the number of lines you're counting corresponds to the number of variants? Perhaps you can compare the sorted sets of (position, REF, ALT) of the input and output files and ensure that each position is covered without any modification.
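The comparison suggested above could be sketched roughly as follows. This is a minimal illustration, not something from the thread: the function names are made up, the key also includes CHROM so that identical positions on different chromosomes stay distinct, and the input is assumed to be plain tab-separated VCF text.

```python
from collections import Counter

def variant_keys(lines):
    """Count (CHROM, POS, REF, ALT) keys in VCF lines, skipping headers."""
    keys = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header line
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys[(chrom, int(pos), ref, alt)] += 1
    return keys

def compare_vcfs(input_lines, output_lines):
    """Return (keys missing from the output, keys only in the output)."""
    before = variant_keys(input_lines)
    after = variant_keys(output_lines)
    # Counter subtraction keeps only keys with a positive remaining count.
    return before - after, after - before
```

With real files, something like `compare_vcfs(open("input.vcf"), open("annotated.vcf"))` would do (both file names are placeholders). Two empty counters mean every record made it through annotation with its position and alleles unchanged.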

modi2020 (United States), 3.7 years ago, wrote:

Thank you for your prompt reply, RamRs.

To clarify the matter a bit more: I am working with indel data called with GATK.

Also, the number of lines is 830,370, whereas the number of variants (before filter) is 841,444.

I sorted both the input and output VCF files and counted the number of lines (without the header, i.e. not counting lines starting with #). Both files contain 830,370 lines. So perhaps what SnpEff defines as a variant doesn't necessarily correspond to a line.

Also, the SnpEff output shows that the number of multi-allelic VCF entries (i.e. more than two alleles) is 10,811. Even when I add this number to the number of lines in the VCF file, the total (830,370 + 10,811 = 841,181) doesn't reach 841,444 variants.
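One way to probe the gap is to count ALT alleles rather than lines. The sketch below rests on an assumption that nobody in this thread has confirmed: that each ALT allele of a record is reported as its own variant. If that holds, a record with three ALT alleles contributes two extra variants rather than one, so adding the multi-allelic record count to the line count would undercount. The function name is hypothetical.

```python
def summarize_alts(lines):
    """Return (records, multi-allelic records, total ALT alleles) for VCF lines."""
    records = multiallelic = alt_alleles = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line or header line
        alt_field = line.rstrip("\n").split("\t")[4]
        # Every comma-separated ALT entry is counted, including the "*"
        # spanning-deletion placeholder that GATK can emit.
        alts = alt_field.split(",")
        records += 1
        alt_alleles += len(alts)
        if len(alts) > 1:
            multiallelic += 1
    return records, multiallelic, alt_alleles
```

If the third number comes out close to the variants (before filter) figure in the HTML summary, the extra variants most likely come from splitting multi-allelic records; if it doesn't, the difference must have another cause.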

Now, the rest of the SnpEff calculations are based on the 841,444 variants.

 

 
