Question

Where do these snpEff annotations come from?

11 months ago

curious ▴ 810

I am annotating a VCF with annotations from snpEff, which I eventually want to parse for predicted loss-of-function variants.

I want to understand the annotations better and document how they are produced.

I run this command:

snpEff "hg38" -lof {input}
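For the eventual parsing step, here is a minimal sketch of pulling the loss-of-function tag out of an annotated record. It assumes the `LOF` INFO field has the layout described in the snpEff documentation, `LOF=(Gene_Name|Gene_ID|Number_of_transcripts_in_gene|Percent_of_transcripts_affected)`. The function name `parse_lof` and the example INFO string are hypothetical and are not taken from real output.

```python
import re

# One LOF entry, assuming the layout from the snpEff docs:
# (Gene_Name|Gene_ID|Number_of_transcripts_in_gene|Percent_of_transcripts_affected)
LOF_ENTRY = re.compile(r"\(([^|()]*)\|([^|()]*)\|([^|()]*)\|([^|()]*)\)")


def parse_lof(info):
    """Return one dict per LOF entry found in a VCF INFO string (hypothetical helper)."""
    for field in info.split(";"):
        if field.startswith("LOF="):
            return [
                {
                    "gene": gene,
                    "gene_id": gene_id,
                    "n_transcripts": int(n_transcripts),
                    "fraction_affected": float(fraction),
                }
                for gene, gene_id, n_transcripts, fraction in LOF_ENTRY.findall(field[4:])
            ]
    return []  # no LOF tag on this record


# Made-up INFO column, for illustration only.
example_info = "AC=1;ANN=T|stop_gained|HIGH|GENE1;LOF=(GENE1|GENE1|3|0.67)"
lof_entries = parse_lof(example_info)
```

Records without a `LOF` tag return an empty list, so the same function can be used to filter a whole file.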

From what I read in the docs, hg38 is:

hg38: UCSC genome with RefSeq transcripts mapped to GRCh38/hg38 reference genome sequence

When I run snpEff databases | grep "hg38" I get:

hg38     Homo_sapiens (UCSC)    OK [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_hg38.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_hg38.zip]

Which further supports that this is UCSC.

I think when I run snpEff it is calling "hg38" from here: ~/miniconda3/envs/share/snpeff-5.1-2/data/hg38, which contains these files:

cytoBand.txt.gz                 sequence.15_KI270905v1_alt.bin  sequence.1.bin                 sequence.5_KI270897v1_alt.bin  sequence.7.bin
pwms.bin                        sequence.16.bin                 sequence.20.bin                sequence.6.bin                 sequence.8.bin
sequence.10.bin                 sequence.16_KI270853v1_alt.bin  sequence.21.bin                sequence.6_GL000250v2_alt.bin  sequence.9.bin
sequence.11.bin                 sequence.17.bin                 sequence.22.bin                sequence.6_GL000251v2_alt.bin  sequence.bin
sequence.12.bin                 sequence.17_GL000258v2_alt.bin  sequence.2.bin                 sequence.6_GL000252v2_alt.bin  sequence.X.bin
sequence.13.bin                 sequence.17_KI270857v1_alt.bin  sequence.3.bin                 sequence.6_GL000253v2_alt.bin  sequence.Y.bin
sequence.14.bin                 sequence.17_KI270908v1_alt.bin  sequence.4.bin                 sequence.6_GL000254v2_alt.bin  snpEffectPredictor.bin
sequence.14_KI270847v1_alt.bin  sequence.18.bin                 sequence.5.bin                 sequence.6_GL000255v2_alt.bin
sequence.15.bin                 sequence.19.bin                 sequence.5_GL339449v2_alt.bin  sequence.6_GL000256v2_alt.bin

These are mostly binary, so I can't really tell what is happening.

I have a few questions:

Is any of my understanding above off?
How can I tell which version of RefSeq is being used? Wouldn't these be updated over time as new splice sites etc. are discovered?
Is RefSeq even desirable for looking at predicted loss of function, or are there other annotation systems (e.g. MANE, GENCODE, Ensembl) that the field is adopting? It seems like MANE might be the best to work with if the cost of being wrong is high, as it pulls together annotations from RefSeq and Ensembl, iiuc.

updated 11 months ago by Istvan Albert 101k • written 11 months ago by curious ▴ 810

score 0 · Answer 1 · 2023-11-30

In principle you can run

snpeff dump hg38

which will generate a text output of the database contents. In practice, when I do the above I get

snpeff dump hg38 | more
java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
        at org.snpeff.interval.Intron.createSpliceSiteDonor(Intron.java:104)
        at org.snpeff.interval.Transcript.createSpliceSites(Transcript.java:713)
        at org.snpeff.interval.Genes.createSpliceSites(Genes.java:129)

... LOL ... we can't even unpack the database without a memory error. Oh well, let's bump up snpeff's memory then:

snpeff -Xmx4g dump hg38 | more

Among the information, we can find accession IDs like NR_024540.1.

So in the end it is not the RefSeq release that matters but the version of each individual record: the .1 suffix on the accession is the version number of that specific transcript record.
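To document which record versions a database contains, the dump text can be scanned for versioned accessions. This is a rough sketch that makes no assumption about the column layout of the dump output; it simply looks for RefSeq-style `accession.version` tokens anywhere in each line. The name `accession_versions` and the sample lines are hypothetical (only NR_024540.1 comes from the dump above).

```python
import re

# RefSeq-style transcript accessions with a version suffix, e.g. NR_024540.1
# (NM_/NR_ are curated records, XM_/XR_ are predicted ones).
REFSEQ_ACCESSION = re.compile(r"\b([NX][MR]_\d+)\.(\d+)\b")


def accession_versions(lines):
    """Collect sorted (accession, version) pairs seen in an iterable of text lines."""
    seen = set()
    for line in lines:
        for accession, version in REFSEQ_ACCESSION.findall(line):
            seen.add((accession, int(version)))
    return sorted(seen)


# Made-up lines standing in for dump output; the real format may differ.
sample_lines = [
    "1\t14361\t29370\tTranscript\tNR_024540.1",
    "some line without any accession",
]
pairs = accession_versions(sample_lines)
```

On real data the function could be fed `sys.stdin` with the output of the dump command piped in, and the resulting table saved alongside the analysis as a record of which transcript versions were used.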