I am annotating a VCF with snpEff, and eventually I want to parse the annotations for predicted loss-of-function variants.
I want to understand the annotations better and document how they are produced.
I run this command:
snpEff "hg38" -lof {input}
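For context, the `-lof` option makes snpEff add `LOF` and `NMD` tags to the INFO column, and the VCF header describes their format as `Gene_Name | Gene_ID | Number_of_transcripts_in_gene | Percent_of_transcripts_affected`. The downstream parsing could look roughly like the sketch below. The function name `parse_lof`, the dictionary keys, and the example INFO string are made up for illustration; only the four-field layout comes from the snpEff header.

```python
import re

# One LOF entry looks like:
# (Gene_Name|Gene_ID|Number_of_transcripts_in_gene|Percent_of_transcripts_affected)
LOF_ENTRY = re.compile(r"\(([^|()]*)\|([^|()]*)\|([^|()]*)\|([^|()]*)\)")


def parse_lof(info):
    """Return one dict per LOF entry found in a VCF INFO string.

    Hypothetical helper: returns an empty list when no LOF tag is present.
    """
    for field in info.split(";"):
        if field.startswith("LOF="):
            value = field[len("LOF="):]
            return [
                {
                    "gene_name": name,
                    "gene_id": gene_id,
                    "n_transcripts": int(n_transcripts),
                    "transcripts_affected": float(affected),
                }
                for name, gene_id, n_transcripts, affected in LOF_ENTRY.findall(value)
            ]
    return []


# Made-up INFO string, not taken from real output.
example_info = "AC=1;LOF=(GENE1|GENE1|3|0.67);NMD=(GENE1|GENE1|3|0.33)"
print(parse_lof(example_info))
```

Several genes can appear in one tag as comma-separated parenthesised entries, which is why the sketch collects every match instead of stopping at the first.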
From what I read in the docs, hg38 is:
hg38: UCSC genome with RefSeq transcripts mapped to GRCh38/hg38 reference genome sequence
When I run snpEff databases | grep "hg38" I get:
hg38 Homo_sapiens (UCSC) OK [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_hg38.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_hg38.zip]
This further supports that the database is the UCSC one.
I think when I run snpEff it is loading "hg38" from ~/miniconda3/envs/share/snpeff-5.1-2/data/hg38, which contains these files:
cytoBand.txt.gz sequence.15_KI270905v1_alt.bin sequence.1.bin sequence.5_KI270897v1_alt.bin sequence.7.bin
pwms.bin sequence.16.bin sequence.20.bin sequence.6.bin sequence.8.bin
sequence.10.bin sequence.16_KI270853v1_alt.bin sequence.21.bin sequence.6_GL000250v2_alt.bin sequence.9.bin
sequence.11.bin sequence.17.bin sequence.22.bin sequence.6_GL000251v2_alt.bin sequence.bin
sequence.12.bin sequence.17_GL000258v2_alt.bin sequence.2.bin sequence.6_GL000252v2_alt.bin sequence.X.bin
sequence.13.bin sequence.17_KI270857v1_alt.bin sequence.3.bin sequence.6_GL000253v2_alt.bin sequence.Y.bin
sequence.14.bin sequence.17_KI270908v1_alt.bin sequence.4.bin sequence.6_GL000254v2_alt.bin snpEffectPredictor.bin
sequence.14_KI270847v1_alt.bin sequence.18.bin sequence.5.bin sequence.6_GL000255v2_alt.bin
sequence.15.bin sequence.19.bin sequence.5_GL339449v2_alt.bin sequence.6_GL000256v2_alt.bin
These are mostly binary, so I can't really tell what is in them.
I have a few questions:
- Is any of my understanding above off?
- How can I tell which version of RefSeq is being used? Wouldn't the annotation be updated over time as new splice sites etc. are discovered?
- Is RefSeq even desirable for looking at predicted loss of function, or are there other annotation systems (e.g. MANE, GENCODE, Ensembl) that the field is adopting? It seems like MANE might be the best to work with if the cost of being wrong is high, as it pulls together annotations from RefSeq and Ensembl, iiuc.
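One rough way to check which transcript set the annotations actually come from is to look at the Feature_ID subfield of the `ANN` tag in the annotated VCF. RefSeq accessions start with prefixes such as `NM_` or `NR_`, Ensembl transcripts start with `ENST`, and a trailing `.N` is the transcript version when the database keeps it. The sketch below assumes the standard pipe-separated `ANN` layout (Feature_Type is the 6th subfield, Feature_ID the 7th). The helper names and the example INFO string are hypothetical.

```python
def transcript_source(feature_id):
    """Guess the annotation source from a transcript accession prefix."""
    if feature_id.startswith(("NM_", "NR_", "XM_", "XR_")):
        return "RefSeq"
    if feature_id.startswith("ENST"):
        return "Ensembl/GENCODE"
    return "other"


def ann_transcripts(info):
    """Yield (Feature_ID, source) for each transcript-level ANN entry."""
    for field in info.split(";"):
        if field.startswith("ANN="):
            # Multiple annotations for one variant are comma-separated.
            for entry in field[len("ANN="):].split(","):
                parts = entry.split("|")
                # parts[5] is Feature_Type, parts[6] is Feature_ID.
                if len(parts) > 6 and parts[5] == "transcript":
                    yield parts[6], transcript_source(parts[6])


# Made-up annotation, not taken from real output.
example_info = (
    "AC=1;ANN=A|stop_gained|HIGH|GENE1|GENE1|transcript|NM_000000.1|"
    "protein_coding|2/5|c.10C>T|p.Gln4*|10/100|10/90|4/30||"
)
print(list(ann_transcripts(example_info)))
```

This only shows which accession namespace and transcript versions appear in the output. It does not reveal the RefSeq release the database was built from.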