I'm building a snpEff database for the GRCh38 patch 13 (p13) RefSeq assembly, using snpEff 4.3t (the latest version). The latest pre-built snpEff database available for RefSeq is patch 7, and I need p13 for consistency within my pipelines. (Don't ask me why p13 isn't available in the pre-built snpEff library; they recommend using Ensembl, not RefSeq.)
I was able to follow the documentation and build a database from the RefSeq GTF and FASTA files from NCBI. However, there are still problems:
(Sorry, I deleted a portion of this question because I figured out I was running the command on another server where the config wasn't synced.)
Entries from my snpEff.config file:
    #data.dir = ./data/
    data.dir = /var/references/snpEff/
    ...
    # GRCh38 current release from NCBI's RefSeq should be p13 not p7
    GRCh38.p13.RefSeq.genome : Human genome GRCh38 using RefSeq transcripts
    #GRCh38.p13.RefSeq.reference : ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/
    GRCh38.p13.RefSeq.M.codonTable : Vertebrate_Mitochondrial
    GRCh38.p13.RefSeq.MT.codonTable : Vertebrate_Mitochondrial
Files within /var/references/snpEff/GRCh38.p13.RefSeq/:
    genes.gtf -> /var/references/ncbi/homo_sapiens/GRCh38/13/annotations/full.gtf
    sequences.fa -> /var/references/ncbi/homo_sapiens/GRCh38/13/sequences/full.fa
    snpEffectPredictor.bin
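For anyone reproducing this layout, here is a minimal Python sketch of how the two symlinks could be created. The helper name `link_genome_sources` is hypothetical (it is not part of snpEff), and the demo uses a throwaway temporary directory rather than the real paths listed above.

```python
import tempfile
from pathlib import Path


def link_genome_sources(data_dir, genome, gtf, fasta):
    """Create <data_dir>/<genome>/genes.gtf and sequences.fa as symlinks.

    Hypothetical helper: it only mirrors the directory layout shown above.
    """
    genome_dir = Path(data_dir) / genome
    genome_dir.mkdir(parents=True, exist_ok=True)
    for link_name, target in (("genes.gtf", gtf), ("sequences.fa", fasta)):
        link = genome_dir / link_name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link left by an earlier attempt
        link.symlink_to(Path(target).resolve())
    return genome_dir


if __name__ == "__main__":
    # Demo with empty placeholder files in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "ncbi"
        src.mkdir()
        (src / "full.gtf").write_text("")
        (src / "full.fa").write_text("")
        out = link_genome_sources(
            Path(tmp) / "snpEff", "GRCh38.p13.RefSeq",
            src / "full.gtf", src / "full.fa",
        )
        print(sorted(p.name for p in out.iterdir()))
```

The snpEffectPredictor.bin file is not created here; it only appears after the build step has run.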
The .bin file was produced by the build process described in the snpEff documentation, using NCBI's RefSeq GTF and FASTA files.
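For reference, the documented build step usually looks something like the sketch below. The exact flags used for this build are an assumption: `-gtf22` is snpEff's option for GTF 2.2 input, `-v` enables verbose output, and `-c` points at a config file. The jar location, the heap size, and the `build_command` helper name are illustrative, not taken from the original command.

```python
import subprocess
from pathlib import Path


def build_command(genome, jar="snpEff.jar", config=None, heap="8g"):
    """Assemble the argv for a snpEff GTF-based build (assumed flags)."""
    cmd = ["java", f"-Xmx{heap}", "-jar", str(jar), "build", "-gtf22", "-v"]
    if config is not None:
        cmd += ["-c", str(config)]  # use a non-default snpEff.config
    cmd.append(genome)  # must match the <name>.genome key in the config
    return cmd


if __name__ == "__main__":
    cmd = build_command("GRCh38.p13.RefSeq")
    print(" ".join(cmd))
    # Only run the build if the jar is actually present in this directory.
    if Path("snpEff.jar").exists():
        subprocess.run(cmd, check=True)
```

Passing the config path explicitly is one way to avoid picking up an out-of-date snpEff.config on a different machine.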
Should I build the contigs individually? The internal RefSeq dbs seem to be built as individual contigs, not a single one lumped together.
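Related to the contig question, a quick sanity check is to compare the sequence names in the FASTA headers with the first column of the GTF. This is only a sketch under the assumption that both files are plain text and that the name is the first whitespace-delimited token of each FASTA header. The function names are hypothetical and the demo data is made up for illustration.

```python
import io


def fasta_names(handle):
    """Return the set of sequence names from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_seqnames(handle):
    """Return the set of seqname values (column 1) from a GTF."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_contigs(fasta_handle, gtf_handle):
    """Contigs annotated in the GTF that have no FASTA record."""
    return sorted(gtf_seqnames(gtf_handle) - fasta_names(fasta_handle))


if __name__ == "__main__":
    # Tiny made-up inputs; real use would open sequences.fa and genes.gtf.
    fasta = io.StringIO(">NC_000001.11 chromosome 1\nACGT\n")
    gtf = io.StringIO(
        "#gtf-version 2.2\n"
        'NC_000001.11\tBestRefSeq\tgene\t1\t4\t.\t+\t.\tgene_id "A";\n'
        'NT_187361.1\tGnomon\tgene\t1\t4\t.\t+\t.\tgene_id "B";\n'
    )
    print(missing_contigs(fasta, gtf))
```

An empty result would mean every annotated contig has a sequence in the combined FASTA.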
What about protein and regulatory regions? I'm working on those next. Are they required to run snpEff?