Question

snpEff error using custom database

21 months ago

khs960718 • 0

I am using the Streptococcus pneumoniae reference genome (NZ_CP020550) to build a custom snpEff database.

First, I downloaded the reference data (sequences.fa, genes.gff, protein.fa, cds.fa) from NCBI (https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/) using the command below:

curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1alpha/genome/accession/GCF_002076835.1/download?include_annotation_type=GENOME_GTF,GENOME_GFF,GENOME_GBFF,RNA_FASTA,CDS_FASTA,PROT_FASTA&filename=GCF_002076835.1.zip" -H "Accept: application/zip"

The data were saved to snpEff/data/Streptococcus_pneumoniae_gcf_002076835/, and the snpEff.config file was edited to add this entry:

#Streptococcus pneumoniae reference genome, gcf 002076835 gff
Streptococcus_pneumoniae_gcf_002076835.genome : Streptococcus_pneumoniae_gcf_002076835
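
Before building, it may be worth confirming that the config entry and the input files are where snpEff will look for them. The following is a minimal sketch, not part of snpEff: the function name `check_snpeff_inputs` is hypothetical, and it assumes the layout described above (a `data/<genome>/` directory under the snpEff directory holding the four files named in the question).

```shell
# Hypothetical helper (not part of snpEff): verify the config entry and inputs.
# Usage: check_snpeff_inputs <snpEff_dir> <genome_name>
check_snpeff_inputs() {
    csi_dir="$1"
    csi_genome="$2"
    csi_missing=0

    # The config should contain a line beginning with "<genome>.genome :"
    if ! grep -q "^${csi_genome}\.genome[[:space:]]*:" "${csi_dir}/snpEff.config" 2>/dev/null; then
        echo "missing config entry: ${csi_genome}.genome"
        csi_missing=1
    fi

    # File names as listed in the question; each must exist and be non-empty
    for csi_file in sequences.fa genes.gff cds.fa protein.fa; do
        if [ ! -s "${csi_dir}/data/${csi_genome}/${csi_file}" ]; then
            echo "missing or empty: data/${csi_genome}/${csi_file}"
            csi_missing=1
        fi
    done

    # Exit status 0 means everything was found, 1 means something is missing
    return "$csi_missing"
}
```

For example, `check_snpeff_inputs ./snpEff Streptococcus_pneumoniae_gcf_002076835` prints one line per problem and returns a non-zero status if anything is missing.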

Then I built the database:

java -jar snpEff.jar build -gff3 -v Streptococcus_pneumoniae_gcf_002076835
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'Streptococcus_pneumoniae_gcf_002076835'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'Streptococcus_pneumoniae_gcf_002076835'
00:00:00 Reading config file: /home/external/lys/younso/work/sgseq/pipeline/snpEff/snpEff.config
00:00:00 done
00:00:00 Reading GFF3 data file  : '/home/external/lys/younso/work/sgseq/pipeline/snpEff/./data/Streptococcus_pneumoniae_gcf_002076835/genes.gff'
00:00:00 Reading file '/home/external/lys/younso/work/sgseq/pipeline/snpEff/./data/Streptococcus_pneumoniae_gcf_002076835/genes.gff'
WARNING_TRANSCRIPT_NOT_FOUND: Exon's parent 'gene-SPNHU17_RS00005' is a Gene instead of a transcript. Created transcript 'TRANSCRIPT_gene-SPNHU17_RS00005' for NZ_CP020549.1    Protein Homology     CDS     196     1557    +
        dbxref : Genbank:WP_000660615.1,GeneID:66805161
        gbkey : CDS
        gene : dnaA
        go_function : DNA binding|0003677||IEA,DNA replication origin binding|0003688||IEA,ATP binding|0005524||IEA
        go_process : DNA replication initiation|0006270||IEA,regulation of DNA replication|0006275||IEA
        id : cds-WP_000660615.1
        inference : COORDINATES: similar to AA sequence:RefSeq:WP_004255267.1
        locus_tag : SPNHU17_RS00005
        name : WP_000660615.1
        ontology_term : GO:0006270,GO:0006275,GO:0003677,GO:0003688,GO:0005524
        parent : gene-SPNHU17_RS00005
        product : chromosomal replication initiator protein DnaA
        protein_id : WP_000660615.1
        source : Protein Homology
        transl_table : 11
        type : CDS
...
WARNING_GENE_NOT_FOUND: Gene 'null' (NZ_CP020549.1:20021-20825) does not include 'gene-SPNHU17_RS00280' (NZ_CP020549.1:45312-46643). Created new gene 'null.2' (NZ_CP020549.1:45312-46643). File '/home/external/lys/younso/work/sgseq/pipeline/snpEff/./data/Streptococcus_pneumoniae_gcf_002076835/genes.gff' line 129     'NZ_CP020549.1  RefSeq  pseudogene      45312   46643   .       +       .       ID=gene-SPNHU17_RS00280;Dbxref=GeneID:66805216;Name=SPNHU17_RS00280;end_range=46643,.;gbkey=Gene;gene_biotype=pseudogene;locus_tag=SPNHU17_RS00280;old_locus_tag=SPNHU17_00055;partial=true;pseudo=true'
...

Then, when I ran snpEff to annotate the VCF file, the following error messages appeared:

java -jar ./snpEff/snpEff.jar Streptococcus_pneumoniae_gcf_002076835 work_out/workpath/filtered_vcf_file.vcf
00:00:00 ERROR while connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_Streptococcus_pneumoniae_gcf_002076835.zip
00:00:00 ERROR while connecting to https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_Streptococcus_pneumoniae_gcf_002076835.zip
FATAL ERROR: Failed to download database from [https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_Streptococcus_pneumoniae_gcf_002076835.zip, https://snpeff.blob.core.windows.net/databases/v5_0/snpEff_v5_0_Streptococcus_pneumoniae_gcf_002076835.zip]

I don't know why snpEff didn't use the custom database (/snpEff/data/"custom database").
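
One thing that could be checked at this point is whether the build actually produced a database. As far as I know, snpEff tries to download a database only when it cannot find a built one locally, and a successful build writes a `snpEffectPredictor.bin` file into the genome's data directory. The sketch below assumes that file name and the directory layout described above; the function name `has_local_snpeff_db` is hypothetical.

```shell
# Hypothetical helper: succeeds only if a built database file is present.
# Usage: has_local_snpeff_db <snpEff_dir> <genome_name>
has_local_snpeff_db() {
    # Assumption: a successful "snpEff build" writes this file
    hld_file="$1/data/$2/snpEffectPredictor.bin"

    if [ -s "$hld_file" ]; then
        echo "found: $hld_file"
        return 0
    fi

    echo "not found: $hld_file (the build may have stopped before writing it)"
    return 1
}
```

If `has_local_snpeff_db ./snpEff Streptococcus_pneumoniae_gcf_002076835` reports "not found", the problem is more likely in the build step than in the annotation command.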

I also tried building the database from a GenBank file and with the NCBI script.

I followed the snpEff documentation on building databases, for GenBank (https://pcingola.github.io/SnpEff/se_build_db/#step-2-option-2-building-a-database-from-genbank-files) and for NCBI (https://pcingola.github.io/SnpEff/se_faq/#how-to-building-an-ncbi-genome-genbank-file).

I downloaded the GenBank data (sequence.gb) from NCBI (https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP020550.1/), renamed the file with mv, and built the database:

mv sequence.gb genes.gbk
java -jar snpEff.jar build -genbank -v Streptococcus_pneumoniae_gcf_002076835

I also built the database with the NCBI script:

./scripts/buildDbNcbi.sh NZ_CP020550

But the error message was the same.

How can I solve this problem?

snpEff database • 1.6k views

Updated 11 months ago by Ram 43k • written 21 months ago by khs960718 • 0

score 0 · Answer 1 · 2022-07-27

21 months ago

LChart 3.9k

If you are looking to supplement the standard annotations with an extra custom database, then you are running SnpEff in the correct way, and you need to figure out how to ensure it can download the databases from the public URLs -- or place the files in the local SnpEff cache.

If you are looking to annotate only with your custom database, you can specify -noGenome which will disable loading of the standard databases, and should bypass this error.

Thanks. I used the -noCheckCds and -noCheckProtein options, and then it worked.

"you need to figure out how to ensure it can download the databases from the public URLs -- or place the files in the local SnpEff cache."

The page at the URL describes how to download the data, so I didn't describe that step in detail, sorry.
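
For readers who land here with the same error, the fix described in this reply can be sketched as follows. The exact command that was run is not shown in the thread, so this assumes the two flags were added to the original `build -gff3` command (both are options of `snpEff build`). The wrapper name `snpeff_build_nocheck` is hypothetical; by default it only prints the command so it can be inspected first.

```shell
# Hypothetical wrapper: print (default) or run the build command with the
# CDS and protein sequence checks skipped.
# Usage: snpeff_build_nocheck <path/to/snpEff.jar> <genome_name> [run]
snpeff_build_nocheck() {
    sbn_jar="$1"
    sbn_genome="$2"
    sbn_mode="${3:-print}"

    # Assemble the command as positional parameters (local to this function)
    set -- java -jar "$sbn_jar" build -gff3 -noCheckCds -noCheckProtein -v "$sbn_genome"

    if [ "$sbn_mode" = "run" ]; then
        "$@"          # actually run the build
    else
        echo "$*"     # dry run: show the command only
    fi
}
```

Calling `snpeff_build_nocheck snpEff.jar Streptococcus_pneumoniae_gcf_002076835 run` would start the build. Skipping these checks means the database is not compared against cds.fa and protein.fa, so a mismatch between the annotation and the sequences would go unnoticed.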

Reply written 21 months ago by khs960718 • 0