Question

building snpeff database for plant

0

Entering edit mode

4 weeks ago

analyst ▴ 70

I am building snpeff database for NCBI refseq genome. I dwnloaded genes.gtf genes.gff cds.fa protein.fa and sequences.fa from same ftp link.

But i got this error while running either of following command:

java -Xmx20g -jar snpEff.jar build -gff3 -v genome_v1

java -Xmx20g -jar snpEff.jar build -v genome_v1

FATAL ERROR: No CDS checked. This might be caused by differences in
FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):  'XM_006431091.2'
    'XM_024182252.1'    'XM_024182438.1'    'XM_024182439.1'
    'XM_006429412.2'    'XM_024182149.1'    'XM_006430710.2'
    'XM_024182060.1'    'XM_006428645.2'    'XM_006445612.2'
    'XM_006445611.2'    'XM_006445615.2'    'XM_006430691.2'
    'XM_006430692.2'    'XM_024182465.1'    'XM_006429470.2'
    'XM_006429473.2'    'XM_006446606.2'    'XM_024182649.1'
    'XM_006446610.2'    'XM_006428812.2'    'XM_006428813.2' 
Transcript IDs from database (fasta file):
    'lcl|NW_006263303.1_cds_XP_006452193.1_1970'
    'lcl|NW_006262339.1_cds_XP_006440622.1_9102'    '2_24440'
    'lcl|NW_006262139.1_cds_XP_024038273.1_21673'
    'lcl|NW_006262339.1_cds_XP_006442058.1_10476'
    'lcl|NW_006262022.1_cds_XP_006423168.1_27795'
    'lcl|NW_006263303.1_cds_XP_006451035.1_891'
    'lcl|NW_006262688.1_cds_XP_006448013.1_5822'
    'lcl|NW_006262339.1_cds_XP_006444052.1_12383'
    'lcl|NW_006262339.1_cds_XP_006443503.1_11819'
    'lcl|NW_006262688.1_cds_XP_024046435.1_3831'
    'lcl|NW_006263303.1_cds_XP_024033779.1_2199'
    'lcl|NW_006263303.1_cds_XP_006453434.1_3099'
    'lcl|NW_006262274.1_cds_XP_024042020.1_15390'   '2_6397'
    'lcl|NW_006262274.1_cds_XP_024042047.1_15690'
    'lcl|NW_006262688.1_cds_XP_006449794.1_7524'
    'lcl|NW_006262339.1_cds_XP_006445476.2_13697'
    'lcl|NW_006262201.1_cds_XP_006433744.1_19705'
    'lcl|NW_006262688.1_cds_XP_006445695.1_3582'
    'lcl|NW_006263303.1_cds_XP_006450104.1_40'
    'lcl|NW_006262274.1_cds_XP_006439100.1_17079'

My question is why this error when we download files from same source same version? Is it because I am using NCBI refseq files ?

plant • 823 views

ADD COMMENT • link updated 20 days ago by lieven.sterck 15k • written 4 weeks ago by analyst ▴ 70

2

Entering edit mode

Can you indicate which specific files you downloaded from that link?

ADD REPLY • link 4 weeks ago by lieven.sterck 15k

0

Entering edit mode

sequences.fa (genomic.fna.gz file)
genes.gff.gz (genomic.gff.gz file)
genes.gtf.gz (genomic.gtf.gz file)
protein.fa.gz (protein.faa.gz file)
cds.fa.gz (cds_from_genomic.fna.gz file)

ADD REPLY • link 4 weeks ago by analyst ▴ 70

2

Entering edit mode

I had a quick look and though it is feasible to change files (and specifically feature names) such that the files correspond to each other (eg. removing the part from the cds ID up to 1_ , before the cds part) I suggest to contact the NCBI staff and ask why these does not correspond and/or what files you need to get to have a nice linked set of IDs.

That is given that you did the snpEff part commands and config correctly. Perhaps you made a mix-up in the config setttings?

ADD REPLY • link 4 weeks ago by lieven.sterck 15k

0

Entering edit mode

Is it safe to add -noCheckCds -noCheckProtein in the command or will it skip important information in the resultant database:

java -Xmx20g -jar snpEff.jar build -gff3 -noCheckCds -noCheckProtein -v cit_clementina_v1

Running above command completed successfully with few warnings.

Your valuable suggestions please?

ADD REPLY • link 4 weeks ago by analyst ▴ 70

0

Entering edit mode

That certainly is an option indeed. What those checks will do is to check if the sequence they derive from parsing the GTF/GFF file is compliant with the info that is provided in the fasta file version for the CDS and protein. And likely flag them if they don't but they should not do anything otherwise

So if you are confident enough in the info from the GTF/GFF file you can indeed omit those checks.

ADD REPLY • link 29 days ago by lieven.sterck 15k

0

Entering edit mode

I used -noCheckCds -noCheckProtein parameters because I was getting this error:

FATAL ERROR: No CDS checked. This might be caused by differences in FASTA file transcript IDs respect to database's transcript's IDs.

ADD REPLY • link 28 days ago by analyst ▴ 70

0

Entering edit mode

Please provide your suggestions should i used built-in database for annotation of my variants? I was building database because built in database is for old version and i used updated version of my genome for analysis

ADD REPLY • link 28 days ago by analyst ▴ 70

2

Entering edit mode

If the cmdline you mentioned above is working for you, then yes go ahead with that one!

If the version you are now using of your genome is quite different from the one in the DB you will be better off to build your own updated DB indeed, for instance if the genome and the corresponding gene annotation has had a major update you likely will have to build your own version (if the changes are only minor, you could get away with the old version I guess).

If you decide to go for the old built-in version I suggest you do all the pre-processing steps of your analysis again (eg. read mapping etc) using the same version as the one in the built-in DB.

ADD REPLY • link 28 days ago by lieven.sterck 15k

0

Entering edit mode

Dear lieven i renamed the headers in my cds fasta file: now i don't get that CDS not found fatal error:

Protein check:  Citrus_clementina_v1    OK: 32133   Not found: 33015    Errors: 453 Error percentage: 1.3901675566194072%

but still warning:

WARNING_TRANSCRIPT_NOT_FOUND: Cannot find transcript
'rna-XM_006450018.2'. Created transcript 'rna-XM_006450018.2' and gene
'GENE_exon-XM_006450018.2-1' for
NW_006263303.1  Gnomon  EXON    130466  130611  -   dbxref :
GeneID:18056007,Genbank:XM_006450018.2  gbkey : mRNA    gene :
LOC18056007     id : exon-XM_006450018.2-1  parent : rna-XM_006450018.2
    product : serine/threonine-protein kinase STY13%2C transcript variant
X1  source : Gnomon     transcript_id : XM_006450018.2  type : exon .
File '/Apps/snpEff_latest_core/snpEff/./data/Citrus_clementina_v1/genes.gff'
line 403    'NW_006263303.1 Gnomon  exon    130467  130612.-    .   ID=exon-XM_006450018.2-1;Parent=rna-XM_006450018.2;Dbxref=GeneID:18056007,Genbank:XM_006450018.2;gbkey=mRNA;gene=LOC18056007;product=serine/threonine-protein kinase STY13%2C transcript variant X1;transcript_id=XM_006450018.2'

Is it okay to have that warning?

ADD REPLY • link updated 20 days ago by lieven.sterck 15k • written 20 days ago by analyst ▴ 70

0

Entering edit mode

goh, if there are not too many of those it could be fine indeed.

however, I see that you nearly got 50% of 'Not found' ones, that sounds too many unfortunately.

ADD REPLY • link 20 days ago by lieven.sterck 15k