Building a snpEff database for a plant genome
3 days ago
analyst ▴ 60

I am building a snpEff database for an NCBI RefSeq genome. I downloaded genes.gtf, genes.gff, cds.fa, protein.fa and sequences.fa from the same FTP link.

But I get this error when running either of the following commands:

java -Xmx20g -jar snpEff.jar build -gff3 -v genome_v1

java -Xmx20g -jar snpEff.jar build -v genome_v1

FATAL ERROR: No CDS checked. This might be caused by differences in
FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):  'XM_006431091.2'
    'XM_024182252.1'    'XM_024182438.1'    'XM_024182439.1'
    'XM_006429412.2'    'XM_024182149.1'    'XM_006430710.2'
    'XM_024182060.1'    'XM_006428645.2'    'XM_006445612.2'
    'XM_006445611.2'    'XM_006445615.2'    'XM_006430691.2'
    'XM_006430692.2'    'XM_024182465.1'    'XM_006429470.2'
    'XM_006429473.2'    'XM_006446606.2'    'XM_024182649.1'
    'XM_006446610.2'    'XM_006428812.2'    'XM_006428813.2' 
Transcript IDs from database (fasta file):
    'lcl|NW_006263303.1_cds_XP_006452193.1_1970'
    'lcl|NW_006262339.1_cds_XP_006440622.1_9102'    '2_24440'
    'lcl|NW_006262139.1_cds_XP_024038273.1_21673'
    'lcl|NW_006262339.1_cds_XP_006442058.1_10476'
    'lcl|NW_006262022.1_cds_XP_006423168.1_27795'
    'lcl|NW_006263303.1_cds_XP_006451035.1_891'
    'lcl|NW_006262688.1_cds_XP_006448013.1_5822'
    'lcl|NW_006262339.1_cds_XP_006444052.1_12383'
    'lcl|NW_006262339.1_cds_XP_006443503.1_11819'
    'lcl|NW_006262688.1_cds_XP_024046435.1_3831'
    'lcl|NW_006263303.1_cds_XP_024033779.1_2199'
    'lcl|NW_006263303.1_cds_XP_006453434.1_3099'
    'lcl|NW_006262274.1_cds_XP_024042020.1_15390'   '2_6397'
    'lcl|NW_006262274.1_cds_XP_024042047.1_15690'
    'lcl|NW_006262688.1_cds_XP_006449794.1_7524'
    'lcl|NW_006262339.1_cds_XP_006445476.2_13697'
    'lcl|NW_006262201.1_cds_XP_006433744.1_19705'
    'lcl|NW_006262688.1_cds_XP_006445695.1_3582'
    'lcl|NW_006263303.1_cds_XP_006450104.1_40'
    'lcl|NW_006262274.1_cds_XP_006439100.1_17079'

My question is: why does this error occur when all the files were downloaded from the same source and the same version? Is it because I am using NCBI RefSeq files?

plant • 370 views

Can you indicate which specific files you downloaded from that link?

  1. sequences.fa (genomic.fna.gz file)
  2. genes.gff.gz (genomic.gff.gz file)
  3. genes.gtf.gz (genomic.gtf.gz file)
  4. protein.fa.gz (protein.faa.gz file)
  5. cds.fa.gz (cds_from_genomic.fna.gz file)

I had a quick look, and although it is feasible to change the files (specifically the feature names) so that they correspond to each other (e.g. removing the part of the CDS ID up to 1_, before the cds part), I suggest contacting the NCBI staff to ask why these do not correspond and/or which files you need to get a properly linked set of IDs.
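
The ID reconciliation mentioned above could be sketched roughly as follows. This is only an illustration in Python, not something snpEff provides. It assumes the CDS FASTA IDs follow the layout visible in the error message (lcl|<sequence accession>_cds_<protein accession>_<number>) and that the GFF3 CDS rows carry protein_id=... and Parent=rna-<transcript ID> attributes, which should be verified against the actual files. The function names are hypothetical, and the IDs used in the test data are placeholders.

```python
import re

# Assumed header layout, taken from the sample IDs in the error message:
#   lcl|<sequence accession>_cds_<protein accession>_<running number>
CDS_HEADER = re.compile(r"^lcl\|\S+?_cds_([A-Z]{2}_\d+\.\d+)_\d+$")


def protein_id_from_header(header_id):
    """Return the protein accession embedded in a CDS FASTA ID, or None."""
    match = CDS_HEADER.match(header_id)
    return match.group(1) if match else None


def protein_to_transcript(gff_lines):
    """Map protein accession -> transcript ID using the CDS rows of a GFF3 file.

    Assumption: each CDS row has protein_id=... and Parent=rna-<transcript ID>.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        protein = attrs.get("protein_id")
        parent = attrs.get("Parent")
        if protein and parent:
            # Drop the assumed "rna-" prefix so the ID matches the transcript ID.
            mapping[protein] = parent[4:] if parent.startswith("rna-") else parent
    return mapping


def transcript_id_for_header(header_id, mapping):
    """Return the transcript ID for a CDS FASTA ID, or None if it cannot be resolved."""
    protein = protein_id_from_header(header_id)
    return mapping.get(protein) if protein else None
```

Headers that cannot be resolved return None, so they can be listed and inspected by hand rather than silently renamed.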

That is assuming you ran the snpEff commands and set up the config correctly. Perhaps there is a mix-up in the config settings?


Is it safe to add -noCheckCds -noCheckProtein to the command, or will that leave important information out of the resulting database?

java -Xmx20g -jar snpEff.jar build -gff3 -noCheckCds -noCheckProtein -v cit_clementina_v1

Running the above command completed successfully with a few warnings.

Your valuable suggestions please?


That is certainly an option. Those checks verify whether the sequences derived from parsing the GTF/GFF file agree with the CDS and protein sequences provided in the FASTA files. They will likely flag any that do not, but they should not do anything else.

So if you are confident enough in the information in the GTF/GFF file, you can indeed omit those checks.
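
To make the idea of such a check concrete, here is a minimal sketch of a comparison keyed by ID. It is not snpEff's actual implementation; the function name and report categories are hypothetical. It takes two dictionaries, one with sequences derived from the annotation and one read from the FASTA file, and sorts the annotation IDs into agreeing, disagreeing, and absent from the FASTA file.

```python
def compare_by_id(annotation_seqs, fasta_seqs):
    """Compare annotation-derived sequences with FASTA entries that share an ID.

    Both arguments are dicts of ID -> sequence string. Comparison is
    case-insensitive. This only illustrates the principle of the check.
    """
    report = {"ok": [], "mismatch": [], "not_in_fasta": []}
    for seq_id, seq in annotation_seqs.items():
        if seq_id not in fasta_seqs:
            report["not_in_fasta"].append(seq_id)
        elif fasta_seqs[seq_id].upper() == seq.upper():
            report["ok"].append(seq_id)
        else:
            report["mismatch"].append(seq_id)
    return report
```

If the two files use different ID schemes, every entry ends up in not_in_fasta and nothing is actually compared, which is the situation the "No CDS checked" message describes.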
