3 days ago
analyst
I am building a snpEff database for an NCBI RefSeq genome. I downloaded genes.gtf, genes.gff, cds.fa, protein.fa and sequences.fa from the same FTP link.
But I got this error when running either of the following commands:
java -Xmx20g -jar snpEff.jar build -gff3 -v genome_v1
java -Xmx20g -jar snpEff.jar build -v genome_v1
FATAL ERROR: No CDS checked. This might be caused by differences in
FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample): 'XM_006431091.2'
'XM_024182252.1' 'XM_024182438.1' 'XM_024182439.1'
'XM_006429412.2' 'XM_024182149.1' 'XM_006430710.2'
'XM_024182060.1' 'XM_006428645.2' 'XM_006445612.2'
'XM_006445611.2' 'XM_006445615.2' 'XM_006430691.2'
'XM_006430692.2' 'XM_024182465.1' 'XM_006429470.2'
'XM_006429473.2' 'XM_006446606.2' 'XM_024182649.1'
'XM_006446610.2' 'XM_006428812.2' 'XM_006428813.2'
Transcript IDs from database (fasta file):
'lcl|NW_006263303.1_cds_XP_006452193.1_1970'
'lcl|NW_006262339.1_cds_XP_006440622.1_9102' '2_24440'
'lcl|NW_006262139.1_cds_XP_024038273.1_21673'
'lcl|NW_006262339.1_cds_XP_006442058.1_10476'
'lcl|NW_006262022.1_cds_XP_006423168.1_27795'
'lcl|NW_006263303.1_cds_XP_006451035.1_891'
'lcl|NW_006262688.1_cds_XP_006448013.1_5822'
'lcl|NW_006262339.1_cds_XP_006444052.1_12383'
'lcl|NW_006262339.1_cds_XP_006443503.1_11819'
'lcl|NW_006262688.1_cds_XP_024046435.1_3831'
'lcl|NW_006263303.1_cds_XP_024033779.1_2199'
'lcl|NW_006263303.1_cds_XP_006453434.1_3099'
'lcl|NW_006262274.1_cds_XP_024042020.1_15390' '2_6397'
'lcl|NW_006262274.1_cds_XP_024042047.1_15690'
'lcl|NW_006262688.1_cds_XP_006449794.1_7524'
'lcl|NW_006262339.1_cds_XP_006445476.2_13697'
'lcl|NW_006262201.1_cds_XP_006433744.1_19705'
'lcl|NW_006262688.1_cds_XP_006445695.1_3582'
'lcl|NW_006263303.1_cds_XP_006450104.1_40'
'lcl|NW_006262274.1_cds_XP_006439100.1_17079'
My question is: why does this error occur when all the files were downloaded from the same source and the same version? Is it because I am using NCBI RefSeq files?
Can you indicate which specific files you downloaded from that link?
I had a quick look. It is feasible to edit the files (specifically the feature names) so that they correspond to each other, e.g. by removing the leading part of each CDS ID, up to the 1_ just before the cds part. However, I suggest contacting the NCBI staff to ask why these IDs do not correspond and/or which files you need in order to get a consistently linked set of IDs.
That assumes you ran the snpEff commands and set up the config correctly. Perhaps there is a mix-up in the config settings?
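The renaming idea above could be sketched roughly as follows. This is only an illustration, assuming the CDS FASTA IDs follow the layout seen in the error message (contig, then _cds_, then an accession, then a running index). Note that the FASTA IDs shown carry XP_ (protein) accessions while the database sample lists XM_ (transcript) accessions, so stripping the prefix alone may not be enough. The protein_to_transcript dictionary is hypothetical and would have to be built from the annotation file.

```python
import re

# Assumed header layout, taken from the IDs in the error message:
#   lcl|<contig>_cds_<accession>_<running index>
CDS_HEADER = re.compile(r"_cds_([A-Z]{2}_\d+\.\d+)_\d+$")


def extract_accession(fasta_id):
    """Return the accession embedded in an NCBI-style CDS FASTA ID, or None."""
    match = CDS_HEADER.search(fasta_id)
    return match.group(1) if match else None


def rename_cds_headers(lines, protein_to_transcript):
    """Yield FASTA lines, replacing each header with its transcript ID.

    protein_to_transcript is a hypothetical dict mapping a protein accession
    (XP_...) to the matching transcript accession (XM_...).
    Headers that cannot be mapped are passed through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            fasta_id = fields[0] if fields else ""
            new_id = protein_to_transcript.get(extract_accession(fasta_id))
            if new_id is not None:
                line = ">" + new_id + "\n"
        yield line


if __name__ == "__main__":
    print(extract_accession("lcl|NW_006263303.1_cds_XP_006452193.1_1970"))
    # prints XP_006452193.1
```

Leaving unmapped headers untouched makes it easy to see afterwards which records still fail to match, rather than silently dropping them.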
Is it safe to add -noCheckCds -noCheckProtein to the command, or will that leave important information out of the resulting database?
Running the command with these options completed successfully, with a few warnings.
Your valuable suggestions please?
That certainly is an option. Those checks compare the CDS and protein sequences that snpEff derives from parsing the GTF/GFF file against the sequences provided in the CDS and protein FASTA files. They will likely flag any entries that do not agree, but they should not do anything else.
So if you are confident enough in the info from the GTF/GFF file you can indeed omit those checks.
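To make the role of the ID match concrete, here is a simplified sketch of this kind of check. It is not snpEff's actual implementation; the function name check_cds and the sequences are placeholders for illustration. It shows why nothing gets checked when no FASTA ID equals any transcript ID.

```python
def check_cds(derived, fasta):
    """Compare annotation-derived CDS sequences with FASTA sequences.

    Both arguments map an ID to a sequence. Returns (checked, mismatched):
    checked is the number of IDs present in both inputs, and mismatched is
    the sorted list of shared IDs whose sequences differ.
    """
    shared = derived.keys() & fasta.keys()
    mismatched = sorted(
        seq_id for seq_id in shared
        if derived[seq_id].upper() != fasta[seq_id].upper()
    )
    return len(shared), mismatched


if __name__ == "__main__":
    # IDs copied from the error message; the sequences are made up.
    derived = {"XM_006431091.2": "ATGAAATAA"}
    fasta = {"lcl|NW_006263303.1_cds_XP_006452193.1_1970": "ATGAAATAA"}
    print(check_cds(derived, fasta))
    # prints (0, [])
```

With no shared IDs the check has nothing to compare, which corresponds to the "No CDS checked" situation reported above.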