Entering edit mode
16 days ago
analyst
▴
60
I am building snpeff database for NCBI refseq genome. I dwnloaded genes.gtf genes.gff cds.fa protein.fa and sequences.fa from same ftp link.
But i got this error while running either of following command:
java -Xmx20g -jar snpEff.jar build -gff3 -v genome_v1
java -Xmx20g -jar snpEff.jar build -v genome_v1
FATAL ERROR: No CDS checked. This might be caused by differences in
FASTA file transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample): 'XM_006431091.2'
'XM_024182252.1' 'XM_024182438.1' 'XM_024182439.1'
'XM_006429412.2' 'XM_024182149.1' 'XM_006430710.2'
'XM_024182060.1' 'XM_006428645.2' 'XM_006445612.2'
'XM_006445611.2' 'XM_006445615.2' 'XM_006430691.2'
'XM_006430692.2' 'XM_024182465.1' 'XM_006429470.2'
'XM_006429473.2' 'XM_006446606.2' 'XM_024182649.1'
'XM_006446610.2' 'XM_006428812.2' 'XM_006428813.2'
Transcript IDs from database (fasta file):
'lcl|NW_006263303.1_cds_XP_006452193.1_1970'
'lcl|NW_006262339.1_cds_XP_006440622.1_9102' '2_24440'
'lcl|NW_006262139.1_cds_XP_024038273.1_21673'
'lcl|NW_006262339.1_cds_XP_006442058.1_10476'
'lcl|NW_006262022.1_cds_XP_006423168.1_27795'
'lcl|NW_006263303.1_cds_XP_006451035.1_891'
'lcl|NW_006262688.1_cds_XP_006448013.1_5822'
'lcl|NW_006262339.1_cds_XP_006444052.1_12383'
'lcl|NW_006262339.1_cds_XP_006443503.1_11819'
'lcl|NW_006262688.1_cds_XP_024046435.1_3831'
'lcl|NW_006263303.1_cds_XP_024033779.1_2199'
'lcl|NW_006263303.1_cds_XP_006453434.1_3099'
'lcl|NW_006262274.1_cds_XP_024042020.1_15390' '2_6397'
'lcl|NW_006262274.1_cds_XP_024042047.1_15690'
'lcl|NW_006262688.1_cds_XP_006449794.1_7524'
'lcl|NW_006262339.1_cds_XP_006445476.2_13697'
'lcl|NW_006262201.1_cds_XP_006433744.1_19705'
'lcl|NW_006262688.1_cds_XP_006445695.1_3582'
'lcl|NW_006263303.1_cds_XP_006450104.1_40'
'lcl|NW_006262274.1_cds_XP_006439100.1_17079'
My question is why this error when we download files from same source same version? Is it because I am using NCBI refseq files ?
Can you indicate which specific files you downloaded from that link?
I had a quick look and though it is feasible to change files (and specifically feature names) such that the files correspond to each other (eg. removing the part from the cds ID up to 1_ , before the cds part) I suggest to contact the NCBI staff and ask why these does not correspond and/or what files you need to get to have a nice linked set of IDs.
That is given that you did the snpEff part commands and config correctly. Perhaps you made a mix-up in the config setttings?
Is it safe to add -noCheckCds -noCheckProtein in the command or will it skip important information in the resultant database:
Running above command completed successfully with few warnings.
Your valuable suggestions please?
That certainly is an option indeed. What those checks will do is to check if the sequence they derive from parsing the GTF/GFF file is compliant with the info that is provided in the fasta file version for the CDS and protein. And likely flag them if they don't but they should not do anything otherwise
So if you are confident enough in the info from the GTF/GFF file you can indeed omit those checks.
I used -noCheckCds -noCheckProtein parameters because I was getting this error:
Please provide your suggestions should i used built-in database for annotation of my variants? I was building database because built in database is for old version and i used updated version of my genome for analysis
If the cmdline you mentioned above is working for you, then yes go ahead with that one!
If the version you are now using of your genome is quite different from the one in the DB you will be better off to build your own updated DB indeed, for instance if the genome and the corresponding gene annotation has had a major update you likely will have to build your own version (if the changes are only minor, you could get away with the old version I guess).
If you decide to go for the old built-in version I suggest you do all the pre-processing steps of your analysis again (eg. read mapping etc) using the same version as the one in the built-in DB.
Dear lieven i renamed the headers in my cds fasta file: now i don't get that CDS not found fatal error:
but still warning:
Is it okay to have that warning?
goh, if there are not too many of those it could be fine indeed.
however, I see that you nearly got 50% of 'Not found' ones, that sounds too many unfortunately.