Dear all,
Apologies if this has been answered elsewhere, but I couldn't find an answer to this problem.
I would like to retrieve all the predicted coding sequences for a species from the NCBI FTP. Let's say I go here. I know how to get all the predicted mRNAs (./RNA/Gnomon_mRNA.fsa) or all the predicted proteins (./protein/protein.fa), but I cannot find how to get the CDS... if that is possible at all? This can be done on the Ensembl FTP.
Thanks for any insight!
Thanks! But then I'd get the introns too, not only the CDS?
Yes, but they would be in lowercase (if I recall correctly). You can remove them that way.
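A minimal sketch of that filtering step in Python, assuming the introns really are lowercase in the file you download. Do check this first: lowercase is also commonly used for soft-masked repeats, in which case this would drop coding bases too. The function name strip_lowercase is just made up for illustration.

```python
def strip_lowercase(seq):
    """Keep only the non-lowercase characters of a sequence string."""
    # Assumption: lowercase bases are the intronic ones we want to drop.
    return "".join(base for base in seq if not base.islower())


if __name__ == "__main__":
    # Toy sequence: two uppercase "exons" around a lowercase "intron".
    print(strip_lowercase("ATGgtaagtCCC"))
```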
Alternatively, why not get the GFF file from here and then use the same bedtools getfasta method? You would need to figure out the longest transcript per gene, which is probably what you want.
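Something along these lines might work for picking the longest transcript. It is only a sketch: it assumes a GFF3-style file where mRNA rows have ID= and Parent= attributes and CDS rows have a single Parent= pointing at the mRNA (files with other feature types or comma-separated parents would need more handling). The function names longest_cds_per_gene and to_bed are hypothetical.

```python
from collections import defaultdict


def longest_cds_per_gene(gff_lines):
    """Return {gene_id: (transcript_id, [(chrom, start0, end, strand), ...])}."""
    tx_to_gene = {}
    cds_by_tx = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        chrom, _, ftype, start, end, _, strand, _, attrs = cols
        attr = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        if ftype == "mRNA" and "ID" in attr:
            tx_to_gene[attr["ID"]] = attr.get("Parent", attr["ID"])
        elif ftype == "CDS" and "Parent" in attr:
            # GFF is 1-based inclusive; BED is 0-based half-open.
            cds_by_tx[attr["Parent"]].append((chrom, int(start) - 1, int(end), strand))

    best = {}
    for tx, segs in cds_by_tx.items():
        gene = tx_to_gene.get(tx, tx)
        length = sum(e - s for _, s, e, _ in segs)
        # Keep the transcript with the largest total CDS length per gene.
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, tx, sorted(segs, key=lambda s: s[1]))
    return {g: (tx, segs) for g, (_, tx, segs) in best.items()}


def to_bed(best):
    """Format the selected CDS segments as BED6 lines named by transcript."""
    lines = []
    for _gene, (tx, segs) in best.items():
        for chrom, s, e, strand in segs:
            lines.append(f"{chrom}\t{s}\t{e}\t{tx}\t0\t{strand}")
    return lines
```

You could then write the lines to a file and run something like bedtools getfasta -fi genome.fa -bed cds.bed -s -name (genome.fa and cds.bed are placeholder file names). Note that each CDS segment comes out as a separate record, so the pieces still need to be joined per transcript, in reverse order for minus-strand genes.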
+1 for providing another method. I'm surprised though that it's not a built-in option! Would be more convenient :o)