I have a long list of RefSeq mRNA IDs for a particular organism. I wish to download all the corresponding coding sequences (CDS) in FASTA format, where available. Is there any suitable tool or script for doing this automatically?
Thanks in advance
Normally I would suggest BioMart for this purpose (assuming that your organism is in BioMart) but as I write, it is giving an error. However, here's the procedure for when they fix it:
Currently, this gives the error "Serious Error: Error during query execution: Table 'ensembl_mart_64.ox_RefSeq_mRNA__dm' doesn't exist" - I will report this to BioMart.
If you're willing to try an in-development library, you can use cruzdb with a script like this:
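    from cruzdb import Genome

    # Connect to the UCSC hg19 database and select the refGene table.
    db = Genome('hg19')
    refGene = db.refGene

    # names.txt holds one RefSeq mRNA ID per line.
    for name in (n.strip() for n in open("names.txt")):
        gene = refGene.filter_by(name=name).one()
        print(">%s" % name)
        print("".join(gene.cds_sequence))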
Here names.txt contains one ID per line, like:

    NM_001127388
    NM_001127389
It prints FASTA-formatted records by querying the UCSC genomes database (refGene table) and fetching the sequence from their DAS sequence server.
If you have a long list, see the notes on the cruzdb page about mirroring the MySQL tables locally.
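With a long list it may also help to tidy the input and wrap the output. The following is a minimal sketch, not part of cruzdb: `read_ids` and `fasta_record` are hypothetical helper names, and the 60-column width is just a common convention. It skips blank lines and repeated IDs, and formats each record with the sequence wrapped across lines.

```python
def read_ids(lines):
    """Return IDs from an iterable of lines, skipping blanks and duplicates."""
    seen = set()
    ids = []
    for line in lines:
        name = line.strip()
        if not name or name in seen:
            continue  # ignore empty lines and IDs already seen
        seen.add(name)
        ids.append(name)
    return ids


def fasta_record(name, sequence, width=60):
    """Format one FASTA record, wrapping the sequence at `width` columns."""
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">%s\n%s" % (name, "\n".join(chunks))
```

In the script above, you could then loop over `read_ids(open("names.txt"))` and print `fasta_record(name, "".join(gene.cds_sequence))` in place of the two print calls.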