I have a long list of RefSeq mRNA IDs for a particular organism. I wish to download all the corresponding coding sequences (CDS) in FASTA format, where available. Is there any suitable tool or script for doing this automatically?
Thanks in advance
WoA
Normally I would suggest BioMart for this purpose (assuming that your organism is in BioMart) but as I write, it is giving an error. However, here's the procedure for when they fix it:
Currently, this gives the error "Serious Error: Error during query execution: Table 'ensembl_mart_64.ox_RefSeq_mRNA__dm' doesn't exist" - I will report this to BioMart.
Message from Ensembl: "This is a known bug in BioMart for release 64. See the known bugs page here: http://www.ensembl.info/contact-us/known-bugs/. This bug will be fixed for release 65 due out in November."
If you're willing to try an in-development library, you can use cruzdb with a script like this:
from cruzdb import Genome

db = Genome('hg19')  # UCSC assembly name; hg19 is human
refGene = db.refGene
# names.txt holds one RefSeq mRNA ID per line
for name in (n.strip() for n in open("names.txt")):
    gene = refGene.filter_by(name=name).one()
    print(">%s" % name)
    print("".join(gene.cds_sequence))
Here names.txt contains one ID per line, such as NM_001127388 and NM_001127389.
The script prints FASTA records by querying the UCSC genomes database (refGene table) and fetching the sequence from their DAS sequence server.
If you have a long list, see the notes on the cruzdb page about mirroring the MySQL tables locally.
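Since the question asks for sequences "where available", it may help to separate the ID handling and FASTA writing from the lookup itself. The following is only a sketch under stated assumptions: read_ids, write_fasta and the fetch_cds callable are hypothetical names that are not part of cruzdb, and the toy IDs and sequence are made up for illustration. IDs whose lookup returns nothing are collected and reported instead of stopping the run.

```python
import sys


def read_ids(lines):
    """Return non-empty IDs in input order, with duplicates removed."""
    seen = set()
    ids = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            ids.append(name)
    return ids


def write_fasta(ids, fetch_cds, out=sys.stdout, width=60):
    """Write one FASTA record per ID and return the IDs that had no CDS.

    fetch_cds is a hypothetical callable: it takes an ID and returns the
    CDS as a string, or None / "" when no CDS is available.
    """
    missing = []
    for name in ids:
        seq = fetch_cds(name)
        if not seq:
            missing.append(name)
            continue
        out.write(">%s\n" % name)
        # Wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
    return missing


# Toy stand-in for a real lookup (IDs and sequence are invented).
toy_cds = {"ID_A": "ATGGCCTAA"}
missing = write_fasta(read_ids(["ID_A\n", "\n", "ID_B\n"]), toy_cds.get)
sys.stderr.write("No CDS found for: %s\n" % ", ".join(missing))
```

With cruzdb, fetch_cds could wrap the refGene lookup from the script above and return None when no matching row is found.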
Which organism?
Mouse (Mus musculus)