Question: How to get FASTA sequences for CDS if I have protein IDs?
little_more40 wrote:

I have a large table where each column contains protein IDs of a particular group of orthologs. How do I map these protein IDs to gene IDs and then get a file with fasta sequences of all genes for each column?

sequence gene • 737 views
modified 2.0 years ago by h.mon31k • written 2.0 years ago by little_more40

which organism? can you use Ensembl/BioMart? how can I bake a pie?

written 2.0 years ago by JC12k

let's put a bit more flesh to that bone http://www.ensembl.org/biomart/martview/

written 2.0 years ago by Carambakaracho2.2k

Always provide a few examples when asking this type of question. Protein IDs could be anything, and the answer will depend on what kind they are.

NCBI's Unix utilities would almost certainly work if the IDs are from GenBank.

modified 2.0 years ago • written 2.0 years ago by genomax91k

Oh, my bad. All IDs are from GenBank Escherichia genome assemblies (.faa files). For example, AAN78512.1, BAB33431.1, BAB33432.1.

P.S. I know that I can simply go to NCBI and get the CDS for each protein manually, but the question is how to do this for a large number of ID groups. I've heard something about EDirect, but maybe there is a common way to do this in one line.

modified 2.0 years ago • written 2.0 years ago by little_more40

If you need to get all CDSs for E. coli O157:H7, then those are available here. If the IDs are from different genomes, then it is a different problem. Let me look into it some.

written 2.0 years ago by genomax91k

IDs are from different genomes. In fact, I have a table with protein IDs:

           group1   group2   group3   group4   ... 
bac1          ID1      ID2      ID3      ID4
bac2          ID5      ID6      ID7      ID8
bac3          ID9     ID10     ID11     ID12
...

and I need to get a file with fasta sequences of CDS for each group. Suppose I have all .fna assembly files. Could I use BioPython to get the files?

modified 2.0 years ago • written 2.0 years ago by little_more40
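
For a table laid out like the one above, one possible first step is to split each column into its own list of protein accessions. The sketch below is only an illustration under stated assumptions: the table is whitespace-delimited, the header row holds only the group names (no label over the genome column), and a missing ortholog is marked with `-`. The function name `split_groups` and the `.ids` suffix are hypothetical.

```shell
#!/bin/sh
# split_groups TABLE OUTDIR
# Writes OUTDIR/<group>.ids with one protein accession per line.
# Assumptions (not from the thread): whitespace-delimited table, header row
# contains only group names, "-" marks a missing ortholog.
split_groups() {
    table=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        NF == 0 { next }                      # skip blank lines
        !header_done {
            # First non-blank line: remember the group name of each column
            for (i = 1; i <= NF; i++) name[i] = $i
            ngroups = NF
            header_done = 1
            next
        }
        {
            # Data row: field 1 is the genome label, field i+1 is group i
            for (i = 1; i <= ngroups; i++) {
                id = $(i + 1)
                if (id != "" && id != "-")
                    print id > (outdir "/" name[i] ".ids")
            }
        }
    ' "$table"
}
```

Called as `split_groups orthologs.txt ids` (both names are placeholders), this would leave one file per group, such as `ids/group1.ids`. With a very large number of columns, some awk implementations may hit a limit on simultaneously open output files.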
h.mon31k wrote:

Try:

efetch -db protein -format fasta_cds_na -id AAN78512

edit: works the same with:

efetch -db protein -format fasta_cds_na -id AAN78512.1
modified 2.0 years ago • written 2.0 years ago by h.mon31k

Thank you! But is it possible to use the command for more than 500 IDs? The documentation says 'a comma-delimited list of UIDs may be provided... but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method'.

written 2.0 years ago by little_more40

You could run the efetch command via a loop. Be sure to sign up for an NCBI API_KEY and use it. Use discretion when sending in those queries so as to not get IP banned.

written 2.0 years ago by genomax91k
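
The loop suggested above might look like the following sketch. It sends accessions in comma-separated batches that stay below the roughly 200-UID limit quoted from the documentation, and it pauses between requests. The function name `fetch_cds`, the `FETCH_PAUSE` variable, and the default batch size of 100 are assumptions for illustration. EDirect picks up an API key from the `NCBI_API_KEY` environment variable, so exporting it before running the loop should be enough.

```shell
#!/bin/sh
# fetch_cds IDS_FILE OUT_FASTA [BATCH_SIZE]
# Reads one protein accession per line from IDS_FILE and appends the
# corresponding CDS nucleotide FASTA records to OUT_FASTA.
# Hypothetical helper; batch size and pause are illustrative defaults.
fetch_cds() {
    ids_file=$1
    out_fasta=$2
    batch_size=${3:-100}
    : > "$out_fasta"                          # start with an empty output file

    # Join the accessions into comma-separated lines of batch_size IDs each
    awk -v n="$batch_size" '
        NF {
            printf "%s%s", (c ? "," : ""), $1
            if (++c == n) { print ""; c = 0 }
        }
        END { if (c) print "" }
    ' "$ids_file" |
    while IFS= read -r batch; do
        # stdin is redirected so efetch cannot consume the loop input
        efetch -db protein -format fasta_cds_na -id "$batch" \
            < /dev/null >> "$out_fasta"
        sleep "${FETCH_PAUSE:-1}"             # be gentle with the NCBI servers
    done
}
```

A call such as `fetch_cds ids/group1.ids group1_cds.fna` (file names are placeholders) would then produce one CDS FASTA file per group when repeated for each ID list.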

Hi! When I try to run the same command, efetch does not take any action but just prints out the help. Any clue why this happens?

written 23 months ago by shubhra.bhattacharya120

This can have many causes; the most frequent one is a typo. If you want more in-depth help, please post your exact command here. Please use the 101010 code formatting button (fifth in the ribbon above).

modified 23 months ago • written 23 months ago by Carambakaracho2.2k