I have a large table where each column contains the protein IDs of one group of orthologs. How do I map these protein IDs to gene IDs and then get a file with the FASTA sequences of all genes for each column?
Can you use Ensembl/BioMart?
How can I bake a pie?
Let's put a bit more flesh on that bone.
Always provide a few examples when asking this type of question. Protein IDs could be anything, and the answer will depend on what kind they are.
NCBI's Unix utilities would almost certainly work if the IDs are from GenBank.
Oh, my bad. All IDs are from GenBank Escherichia genome assemblies (.faa files). For example, AAN78512.1, BAB33431.1, BAB33432.1.
P.S. I know that I can simply go to NCBI and get CDS for each protein manually but the question is how to do this for a large number of ID groups. I've heard something about EDirect but maybe there is a common way to do this with one line.
If you need all CDSs for E. coli O157:H7, those are available here. If the IDs are from different genomes, it is a different problem. Let me look into it some.
IDs are from different genomes. In fact, I have a table with protein IDs:
      group1  group2  group3  group4  ...
bac1  ID1     ID2     ID3     ID4
bac2  ID5     ID6     ID7     ID8
bac3  ID9     ID10    ID11    ID12
and I need to get a file with the FASTA sequences of the CDSs for each group.
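One way to turn a table like this into per-group ID lists is a short awk pass. This is only a minimal sketch. It assumes the table is whitespace-delimited, the header row holds just the group names, and every data row starts with a genome label followed by exactly one ID per group, as in the example above. The function name `split_groups` and the `.ids` file naming are made up for illustration.

```shell
# Hypothetical helper: write one ID list per column (group) of the table.
# Assumed layout: header = group names only; rows = genome label, then one ID per group.
split_groups() {
    table="$1"
    outdir="$2"
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        NR == 1 {
            # Remember the group names and start each output file empty.
            n = NF
            for (i = 1; i <= n; i++) {
                name[i] = $i
                f = outdir "/" name[i] ".ids"
                printf "" > f
                close(f)
            }
            next
        }
        {
            # Field 1 is the genome label, so group i sits in field i + 1.
            for (i = 1; i <= n; i++) {
                id = $(i + 1)
                if (id == "") continue      # short row: nothing to write
                f = outdir "/" name[i] ".ids"
                print id >> f
                close(f)                    # avoid running out of open files on wide tables
            }
        }
    ' "$table"
}
```

Calling `split_groups table.txt ids` (both names are placeholders) would write `ids/group1.ids`, `ids/group2.ids`, and so on, with one ID per line. Note that a row with a missing ID would shift the remaining columns under whitespace splitting, so a table with gaps would need a real delimiter such as tabs.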
Suppose I have all the .fna assembly files. Could I use Biopython to get the files?
efetch -db protein -format fasta_cds_na -id AAN78512
Edit: it works the same with the versioned accession:
efetch -db protein -format fasta_cds_na -id AAN78512.1
Thank you! But is it possible to use the command for more than 500 IDs? The documentation says 'a comma-delimited list of UIDs may be provided... but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method'.
You could run the efetch command in a loop. Be sure to sign up for an NCBI API key and use it. Pace your queries so that you do not get your IP address banned.
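Such a loop could look roughly like the following. It is a sketch under stated assumptions, not a definitive pipeline: it assumes EDirect is installed and that the ID file has one protein accession per line. The function name `fetch_group_cds` and the `DELAY` variable are hypothetical. EDirect reads the API key from the `NCBI_API_KEY` environment variable. NCBI's stated limits are 3 requests per second without a key and 10 with one, so the default one-second pause is deliberately conservative.

```shell
# Assumption: set this once in your shell before running (value is a placeholder).
# export NCBI_API_KEY="your_key_here"

# Hypothetical helper: fetch the CDS for every accession in ids_file into one FASTA file.
fetch_group_cds() {
    ids_file="$1"
    out_fasta="$2"
    delay="${DELAY:-1}"            # seconds to wait between requests
    : > "$out_fasta"               # start with an empty output file
    while IFS= read -r id; do
        [ -n "$id" ] || continue   # skip blank lines
        # </dev/null is a precaution: a command that reads stdin inside a
        # while-read loop could otherwise swallow the rest of the ID list.
        if ! efetch -db protein -format fasta_cds_na -id "$id" < /dev/null >> "$out_fasta"; then
            echo "failed: $id" >&2
        fi
        sleep "$delay"
    done < "$ids_file"
}
```

Running `fetch_group_cds group1.ids group1_cds.fna` (placeholder file names) once per group would then give one CDS FASTA file per column of the table.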
When I try to run the same command, efetch does not take any action but just prints out the help.
Any clue why this happens?
This can have many causes; the most frequent one is a typo. If you want more detailed help, please post your exact command here, using the 101010 code formatting button (fifth in the ribbon above).