How to taxonomically subsample a proteomes file?
19 months ago
Shaurya • 0

I have a proteomes .faa file with all the protein sequences encoded by LUCA. I want to create a file with the proteomes of only a few specific eukaryotes, a few specific archaea, and all bacteria. How do I do this?

I have tried downloading the files from the NCBI database, but the nodes.dmp and names.dmp files are not making sense to me. I am an undergrad and I would appreciate any help.

Suppose this is the first entry in my proteomes file.


So here 1000565 is the tax ID of the organism that has the gene METUNv1_00006, and the line below it is the amino acid sequence of the encoded protein.
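To make the header layout concrete, here is a minimal Python sketch of pulling the tax ID and gene ID out of a header line. The original entry is not shown in the post, so the format `>TAXID.GENE_ID` is an assumption, and `parse_header` is a hypothetical helper name, not part of any library.

```python
def parse_header(header):
    """Split a FASTA header into (taxid, gene_id).

    ASSUMPTION: the header looks like '>1000565.METUNv1_00006',
    i.e. tax ID, a dot, then the gene ID. Adjust if your file differs.
    """
    token = header.lstrip(">").split()[0]      # first whitespace-separated field
    taxid, _, gene_id = token.partition(".")   # split on the first dot only
    if not taxid.isdigit() or not gene_id:
        raise ValueError(f"unexpected header format: {header!r}")
    return int(taxid), gene_id
```

Splitting on the first dot only keeps gene IDs that themselves contain dots intact.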

The nodes.dmp file has this entry:

[screenshot of the entry in nodes.dmp]

The first column is my organism and the 5th column is the division, but there is no archaea in division.dmp.

[screenshot of divisions file]
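One way around the missing archaea division is to ignore the division column and walk up the parent links in nodes.dmp instead. As far as I know, the Bacteria division in the NCBI dump also covers archaea, so it cannot separate the two. The sketch below assumes the standard dump layout (fields separated by a tab, a pipe, and a tab; column 1 is the tax ID and column 2 is the parent tax ID) and uses the NCBI tax IDs 2 (Bacteria), 2157 (Archaea) and 2759 (Eukaryota). `load_parents` and `domain_of` are hypothetical helper names.

```python
# NCBI tax IDs of the three domains.
DOMAINS = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota"}

def load_parents(lines):
    """Build a {taxid: parent_taxid} dict from the lines of nodes.dmp."""
    parents = {}
    for line in lines:
        fields = line.split("\t|\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents

def domain_of(taxid, parents):
    """Walk up the tree until a domain is reached; None if there is none."""
    seen = set()
    while taxid not in seen:          # the root (1) is its own parent
        if taxid in DOMAINS:
            return DOMAINS[taxid]
        seen.add(taxid)
        if taxid not in parents:
            return None               # tax ID missing from the dump
        taxid = parents[taxid]
    return None

# Typical use (not executed here):
#   with open("nodes.dmp") as fh:
#       parents = load_parents(fh)
#   domain_of(1000565, parents)
```

The `seen` set stops the loop at the root node, which lists itself as its own parent.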

I understand that the names.dmp file is only needed to look up the tax IDs of the specific eukaryotes and archaea I need.

I don't understand how to use this information to sample the proteomes of LUCA so that I end up with the subset of proteomes I actually need. Do I simply write Python code that will do it for me? Or is there another tool that is used to taxonomically sample a huge proteomes file?
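If plain Python is the route taken, a streaming filter is probably enough, since it never holds the whole file in memory. The sketch below is a minimal version under the assumption that each header starts with the tax ID followed by a dot (for example `>1000565.METUNv1_00006`). `taxid_of` and `filter_fasta` are hypothetical names, and the sequences in the demonstration are made-up toy data.

```python
import io

def taxid_of(header):
    # ASSUMPTION: headers look like ">1000565.METUNv1_00006".
    return int(header[1:].split(".", 1)[0])

def filter_fasta(src, dst, keep_taxid):
    """Copy records from src to dst when keep_taxid(taxid) is true.

    src and dst are open text handles; returns the number of records kept.
    """
    writing = False
    kept = 0
    for line in src:
        if line.startswith(">"):                  # a new record begins
            writing = keep_taxid(taxid_of(line))
            if writing:
                kept += 1
        if writing:                               # header and sequence lines
            dst.write(line)
    return kept

# Toy demonstration with in-memory "files".
wanted = {1000565}                                # tax IDs chosen by hand
demo_in = io.StringIO(">1000565.METUNv1_00006\nMKV\nLLA\n>9606.geneX\nMAA\n")
demo_out = io.StringIO()
n_kept = filter_fasta(demo_in, demo_out, lambda t: t in wanted)
```

For the real file, `src` and `dst` would be handles from `open(...)`, and `keep_taxid` could return true for the hand-picked eukaryote and archaea tax IDs as well as for any tax ID whose lineage in nodes.dmp leads to Bacteria (tax ID 2).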

proteomes faa taxonomy sampling • 592 views

nodes.dmp and names.dmp file is not making sense to me

What do you mean by not making sense?

What sort of protein ids/accession do you have?


I have added details to my question. I hope it makes more sense now. Sorry for the late response; I did not have internet connectivity.

