How to taxonomically subsample a proteomes file?
19 months ago
Shaurya • 0

I have a proteomes .faa file with all the protein sequences encoded by LUCA. I want to create a file with the proteomes of only a few specific eukaryotes, a few specific archaea, and all bacteria. How do I do this?

I have tried downloading the file from the NCBI database, but the nodes.dmp and names.dmp files are not making sense to me. I am an undergrad and I would appreciate any help.

Suppose this is the first entry in my proteomes file.


So here 1000565 is the tax ID of the organism that has the gene METUNv1_00006, and the line below it is the amino acid sequence of the encoded protein.

The nodes.dmp file has this entry (screenshot of the entry in nodes.dmp). The first column is my organism and the 5th column is the division.

But there is no Archaea in divisions.dmp.

(screenshot of the divisions file)
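One way around the missing Archaea division is to ignore the division column and follow the parent links in nodes.dmp (column 2) up the tree until a domain node is reached. As far as I know, the division column follows the GenBank divisions, where archaea are filed together with bacteria, so it cannot separate the two. Below is a minimal sketch of the parent walk. It assumes the usual pipe-delimited layout of nodes.dmp and the NCBI tax IDs 2 (Bacteria), 2157 (Archaea) and 2759 (Eukaryota). The sample lines are toy data truncated to three columns, and the IDs 9000001 and 9000002 are made up for illustration.

```python
import io

# NCBI tax IDs of the three domains; matching on IDs avoids depending on
# rank labels, which can differ between taxdump releases.
DOMAINS = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota"}


def load_parents(handle):
    """Map each tax_id (column 1) to its parent tax_id (column 2)."""
    parents = {}
    for line in handle:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def domain_of(taxid, parents):
    """Walk up the parent links until a domain node or the root is hit."""
    seen = set()
    while taxid in parents and taxid not in seen:
        if taxid in DOMAINS:
            return DOMAINS[taxid]
        seen.add(taxid)  # the root is its own parent, so guard against looping
        taxid = parents[taxid]
    return DOMAINS.get(taxid)


# Toy stand-in for nodes.dmp (only the first two columns are read).
# 9000001 and 9000002 are hypothetical organisms, not real tax IDs.
NODES_SAMPLE = (
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tno rank\t|\n"
    "2157\t|\t131567\t|\tno rank\t|\n"
    "2759\t|\t131567\t|\tno rank\t|\n"
    "9000001\t|\t2\t|\tspecies\t|\n"
    "9000002\t|\t2157\t|\tspecies\t|\n"
)

parents = load_parents(io.StringIO(NODES_SAMPLE))
for taxid in (9000001, 9000002):
    print(taxid, domain_of(taxid, parents))
```

With the real file, `load_parents(open("nodes.dmp"))` would replace the toy sample. Calling `domain_of` for every tax ID found in the proteomes file then gives the set of bacterial organisms.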

As I understand it, the names.dmp file is only needed to look up the tax IDs of the specific eukaryotes and archaea I need.
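That lookup could be sketched roughly as follows. It assumes the usual pipe-delimited names.dmp layout (tax ID, name, unique name, name class) and keeps only rows whose name class is "scientific name". The sample text is a small stand-in for the real file, and `load_scientific_names` is a hypothetical helper name.

```python
import io


def load_scientific_names(handle):
    """Map scientific name -> tax_id from names.dmp-style lines."""
    name_to_taxid = {}
    for line in handle:
        fields = [f.strip() for f in line.split("|")]
        # Column 4 is the name class; synonyms and common names are skipped.
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxid[fields[1]] = int(fields[0])
    return name_to_taxid


# Small stand-in for names.dmp.
NAMES_SAMPLE = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
    "4932\t|\tSaccharomyces cerevisiae\t|\t\t|\tscientific name\t|\n"
)

names = load_scientific_names(io.StringIO(NAMES_SAMPLE))
wanted = {names[n] for n in ("Homo sapiens", "Saccharomyces cerevisiae")}
print(sorted(wanted))
```

A misspelled or outdated organism name raises a `KeyError` here, which is probably the safer behaviour when building the list of wanted organisms by hand.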

I don't understand how to use this information to sample the LUCA proteomes so that I end up with only the subset of proteomes I actually need. Do I simply write Python code that will do it for me? Or is there another tool that is used to taxonomically sample a huge proteomes file?
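For the plain Python route, a streaming filter over the .faa file is probably enough, and it never needs to hold the whole file in memory. The sketch below is only illustrative. It ASSUMES each header looks like `>1000565.METUNv1_00006`, i.e. the tax ID followed by a dot and the gene name; the real header layout is not shown here, so `taxid_from_header` would need adapting. The set of wanted tax IDs is assumed to have been built beforehand, and the sequences and the IDs 9000002 and 9000003 are made up.

```python
import io


def taxid_from_header(header):
    """ASSUMPTION: header is '>TAXID.GENE ...'; adapt to the real layout."""
    return int(header[1:].split(".", 1)[0])


def filter_fasta(src, dst, keep):
    """Copy records whose tax ID satisfies keep(taxid); return the count."""
    writing = False
    kept = 0
    for line in src:
        if line.startswith(">"):
            # A new record starts: decide once, then apply to its sequence lines.
            writing = keep(taxid_from_header(line))
            if writing:
                kept += 1
        if writing:
            dst.write(line)
    return kept


# Toy input with made-up sequences; 9000002 and 9000003 are hypothetical IDs.
FASTA_SAMPLE = (
    ">1000565.METUNv1_00006\n"
    "MKTAYIAK\n"
    ">9000002.geneA\n"
    "MAAA\n"
    "GGG\n"
    ">9000003.geneB\n"
    "MCC\n"
)

wanted = {1000565, 9000003}  # tax IDs to keep, assumed to be known already
out = io.StringIO()
n_kept = filter_fasta(io.StringIO(FASTA_SAMPLE), out, lambda t: t in wanted)
print(n_kept)
print(out.getvalue(), end="")
```

On real data the two `StringIO` objects would be replaced by `open("proteomes.faa")` and `open("subset.faa", "w")` (both file names are placeholders). The `keep` predicate can be any function of the tax ID, so "all bacteria plus a few chosen organisms" is a single lambda once the domain of each tax ID is known.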

proteomes faa taxonomy sampling • 592 views
"nodes.dmp and names.dmp file is not making sense to me"

What do you mean by not making sense?

What sort of protein IDs/accessions do you have?

I have added details to my question. I hope it makes more sense now. Sorry for the late response, I did not have internet connectivity.


