Question

How to taxonomically subsample a proteomes file?

2.7 years ago

Shaurya • 0

I have a proteomes .faa file with all the protein sequences encoded by LUCA. I want to create a file with the proteomes of only a few specific eukaryotes, a few specific archaea and all bacteria. How do I do this?

I tried downloading the taxdmp.zip file from the NCBI database, but the nodes.dmp and names.dmp files do not make sense to me. I am an undergrad and would appreciate any help.

Suppose this is the first entry in my proteomes file.

1000565.METUNv1_00006
MFSYVSLEQRVPKDHPLRSLRALVDGILANMSALFDERYSHTG

Here, 1000565 is the tax ID of the organism that has the gene METUNv1_00006, and the line below it is the amino acid sequence of the encoded protein.
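To make the identifier layout concrete, here is a minimal Python sketch that splits such an identifier into its tax ID and gene name. It assumes every identifier follows the `taxid.gene` pattern shown above; the function name `split_protein_id` is hypothetical, and the leading `>` is only tolerated in case the file uses standard FASTA headers.

```python
def split_protein_id(identifier):
    """Split an identifier such as '1000565.METUNv1_00006' into (tax_id, gene_id)."""
    # Drop an optional FASTA '>' and anything after the first whitespace.
    token = identifier.lstrip(">").split()[0]
    # Everything before the first '.' is assumed to be the NCBI tax ID.
    tax_id, _, gene_id = token.partition(".")
    if not tax_id.isdigit() or not gene_id:
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return int(tax_id), gene_id
```

Calling `split_protein_id("1000565.METUNv1_00006")` gives the tax ID as an integer together with the gene name, which is enough to group or filter records by organism.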

The nodes.dmp file has this entry for it [screenshot of the entry in nodes.dmp]. The first column is my organism and the 5th column is the division.

But there is no Archaea entry in divisions.dmp.

[screenshot of divisions file]
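One way around the missing Archaea division is to ignore the division column and follow the parent links in nodes.dmp (2nd column) up to the domain node. As far as I know, the division codes follow the GenBank divisions, which do not separate archaea from bacteria, and the file in the taxdump is named division.dmp. The sketch below is a minimal Python version of that lineage walk, assuming the usual `|`-delimited layout of nodes.dmp and the NCBI tax IDs 2 (Bacteria), 2157 (Archaea) and 2759 (Eukaryota); the function names are hypothetical.

```python
# Tax IDs of the three domain nodes in the NCBI taxonomy.
DOMAINS = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota"}


def load_parents(lines):
    """Map each tax ID to its parent tax ID from the lines of nodes.dmp."""
    parents = {}
    for line in lines:
        # Fields are separated by '|' and padded with tabs.
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            parents[int(fields[0])] = int(fields[1])
    return parents


def domain_of(tax_id, parents):
    """Walk up the parent links until a domain node is found.

    Returns None if the tax ID is unknown or the walk reaches the root
    (tax ID 1, which is its own parent) without passing a domain node.
    """
    seen = set()
    while tax_id in parents and tax_id not in seen:
        if tax_id in DOMAINS:
            return DOMAINS[tax_id]
        seen.add(tax_id)
        tax_id = parents[tax_id]
    return None
```

With the real file this would be used as `parents = load_parents(open("nodes.dmp"))` followed by `domain_of(1000565, parents)`. Tax IDs that NCBI has merged or deleted since the proteomes file was built would come back as `None` and need separate handling.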

As I understand it, the names.dmp file is only needed to look up the tax IDs of the specific eukaryotes and archaea I need.

I don't understand how to use this information to sample the proteomes of LUCA so that I end up with only the subset of proteomes I actually need. Do I simply write Python code to do it? Or is there another tool for taxonomically sampling a huge proteomes file?
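If plain Python is the route taken, the filtering step itself can be short. The sketch below streams through the file and keeps only records whose tax ID is in a chosen set. It assumes standard FASTA headers starting with `>` followed by `taxid.gene`; if the file has no `>` markers, the header test would need to be adapted. The function name `filter_fasta` and the file names in the comment are hypothetical.

```python
def filter_fasta(lines, keep_tax_ids):
    """Yield the lines of every record whose header tax ID is in keep_tax_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Header line: the text between '>' and the first '.' is the tax ID.
            tax_id = line[1:].split(".", 1)[0]
            keep = tax_id.isdigit() and int(tax_id) in keep_tax_ids
        # Sequence lines inherit the decision made for their header.
        if keep:
            yield line


# Example use with hypothetical file names:
# with open("proteomes.faa") as src, open("subset.faa", "w") as dst:
#     dst.writelines(filter_fasta(src, keep_tax_ids))
```

Because the file is read line by line, memory use stays small even for a very large proteomes file. The set `keep_tax_ids` would hold the tax IDs of the chosen eukaryotes and archaea plus every tax ID in the file that belongs to Bacteria.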

proteomes faa taxonomy sampling • 804 views


nodes.dmp and names.dmp file is not making sense to me

What do you mean by not making sense?

What sort of protein ids/accession do you have?

2.7 years ago by Nitin Narwade ★ 1.6k

I have added details to my question. I hope it makes more sense now. Sorry for the late response; I did not have internet connectivity.

2.7 years ago by Shaurya • 0