Getting taxonomy information from PFAM
4.2 years ago
bef1 • 0

Given a protein family sequence alignment from PFAM, I want to get taxonomy information for each of the sequences. For example, for each sequence, I want to know whether it comes from a eukaryote or a prokaryote. How can I do this in Python, Bash, or another scriptable tool?

sequence alignment genome sequencing • 1.3k views
4.2 years ago
Mensur Dlakic ★ 27k

If you want to do it programmatically, you will probably have to consult the Pfam release files:

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release

The taxonomy files are in the database_files directory, and it would not surprise me if parsers for them already exist somewhere in the Pfam community.

If you search individually by sequence, the taxonomy info is there already:

http://pfam.xfam.org/protein/O23418

The same is true for HMM families:

http://pfam.xfam.org/family/Glutaredoxin

Click either on Trees or Species to get taxonomy info.


I've been inspecting the database_files contents, but I'm not sure how to use them. Any suggestions on what I can try?


If you get the taxonomy file from the database_files directory, it seems to contain information in this format:

1222326 Caldisericales bacterium enrichment culture clone BSK_106       root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_106;      3404709 3404710 904954  Caldisericales bacterium enrichment culture clone BSK_106      1       species
1222330 Caldisericales bacterium enrichment culture clone BSK_27        root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_27;       3404711 3404712 904954  Caldisericales bacterium enrichment culture clone BSK_27       1       species

The number in the first column is the NCBI taxID, the second column has the name, and the next column has the taxonomic lineage. It is not clear how to relate this back to PFAM. @Mensur may have an idea.
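
For the eukaryote/prokaryote question, here is a minimal Python sketch. It assumes the file is tab-separated, with the taxID, name and lineage in the first three columns as in the two lines above. The function names are made up for illustration, and treating Bacteria and Archaea as "prokaryote" is my own choice of labels. It does not solve the problem of linking a taxID back to a Pfam sequence.

```python
def classify_lineage(lineage):
    """Return 'eukaryote', 'prokaryote' or 'other' for a ';'-separated lineage."""
    # Split on ';' and drop the empty piece left by the trailing separator.
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    if "Eukaryota" in ranks:
        return "eukaryote"
    if "Bacteria" in ranks or "Archaea" in ranks:
        return "prokaryote"
    # Viruses, unclassified entries, etc. end up here.
    return "other"


def parse_taxonomy_line(line):
    """Return (taxid, name, label) from one line of the taxonomy file.

    Assumption: columns are tab-separated and the lineage is column 3.
    """
    fields = line.rstrip("\n").split("\t")
    taxid, name, lineage = fields[0], fields[1], fields[2]
    return taxid, name, classify_lineage(lineage)
```

Looping over the file and storing the label in a dict keyed by taxID would give a lookup table to use once each sequence has a taxID.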

3.7 years ago
jubillante • 0

Here is how I got taxonomy strings, starting from a FASTA seed alignment:

  1. Get the list of accession IDs: grep ">" seed.fasta | sed 's/>//' | cut -f1 -d'/' > pfamaccessioncodes.txt
  2. Upload your list (or copy and paste it) into the Retrieve/ID mapping tool:

     https://www.uniprot.org/uploadlists/

  3. Download a tab-separated version of your information. The data you want should be in columns 1 and 7.
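
For those who prefer Python over the shell pipeline in step 1, here is a rough equivalent. It assumes headers look like ">ID/start-end", so everything between ">" and the first "/" is kept. The function name is hypothetical.

```python
def accessions_from_fasta(lines):
    """Collect the ID part of each FASTA header (text between '>' and the first '/')."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Mirror cut -f1 -d'/': keep only what precedes the first slash.
            ids.append(header.split("/", 1)[0])
    return ids
```

Passing it an open file handle for seed.fasta and writing one ID per line should give the same list as pfamaccessioncodes.txt.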