Question

Getting taxonomy information from PFAM

4.2 years ago

bef1 • 0

Given a protein family sequence alignment from PFAM, I want to get taxonomy information for each of the sequences. For example, for each sequence, I want to know whether it comes from a eukaryote or a prokaryote. How can I do this in Python, Bash or another scriptable tool?

sequence alignment genome sequencing • 1.3k views

ADD COMMENT • link updated 3.7 years ago by jubillante • 0 • written 4.2 years ago by bef1 • 0

score 0 · Answer 1 · 2020-02-25

4.2 years ago

Mensur Dlakic ★ 27k

If you want to do it programmatically, it will probably have to be done by consulting the Pfam release files.

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release

Taxonomy files are in the database_files directory, and it would not surprise me if parsers for them already exist somewhere on the Pfam site.

If you search individually by sequence, the taxonomy info is there already:

http://pfam.xfam.org/protein/O23418

The same is true for HMM families:

http://pfam.xfam.org/family/Glutaredoxin

Click on either Trees or Species to get taxonomy info.

I've been inspecting the database_files contents, but I'm not sure how to use them. Any suggestions on what I can try?

ADD REPLY • link 4.2 years ago by bef1 • 0

The taxonomy file in the database_files directory seems to contain information in this format:

1222326 Caldisericales bacterium enrichment culture clone BSK_106       root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_106;      3404709 3404710 904954  Caldisericales bacterium enrichment culture clone BSK_106      1       species
1222330 Caldisericales bacterium enrichment culture clone BSK_27        root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_27;       3404711 3404712 904954  Caldisericales bacterium enrichment culture clone BSK_27       1       species

The number in the first column is the NCBI taxID, the second column has the name, and the next column has the lineage. It is not clear how to relate this back to PFAM. @Mensur may have an idea.
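A minimal Python sketch of how the lineage column could be used to answer the eukaryote/prokaryote part of the question. It assumes the file is tab-separated with the taxID, name and lineage in the first three columns, as the excerpt above suggests. The function names and the output labels are illustrative, and linking taxIDs back to individual Pfam sequences is not covered here.

```python
def classify_lineage(lineage):
    """Return 'eukaryote', 'prokaryote' or 'other' for a semicolon-separated lineage."""
    ranks = {part.strip() for part in lineage.split(";") if part.strip()}
    if "Eukaryota" in ranks:
        return "eukaryote"
    if "Bacteria" in ranks or "Archaea" in ranks:
        return "prokaryote"
    return "other"  # e.g. viruses or unclassified entries


def parse_taxonomy_line(line):
    """Split one line into (taxid, name, lineage); assumes tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    taxid, name, lineage = (field.strip() for field in fields[:3])
    return taxid, name, lineage


def classify_taxonomy_file(path):
    """Map each NCBI taxID in the file to a coarse domain label."""
    labels = {}
    with open(path) as handle:
        for line in handle:
            # Skip blank or truncated lines rather than failing on them.
            if len(line.rstrip("\n").split("\t")) < 3:
                continue
            taxid, _name, lineage = parse_taxonomy_line(line)
            labels[taxid] = classify_lineage(lineage)
    return labels
```

Calling `classify_taxonomy_file` on the downloaded file would give a dictionary keyed by taxID, which can then be joined to whatever table maps sequences to taxIDs.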

ADD REPLY • link 4.2 years ago by GenoMax 141k

score 0 · Answer 2 · 2020-08-11

Here is how I got taxonomy strings starting with a fasta seed alignment:

Get the list of accession IDs:

grep ">" seed.fasta | sed 's/>//' | cut -f1 -d'/' > pfamaccessioncodes.txt

Upload your list (or copy and paste it) into the Retrieve/ID mapping tool:

https://www.uniprot.org/uploadlists/

Download a tab-separated version of your information. The data you want should be in columns 1 and 7.
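The first step can also be done in Python, and the downloaded table can be classified in the same script. This is a sketch under assumptions: sequence headers look like `>ACCESSION/start-end`, the table has a header row, and column 7 holds a lineage string that contains the domain name (for example `Eukaryota`). Adjust the column indexes if your download differs. The function names are illustrative.

```python
import csv


def read_accessions(fasta_path):
    """Collect IDs from FASTA headers, dropping '>' and the '/start-end' suffix."""
    accessions = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                accessions.append(line[1:].strip().split("/")[0])
    return accessions


def domain_of(lineage):
    """Coarse label from a lineage string; relies on simple substring matching."""
    if "Eukaryota" in lineage:
        return "eukaryote"
    if "Bacteria" in lineage or "Archaea" in lineage:
        return "prokaryote"
    return "unknown"


def classify_table(tsv_path, id_col=0, lineage_col=6):
    """Map each ID (column 1) to a label derived from column 7 (0-based 0 and 6)."""
    labels = {}
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if len(row) > max(id_col, lineage_col):
                labels[row[id_col]] = domain_of(row[lineage_col])
    return labels
```

Writing `"\n".join(read_accessions("seed.fasta"))` to a file gives the same list as the shell pipeline above, ready to paste into the mapping tool.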