Question: Getting taxonomy information from PFAM
becko wrote, 8 months ago:

Given a protein family sequence alignment from PFAM, I want to get taxonomy information for each of the sequences. For example, for each sequence, I want to know whether it comes from a eukaryote or a prokaryote. How can I do this in Python, Bash or another scriptable tool?

Mensur Dlakic (USA) wrote, 8 months ago:

If you want to do it programmatically, it will probably have to be done by consulting the Pfam release files.

ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release

Taxonomy files are in the database_files directory, and it would not surprise me if parsers for them already exist somewhere among the Pfam tools.

If you search individually by sequence, the taxonomy info is there already:

http://pfam.xfam.org/protein/O23418

The same is true for HMM families:

http://pfam.xfam.org/family/Glutaredoxin

Click either on Trees or Species to get taxonomy info.


I've been inspecting the database_files contents, but I'm not sure how to use them. Any suggestions on what I can try?

Reply by becko, 8 months ago

If you get the taxonomy file from the database_files directory, it seems to contain information in this format:

1222326 Caldisericales bacterium enrichment culture clone BSK_106       root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_106;      3404709 3404710 904954  Caldisericales bacterium enrichment culture clone BSK_106      1       species
1222330 Caldisericales bacterium enrichment culture clone BSK_27        root;cellular organisms;Bacteria;Caldiserica;Caldisericia;Caldisericales;environmental samples;Caldisericales bacterium enrichment culture clone BSK_27;       3404711 3404712 904954  Caldisericales bacterium enrichment culture clone BSK_27       1       species

The number in the first column is the NCBI taxID, the second column has the name, and the next column has the lineage. It is not clear how to relate this back to PFAM. @Mensur may have an idea.

Reply by genomax, 8 months ago (modified 8 months ago)
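
To illustrate the column layout described above, here is a minimal Python sketch that reads lines in that format and labels each entry as eukaryote or prokaryote from its lineage string. It assumes the file is tab-delimited with the taxID, name and lineage in the first three columns (inferred from the two sample lines only), and the function names classify_lineage and read_taxonomy are made up for this example. It does not solve the open question of linking entries back to PFAM sequences.

```python
import io


def classify_lineage(lineage):
    """Label a semicolon-separated lineage string by its top-level domain."""
    ranks = {part.strip() for part in lineage.split(";")}
    if "Eukaryota" in ranks:
        return "eukaryote"
    if "Bacteria" in ranks or "Archaea" in ranks:
        return "prokaryote"
    return "other"  # e.g. viruses or unclassified entries


def read_taxonomy(handle):
    """Yield (taxid, name, label) tuples from tab-delimited taxonomy lines.

    Assumption: column 1 is the NCBI taxID, column 2 the name and
    column 3 the lineage, as in the sample lines quoted above.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        taxid, name, lineage = fields[0], fields[1], fields[2]
        yield taxid.strip(), name.strip(), classify_lineage(lineage)


if __name__ == "__main__":
    sample = (
        "1222326\tCaldisericales bacterium enrichment culture clone BSK_106\t"
        "root;cellular organisms;Bacteria;Caldiserica;Caldisericia;"
        "Caldisericales;environmental samples;\t3404709\t3404710\n"
    )
    for taxid, name, label in read_taxonomy(io.StringIO(sample)):
        print(taxid, label)
```

Matching on the domain names in the lineage keeps the check independent of how deep the lineage goes; anything that is neither Eukaryota, Bacteria nor Archaea falls through to "other".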
jubillante wrote, 10 weeks ago:

Here is how I got taxonomy strings starting from a FASTA seed alignment:

  1. Get the list of accession IDs:
       grep ">" seed.fasta | sed 's/>//' | cut -f1 -d'/' > pfamaccessioncodes.txt
  2. Upload your list (or copy and paste it) into the Retrieve/ID mapping tool:
       https://www.uniprot.org/uploadlists/
  3. Download a tab-separated version of your information. The data you want should be in columns 1 and 7.
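
For anyone who prefers Python over the shell pipeline, the first and last steps above can be sketched roughly as follows. This is only an illustration: the function names accession_ids and columns_1_and_7 are hypothetical, the header format ">ID/start-end" is assumed from the cut command, and the column positions are taken at face value from the answer rather than checked against the UniProt download. The upload step itself is still done through the web tool.

```python
def accession_ids(fasta_lines):
    """Return the ID part of each FASTA header, i.e. the text between
    '>' and the first '/' (mirrors the grep | sed | cut pipeline)."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            ids.append(line[1:].strip().split("/", 1)[0])
    return ids


def columns_1_and_7(tsv_lines):
    """Yield (column 1, column 7) from tab-separated lines.

    Assumption: the downloaded table has at least 7 columns and the
    wanted data sits in columns 1 and 7, as stated in the answer.
    """
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7:
            yield fields[0], fields[6]


if __name__ == "__main__":
    # seed.fasta is the file name used in the shell command above.
    with open("seed.fasta") as fasta:
        ids = accession_ids(fasta)
    with open("pfamaccessioncodes.txt", "w") as out:
        out.write("\n".join(ids) + "\n")
```

Once the tab-separated table has been downloaded, passing the open file to columns_1_and_7 gives the pairs to work with; note that the first pair will be the header row if the download includes one.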