Given a protein family sequence alignment from Pfam, I want to get taxonomy information for each of the sequences. For example, for each sequence, I want to know whether it comes from a eukaryote or a prokaryote. How can I do this in Python, Bash, or another scriptable tool?
If you want to do it programmatically, you will probably have to consult the Pfam release files.
The taxonomy files are in the database_files directory, and it would not surprise me if parsers for them already exist somewhere among Pfam-related tools.
If you search individually by sequence, the taxonomy info is there already:
The same is true for HMM families:
Click on Species to get taxonomy info.
Here is how I got taxonomy strings, starting from a FASTA seed alignment:
- Get the list of accession IDs
grep ">" seed.fasta | sed 's/>//' | cut -f1 -d'/' > pfamaccessioncodes.txt
- Upload your list (or copy and paste it) into the UniProt Retrieve/ID mapping tool
- Download a tab-separated version of your information. The data you want should be in columns 1 and 7.
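For anyone who prefers to stay in Python, the steps above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the function names are made up for illustration, the taxonomy column (column 7, index 6) is assumed to hold a lineage string that contains the top-level names Eukaryota, Bacteria, or Archaea, and the downloaded table is assumed to have a header row. Adjust the column indices if your download is laid out differently.

```python
import csv


def accessions_from_fasta(lines):
    """Mimic the grep/sed/cut pipeline: for every FASTA header line,
    drop the leading '>' and keep the text before the first '/'."""
    return [
        line[1:].split("/")[0].strip()
        for line in lines
        if line.startswith(">")
    ]


def classify_lineage(lineage):
    """Map a lineage string to a coarse label.

    Assumption: the string contains the top-level taxon name
    (Eukaryota, Bacteria or Archaea) somewhere in it.
    """
    if "Eukaryota" in lineage:
        return "eukaryote"
    if "Bacteria" in lineage or "Archaea" in lineage:
        return "prokaryote"
    return "other"  # e.g. viruses, or an empty/unknown lineage


def classify_table(handle, id_col=0, lineage_col=6, has_header=True):
    """Read a tab-separated table and return {accession: label}.

    id_col and lineage_col default to columns 1 and 7 (0-based 0 and 6),
    as mentioned above; both are assumptions about the download layout.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    labels = {}
    for row in reader:
        if len(row) > lineage_col:  # ignore short or blank rows
            labels[row[id_col]] = classify_lineage(row[lineage_col])
    return labels
```

To use it, pass an open file to each function, for example `accessions_from_fasta(open("seed.fasta"))` to build the ID list, and `classify_table(open("mapping.tsv"))` on the downloaded table (mapping.tsv is a placeholder file name).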