Question

How to associate Pfam and Swissprot Proteins

4.7 years ago

ciro.ferreira.oliveira • 0

I'm working on a T. cruzi protein classifier. I've come across several different datasets and need to join the information from them.

Here is what I've done:

Pfam Data

I downloaded the Pfam-A.hmm data (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz) and ran hmmscan to get tabular domain hits for my proteins:

hmmscan --tblout output/output-file-tbl data/Pfam-A.hmm input/T_cruzi_TriTryp-25.fasta

Swissprot Data

Then I downloaded 103 T. cruzi proteins from Swiss-Prot and ran BLAST to align them against the same FASTA file I used for the Pfam scan (T_cruzi_TriTryp-25.fasta):

formatdb -p T -i input/T_cruzi_TriTryp-25.fasta

blastall -p blastp -d input/T_cruzi_TriTryp-25.fasta -i input/swiss-prot--T-cruzi.fasta -o output/swiss-prot--T-cruzi.blastp.xml -b 2 -e 0.05 -m 7 1> output/out-blast 2> output/error-blast

Then I parsed the XML output in R to get tabular information, but this is where I am stuck.

How do I associate the Swiss-Prot data (2) with the Pfam data (1)?

In Pfam data I have:

target_name: Pkinase, His_Phos_1, AAA_33, etc.
accession: PF00069, PF00300, PF13671, etc.
query_name: TcCLB.509163.70, TcCLB.508625.50, TcCLB.508737.100, etc.

I noticed that a single accession can map to more than one query_name, and a single query_name can map to more than one accession.
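The many-to-many pattern described above is what one would expect when a domain family occurs in several proteins and a protein carries several domains. A minimal Python sketch of indexing the `--tblout` rows in both directions follows. It assumes the first three whitespace-separated columns are target name, target accession and query name, and that the accession may carry a version suffix such as `.28`. The example rows, including their scores, version suffixes and ID pairings, are invented for illustration, and `parse_tblout` and `index_hits` are hypothetical helper names.

```python
from collections import defaultdict

def parse_tblout(lines):
    """Yield (target_name, pfam_accession, query_name) from hmmscan --tblout lines."""
    for line in lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # The first 18 columns are fixed; the free-text description is last.
        fields = line.split(maxsplit=18)
        target_name, accession, query_name = fields[0], fields[1], fields[2]
        # Drop a version suffix if present (assumption: "PF00069.28" -> "PF00069").
        yield target_name, accession.split(".")[0], query_name

def index_hits(hits):
    """Build protein -> domains and domain -> proteins lookups."""
    domains_by_protein = defaultdict(set)
    proteins_by_domain = defaultdict(set)
    for _, accession, query_name in hits:
        domains_by_protein[query_name].add(accession)
        proteins_by_domain[accession].add(query_name)
    return domains_by_protein, proteins_by_domain

# Invented rows, shaped like --tblout output, for illustration only.
example_rows = [
    "# target name  accession  query name  ...",
    "Pkinase PF00069.28 TcCLB.509163.70 - 1.2e-50 170.1 0.0 1.5e-50 169.8 0.0 1.1 1 0 0 1 1 1 1 Protein kinase domain",
    "AAA_33 PF13671.9 TcCLB.509163.70 - 3.0e-06 27.0 0.0 4.1e-06 26.5 0.0 1.2 1 0 0 1 1 1 1 AAA domain",
    "Pkinase PF00069.28 TcCLB.508625.50 - 2.0e-40 140.3 0.0 2.6e-40 139.9 0.0 1.1 1 0 0 1 1 1 1 Protein kinase domain",
]

domains_by_protein, proteins_by_domain = index_hits(parse_tblout(example_rows))
for protein, domains in sorted(domains_by_protein.items()):
    print(protein, sorted(domains))
```

With the real file, the same functions could be fed from `open("output/output-file-tbl")`.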

In Swissprot data I have:

Iteration_query-def: sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 OS=Trypanosoma cruzi OX=5693 PE=2 SV=1

I tried to join on the query name in Hit_def, but it is not unique. Am I misunderstanding some concept? Any help would be appreciated.
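One possible way around the non-unique Hit_def values is to keep only the best-scoring hit for each Swiss-Prot query and then join on the TriTryp ID. The Python sketch below makes several assumptions: the BLAST results have already been flattened to (Iteration_query-def, Hit_def, bit score) tuples, the TriTryp ID is the first whitespace-delimited token of Hit_def, and the UniProt accession is the second `|`-separated field of the query header. The function names are hypothetical, and the hit pairings, scores and domain assignments are invented and not biologically meaningful.

```python
def best_hit_per_query(blast_rows):
    """Keep the highest bit-score hit for each query definition."""
    best = {}
    for query_def, hit_def, bit_score in blast_rows:
        current = best.get(query_def)
        if current is None or bit_score > current[1]:
            best[query_def] = (hit_def, bit_score)
    return {query: hit for query, (hit, _) in best.items()}

def join_swissprot_to_pfam(blast_rows, domains_by_protein):
    """Map each UniProt accession to the Pfam domains of its best TriTryp hit."""
    joined = {}
    for query_def, hit_def in best_hit_per_query(blast_rows).items():
        # "sp|P92188|PSA1_TRYCR ..." -> "P92188"
        uniprot_acc = query_def.split("|")[1]
        # Assumption: the TriTryp ID is the first token of the hit header.
        tritryp_id = hit_def.split()[0]
        joined[uniprot_acc] = sorted(domains_by_protein.get(tritryp_id, ()))
    return joined

# Invented example data, for illustration only.
query = ("sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 "
         "OS=Trypanosoma cruzi OX=5693 PE=2 SV=1")
blast_rows = [
    (query, "TcCLB.509163.70 example header text", 250.0),
    (query, "TcCLB.508625.50 example header text", 180.0),
]
pfam_domains = {
    "TcCLB.509163.70": {"PF00069", "PF13671"},
    "TcCLB.508625.50": {"PF00300"},
}

joined = join_swissprot_to_pfam(blast_rows, pfam_domains)
print(joined)
```

Because several Swiss-Prot queries can still share the same best hit, the result is keyed by UniProt accession and not by TriTryp ID.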

alignment pfam • 1.1k views

Maybe I misunderstand what you want, but I think the information you need is already available in the current Pfam release, in the file Pfam-A.regions.tsv. If not, read the descriptions of all the release files carefully, as I know for sure that Swiss-Prot results are pre-calculated for all proteins. It is just a matter of matching the IDs between your proteins and their output.
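As a rough illustration of that ID matching, the Python sketch below streams a tab-separated regions table and keeps only the rows for a chosen set of UniProt accessions. The column names `pfamseq_acc`, `pfamA_acc`, `seq_start` and `seq_end` are an assumption about the file header and should be checked against the release you download. The sample rows and coordinates are invented, and `pfam_regions_for` is a hypothetical helper name.

```python
import csv
import io

def pfam_regions_for(accessions, handle,
                     acc_col="pfamseq_acc", pfam_col="pfamA_acc"):
    """Collect (Pfam accession, start, end) per wanted UniProt accession."""
    wanted = set(accessions)
    found = {}
    # Read row by row so a large file is never loaded into memory at once.
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[acc_col] in wanted:
            found.setdefault(row[acc_col], []).append(
                (row[pfam_col], int(row["seq_start"]), int(row["seq_end"]))
            )
    return found

# Invented sample table using the assumed header, for illustration only.
sample = (
    "pfamseq_acc\tseq_version\tcrc64\tmd5\tpfamA_acc\tseq_start\tseq_end\n"
    "P92188\t1\tAAAA\tbbbb\tPF00227\t30\t215\n"
    "Q00000\t1\tCCCC\tdddd\tPF00069\t10\t260\n"
)

regions = pfam_regions_for({"P92188"}, io.StringIO(sample))
print(regions)

# For a real download, the handle could instead come from, e.g.:
#   import gzip
#   handle = gzip.open("Pfam-A.regions.tsv.gz", "rt")
```

The accessions passed in would be the ones parsed from the Swiss-Prot FASTA headers, such as P92188 in the question.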

ADD REPLY • link 4.7 years ago by Mensur Dlakic ★ 27k