How to associate Pfam and Swissprot Proteins
4.7 years ago

I'm working on a T. cruzi protein classifier. I've come across several different datasets and I need to join the information from them.

Here is what I've done:

  1. Pfam Data

I downloaded the Pfam-A.hmm data (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz) and ran hmmscan to get tabular domain information for my proteins:

hmmscan --tblout output/output-file-tbl data/Pfam-A.hmm input/T_cruzi_TriTryp-25.fasta

  2. Swissprot Data

Then I downloaded 103 T. cruzi proteins from Swiss-Prot and ran BLAST to align them against the same T. cruzi FASTA file (T_cruzi_TriTryp-25.fasta) that I used for the Pfam scan:

formatdb -p T -i input/T_cruzi_TriTryp-25.fasta

blastall -p blastp -d input/T_cruzi_TriTryp-25.fasta -i input/swiss-prot--T-cruzi.fasta -o output/swiss-prot--T-cruzi.blastp.xml -b 2 -e 0.05 -m 7 1> output/out-blast 2> output/error-blast

Then I parsed the XML output in R to get tabular information, and this is where I'm stuck.


How do I associate the Swiss-Prot data (2) with the Pfam data (1)?


In Pfam data I have:

  • target_name: Pkinase, His_Phos_1, AAA_33, etc.
  • accession: PF00069, PF00300, PF13671, etc.
  • query_name: TcCLB.509163.70, TcCLB.508625.50, TcCLB.508737.100, etc.

I noticed that the relationship is many-to-many: one accession may appear with more than one query_name, and one query_name may have more than one accession.
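The many-to-many relationship described above can be made explicit by indexing the hmmscan table in both directions. What follows is a minimal Python sketch, assuming the standard whitespace-separated `--tblout` layout in which the first three columns are target name, accession and query name. The function names and the example lines are hypothetical: only the first pairing comes from the lists above, the other pairings and the `.1` version suffixes are invented for illustration.

```python
from collections import defaultdict


def parse_tblout(lines):
    """Yield (target_name, pfam_accession, query_name) from hmmscan --tblout lines."""
    for line in lines:
        # Skip blank lines and the '#' comment/header lines that hmmscan writes.
        if not line.strip() or line.startswith("#"):
            continue
        target_name, accession, query_name = line.split()[:3]
        # Drop a version suffix such as ".1" so that "PF00069.1" becomes "PF00069".
        yield target_name, accession.split(".")[0], query_name


def index_hits(rows):
    """Build both directions of the many-to-many mapping."""
    domains_by_protein = defaultdict(set)
    proteins_by_domain = defaultdict(set)
    for _target, accession, query in rows:
        domains_by_protein[query].add(accession)
        proteins_by_domain[accession].add(query)
    return dict(domains_by_protein), dict(proteins_by_domain)


# Illustrative input only: pairings and version suffixes are made up.
EXAMPLE_LINES = [
    "# target name  accession  query name  accession",
    "Pkinase PF00069.1 TcCLB.509163.70 -",
    "AAA_33 PF13671.1 TcCLB.509163.70 -",
    "Pkinase PF00069.1 TcCLB.508625.50 -",
]

domains_by_protein, proteins_by_domain = index_hits(parse_tblout(EXAMPLE_LINES))
```

Keeping sets of accessions per protein, rather than forcing one row per protein, preserves multi-domain proteins instead of treating them as duplicates.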


In Swissprot data I have:

Iteration_query-def: sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 OS=Trypanosoma cruzi OX=5693 PE=2 SV=1

Hit_def: TcCLB.506167.40 | organism=Trypanosoma_cruzi_CL_Brener_Esmeraldo-like | product=proteasome subunit alpha type-1, putative, 20s proteasome subunit, putative (PSA1) | location=TcChr37-S:1102119-1102916(-) | length=265 | sequence_SO=chromosome | SO=protein_coding


I tried to join on the query name taken from Hit_def, but it is not unique. Am I misunderstanding some concept? Any help would be appreciated.
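One possible way to do the join is sketched below in Python. It assumes that the first `|`-separated field of Hit_def is the same identifier that hmmscan reports as query_name, and that the second `|`-separated field of Iteration_query-def is the Swiss-Prot accession, as in the two example lines above. The function names are hypothetical and the Pfam accessions in the example are placeholders, not real results. Because a protein can carry several domains, the output has one row per (Swiss-Prot accession, gene ID, Pfam accession) triple, so a non-unique key is expected rather than a sign of an error.

```python
def swissprot_accession(query_def):
    """'sp|P92188|PSA1_TRYCR Proteasome ...' -> 'P92188'"""
    return query_def.split("|")[1]


def tritryp_id(hit_def):
    """'TcCLB.506167.40 | organism=...' -> 'TcCLB.506167.40'"""
    return hit_def.split("|")[0].strip()


def join_hits(blast_hits, pfam_by_protein):
    """Join (Iteration_query-def, Hit_def) pairs to Pfam accessions per gene ID.

    Returns one (swissprot_acc, gene_id, pfam_acc) row per domain; BLAST hits
    whose gene ID has no Pfam hit produce no rows.
    """
    rows = []
    for query_def, hit_def in blast_hits:
        gene_id = tritryp_id(hit_def)
        for pfam_acc in sorted(pfam_by_protein.get(gene_id, ())):
            rows.append((swissprot_accession(query_def), gene_id, pfam_acc))
    return rows


# The BLAST fields are taken from the post; the Pfam accessions are placeholders.
blast_hits = [
    (
        "sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 "
        "OS=Trypanosoma cruzi OX=5693 PE=2 SV=1",
        "TcCLB.506167.40 | organism=Trypanosoma_cruzi_CL_Brener_Esmeraldo-like",
    ),
]
pfam_by_protein = {"TcCLB.506167.40": {"PF_EXAMPLE_1", "PF_EXAMPLE_2"}}

joined = join_hits(blast_hits, pfam_by_protein)
```

If a single row per Swiss-Prot protein is needed for the classifier, the Pfam accessions could instead be collapsed into a list or a set of indicator columns per protein.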

alignment pfam • 1.1k views

Maybe I misunderstand what you want, but I think the information you need is already available in the current Pfam release, in the file Pfam-A.regions.tsv. If not, read the descriptions of the release files carefully, as I know for sure that Swiss-Prot results are already pre-calculated for all proteins. It is then just a matter of matching the IDs between your proteins and their output.
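As a rough illustration of the ID matching, the Python sketch below filters a tab-separated regions table down to a set of Swiss-Prot accessions. The column names `pfamseq_acc` and `pfamA_acc` are an assumption about the header of the regions file and should be checked against the release you download; they are exposed as parameters for that reason. The function name is hypothetical and the in-memory table is placeholder data, not real Pfam content.

```python
import csv
import io


def pfam_for_accessions(handle, wanted, seq_col="pfamseq_acc", pfam_col="pfamA_acc"):
    """Collect Pfam accessions for the wanted protein accessions.

    `handle` is any open text file with a tab-separated header row.
    The default column names are assumptions, not verified against a release.
    """
    found = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        protein = row[seq_col]
        if protein in wanted:
            found.setdefault(protein, set()).add(row[pfam_col])
    return found


# Placeholder table standing in for the real (much larger) regions file.
example_table = io.StringIO(
    "pfamseq_acc\tpfamA_acc\tseq_start\tseq_end\n"
    "P92188\tPF_EXAMPLE_1\t1\t20\n"
    "P92188\tPF_EXAMPLE_2\t30\t200\n"
    "X00000\tPF_EXAMPLE_3\t5\t50\n"
)

regions = pfam_for_accessions(example_table, wanted={"P92188"})
```

For the real file, the same function could be given a handle from `gzip.open(path, "rt")` so the table is streamed row by row instead of being loaded into memory.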
