I'm learning how to build a T. cruzi protein classifier. I've come across several different datasets and I need to join the information from them.
Here is what I've done:
- Pfam Data
I downloaded the Pfam-A.hmm data (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz) and ran hmmscan to get tabular information for the proteins:
hmmscan --tblout output/output-file-tbl data/Pfam-A.hmm input/T_cruzi_TriTryp-25.fasta
- Swissprot Data
Then I downloaded 103 T. cruzi proteins from Swiss-Prot and ran BLAST against my T_cruzi_TriTryp-25.fasta data to align these proteins:
formatdb -p T -i input/T_cruzi_TriTryp-25.fasta
blastall -p blastp -d input/T_cruzi_TriTryp-25.fasta -i input/swiss-prot--T-cruzi.fasta -o output/swiss-prot--T-cruzi.blastp.xml -b 2 -e 0.05 -m 8 1> output/out-blast 2> output/error-blast
Then I parsed the XML output in R to get tabular information, but here's the problem.
How do I associate the Swiss-Prot data (2) with the Pfam data (1)?
In Pfam data I have:
- target_name: Pkinase, His_Phos_1, AAA_33, etc.
- accession: PF00069, PF00300, PF13671, etc.
- query_name: TcCLB.509163.70, TcCLB.508625.50, TcCLB.508737.100, etc.
I noticed that one accession may have more than one query_name, and one query_name may have more than one accession.
In Swiss-Prot data I have:
- Iteration_query-def: sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 OS=Trypanosoma cruzi OX=5693 PE=2 SV=1
- Hit_def: TcCLB.506167.40 | organism=Trypanosoma_cruzi_CL_Brener_Esmeraldo-like | product=proteasome subunit alpha type-1, putative, 20s proteasome subunit, putative (PSA1) | location=TcChr37-S:1102119-1102916(-) | length=265 | sequence_SO=chromosome | SO=protein_coding
I tried to join on the gene ID at the start of Hit_def (which corresponds to the Pfam query_name), but it is not unique. Am I misunderstanding some concept? Any help would be appreciated.
Maybe I misunderstand what you want, but I think the information you want is already available in the current Pfam release, in the file Pfam-A.regions.tsv. If not, read the descriptions of all the files carefully, as I know for sure that Swiss-Prot results are already pre-calculated for all proteins. It is just a matter of matching the IDs between your proteins and their output.
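If you do want to join your own hmmscan and BLAST tables, here is a rough sketch of the ID matching in Python (plain standard library). It assumes the gene ID is the first "|"-separated token of Hit_def and the UniProt accession is the second "|"-separated field of Iteration_query-def, as in the examples in the question. The function names and the Pfam row in the demo are hypothetical placeholders, not real hmmscan results. Note that the key being non-unique is expected: one protein can carry several domains, so the join is many-to-many and produces one row per (Swiss-Prot protein, gene, domain) combination.

```python
from collections import defaultdict


def parse_tblout_line(line):
    """Parse one non-comment line of `hmmscan --tblout` output.

    The first 18 columns are whitespace-separated; the 19th is a free-text
    description, so the split is limited to 18 cuts.
    """
    fields = line.split(None, 18)
    return {
        "target_name": fields[0],
        "accession": fields[1],
        "query_name": fields[2],
    }


def gene_id_from_hit_def(hit_def):
    """Return the first '|'-separated token of Hit_def (assumed to be the gene ID)."""
    return hit_def.split("|", 1)[0].strip()


def uniprot_acc_from_query_def(query_def):
    """Return the accession from a header such as 'sp|P92188|PSA1_TRYCR ...'."""
    return query_def.split()[0].split("|")[1]


def join_hits(blast_hits, pfam_rows):
    """Many-to-many join of BLAST hits and Pfam rows on the gene ID.

    blast_hits: iterable of (Iteration_query-def, Hit_def) string pairs.
    pfam_rows:  iterable of dicts as returned by parse_tblout_line.
    Returns a list of (uniprot_acc, gene_id, pfam_accession, pfam_name).
    """
    domains = defaultdict(set)
    for row in pfam_rows:
        domains[row["query_name"]].add((row["accession"], row["target_name"]))

    joined = []
    for query_def, hit_def in blast_hits:
        gene = gene_id_from_hit_def(hit_def)
        acc = uniprot_acc_from_query_def(query_def)
        # Genes without any Pfam hit are silently skipped (inner join).
        for pfam_acc, pfam_name in sorted(domains.get(gene, ())):
            joined.append((acc, gene, pfam_acc, pfam_name))
    return joined


# Demo: the BLAST pair comes from the question; the Pfam row is a made-up
# placeholder used only to show the shape of the result.
demo_blast = [(
    "sp|P92188|PSA1_TRYCR Proteasome subunit alpha type-1 "
    "OS=Trypanosoma cruzi OX=5693 PE=2 SV=1",
    "TcCLB.506167.40 | organism=Trypanosoma_cruzi_CL_Brener_Esmeraldo-like",
)]
demo_pfam = [{
    "target_name": "ExampleDomain",   # hypothetical
    "accession": "PF_EXAMPLE",        # hypothetical
    "query_name": "TcCLB.506167.40",
}]

for row in join_hits(demo_blast, demo_pfam):
    print("\t".join(row))
```

To use it on real data, skip the lines starting with "#" in the --tblout file, pass the rest through parse_tblout_line, and feed the (Iteration_query-def, Hit_def) pairs from your parsed BLAST table into join_hits. The same logic is an ordinary merge on the gene ID column if you prefer to stay in R.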