Question

Finding The Sequence Of A Domain

14.0 years ago

Shweta ▴ 90

I want to know how to get the amino acid sequence of a protein domain. For example, ICE (interleukin converting enzyme) has 2 domains: the CARD domain and the Peptidase_C14 domain. Although I have the FASTA sequence of the entire ICE protein, I'd like to know which stretch of the sequence corresponds to the CARD domain and which to the Peptidase_C14 domain. (I have seen a page in the KEGG database that shows this demarcation, but I cannot recall where it is.)

protein domain • 12k views

Check the answers to this similar question: Extract Domain Sequences From Multiple Sequences

written 14.0 years ago by Rm 8.3k • updated 5.8 years ago by zx8754 12k

@Moon, can you provide your FASTA file?

written 14.0 years ago by Aleksandr Levchuk 3.2k

UniProt sequence:

>sp|P29466|CASP1_HUMAN Caspase-1 OS=Homo sapiens GN=CASP1 PE=1 SV=1
MADKVLKEKRKLFIRSMGEGTINGLLDELLQTRVLNKEEMEKVKRENATVMDKTRALIDS
VIPKGAQACQICITYICEEDSYLAGTLGLSADQTSGNYLNMQDSQGVLSSFPAPQAVQDN
PAMPTSSGSEGNVKLCSLEEAQRIWKQKSAEIYPIMDKSSRTRLALIICNEEFDSIPRRT
GAEVDITGMTMLLQNLGYSVDVKKNLTASDMTTELEAFAHRPEHKTSDSTFLVFMSHGIR
EGICGKKHSEQVPDILQLNAIFNMLNTKNCPSLKDKPKVIIIQACRGDSPGVVWFKDSVG
VSGNLSLPTTEEFEDDAIKKAHIEKDFIAFCSSTPDNVSWRHPTMGSVFIGRLIEHMQEY
ACSCDVEEIFRKVRFSFEQPDGRAQMPTTERVTLTRCFYLFPGH

written 14.0 years ago by Shweta ▴ 90

@Moon, yes, my method worked on your sequence. The results are at http://biocluster.ucr.edu/~alevchuk/finding-the-sequence-of-a-domain/results/ but I obfuscated the amino acid sequences with X's, in case I am right in suspecting that this is a homework assignment.

written 14.0 years ago by Aleksandr Levchuk 3.2k • updated 5.8 years ago by Ram 45k

Ram · Answer 1 · 2011-07-12

The domains you are interested in, CARD and Peptidase_C14, are both in the Pfam-A database (release 25.0), so the following method will work.

This method has the following advantages:

It works on arbitrary sequences, even ones that are not in any public database, such as simulated data.
It works on arbitrary protein domain HMMs, such as models you build yourself from multiple sequence alignments (MSAs).
It is completely scripted, so there is no need to click through a potentially large number of links.

Step 1

To extract the domain sequences, you first need the start and end positions of each domain. The following shows how to get those positions with HMMER 3 and the Pfam-A database (release 25.0).

I assume that your original sequence is in my.fasta.

NOTE: For less well-known domains, you can repeat this search against Pfam-B (release 25.0).

Output of Step 1: my.fasta-found-domains.tab-extract.tab
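The scripts for this step are not reproduced in this thread, so the following is only a rough Python sketch of the coordinate-extraction idea, not the author's 004-extract-coords script. It assumes the scan was run with HMMER 3's hmmscan and its --domtblout option, and it reads the whitespace-separated columns of that per-domain table (target name in column 1, query name in column 4, independent E-value in column 13, envelope start and end in columns 20 and 21). The names DomainHit, parse_domtblout and max_i_evalue are hypothetical, the 1e-5 cutoff is an arbitrary illustration, and the table rows are synthetic values made up for the example, not real hmmscan output.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    domain: str      # model name (hmmscan "target name" column)
    query: str       # name of the protein sequence that was scanned
    i_evalue: float  # independent E-value of this domain hit
    env_from: int    # envelope start on the query, 1-based
    env_to: int      # envelope end on the query, 1-based, inclusive


def parse_domtblout(text: str, max_i_evalue: float = 1e-5) -> List[DomainHit]:
    """Collect domain coordinates from the text of a --domtblout table."""
    hits = []
    for line in text.splitlines():
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        if len(cols) < 22:
            continue  # not a complete data row
        hit = DomainHit(
            domain=cols[0],
            query=cols[3],
            i_evalue=float(cols[12]),
            env_from=int(cols[19]),
            env_to=int(cols[20]),
        )
        # Keep only hits at or below the (assumed) E-value cutoff.
        if hit.i_evalue <= max_i_evalue:
            hits.append(hit)
    return hits


# Synthetic rows in domtblout layout; every value here is invented.
EXAMPLE = """\
# synthetic example table (not real hmmscan output)
DomA PF00000.1 85 toy_protein - 404 1.2e-20 70.1 0.1 1 1 2.0e-24 1.2e-20 70.1 0.1 1 85 5 88 3 90 0.97 hypothetical domain A
DomB PF00000.2 240 toy_protein - 404 3.0e-40 135.0 0.0 1 1 5.0e-44 3.0e-40 135.0 0.0 2 239 152 400 150 402 0.95 hypothetical domain B
DomC PF00000.3 60 toy_protein - 404 0.5 5.0 0.0 1 1 0.3 0.5 5.0 0.0 1 60 200 250 198 255 0.60 hypothetical domain C
"""

if __name__ == "__main__":
    for hit in parse_domtblout(EXAMPLE):
        print(hit.query, hit.domain, hit.env_from, hit.env_to)
```

The sketch uses the envelope coordinates rather than the alignment coordinates because the envelope is the wider region that HMMER assigns to the domain. Whether that is the right choice depends on how tightly you want the domain boundaries drawn.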

Step 2

Now that you have the coordinates, you can extract the sequences for the domain with the following R script:

005-extract-seq

Output of Step 2: results
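The R script itself is not shown in this thread, so here is a minimal Python sketch of the same idea: given a FASTA file and 1-based, inclusive start and end positions, cut out the domain sequence. It is not a translation of 005-extract-seq. The helper names read_fasta and slice_domain are hypothetical, and the toy record holds only the first 20 residues of the caspase-1 sequence posted above, with arbitrary coordinates chosen for illustration.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    """
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    # Join wrapped sequence lines into one string per record.
    return {n: "".join(chunks) for n, chunks in parts.items()}


def slice_domain(sequence: str, start: int, end: int) -> str:
    """Return residues start..end, using 1-based inclusive positions."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"invalid range {start}-{end} for length {len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


# Toy record: first 20 residues of the posted sequence, wrapped at 10.
TOY_FASTA = """\
>toy_protein first 20 residues only
MADKVLKEKR
KLFIRSMGEG
"""

if __name__ == "__main__":
    seqs = read_fasta(TOY_FASTA)
    print(slice_domain(seqs["toy_protein"], 3, 8))  # residues 3 to 8
```

The off-by-one conversion in slice_domain is the part most worth checking: domain coordinates from HMMER and from UniProt feature tables count from 1 and include the end position, while Python slices count from 0 and exclude it.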

The whole package

To run the entire method with example data, do this:

mkdir finding-the-sequence-of-a-domain
cd finding-the-sequence-of-a-domain

wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/001-download-data
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/002-prepare-hmm
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/003-scan
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/004-extract-coords
wget https://raw.github.com/alevchuk/finding-the-sequence-of-a-domain/master/005-extract-seq

chmod +x 00*

time ./001-download-data    # Takes ~45 seconds

# Requires HMMER 3
time ./002-prepare-hmm      # Takes ~1.5 minutes
time ./003-scan my.fasta    # Takes ~30 seconds

time ./004-extract-coords my.fasta-found-domains.tab

# Requires Biostrings R package
# (install instructions here http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html)
time ./005-extract-seq my.fasta

Resulting FASTA files will be in the results directory, just like these ones: https://github.com/alevchuk/finding-the-sequence-of-a-domain/tree/master/results

score 3 · Answer 2 · 2011-07-14

14.0 years ago

Stajich ▴ 30

I think we also have a solution for this in BioPerl; the code is available here

It assumes you have run HMMER 3 with the --domtblout option and that you pass the resulting table and the FASTA file of your proteins as arguments.

score 2 · Answer 3 · 2011-07-12

There are several protein domain databases that provide this information. Three examples are Pfam, SMART and PROSITE. There is also InterPro, which combines data from various domain databases. All of these databases provide a server where you can paste a protein sequence or a UniProt accession number, and it will report the positions of the domains. If you are mainly interested in the sequence of the domain instance, try SMART and PROSITE, because those two give you the sequence directly. With the other databases, you have to extract the domain sequences yourself from the reported domain positions.

For well-known domains (such as the ones in your example) there is another possibility: those domains are often annotated directly in the UniProt entry (see e.g. here). If you scroll down to the feature table, the CARD domain is listed; click on the line '1-91' to highlight the sequence of the CARD domain.