I have a set of protein sequences (from Biomart ENSEMBL Release 62) of Homo sapiens. On these sequences I have run PFAM-HMMER to find domains on these sequences. I also have coordinates (coding sequence as well as genomic)for the exons that make up these proteins. I want to map the protein domains to these exons. How do I accomplish this ? Solutions in any of these- PERL, R, Bioconductor,Python, Biomart- will be very helpful. Thanks in advance.
I have already answered this on the bioconductor email list, but to summarize:
- Use the GenomicFeatures bioconductor package to get the locations of the exons.
- Import your PFAM information as a GRanges object (from the GenomicRanges bioconductor package).
- Use findOverlaps() method to find overlaps between the two interval sets.
I don't have a solution but wish to point out a pitfall. Small exons or exons that encode only the extreme amino- or carboxy-terminal portion of a Pfam domain may score too low to be retained for your downstream analysis. One solution to this is to map your Pfam hits against the model and note where there are missing residues. For example, an exon of 60 nt (20 aa) may encode Pfam model residues 1 to 20, but you don't see that hit (high E-value, low score, etc). At the same time, your first real, high-scoring hit begins with Pfam residue 21. Hmm, where are those first 20 aa? Are they in the upstream exon?
I've modified one of my program to create protein2genome, a tool mapping a peptide/start/end to the human genome using the ucsc knowngene database.
The C++ source is available here: http://code.google.com/p/variationtoolkit/source/browse/trunk/src/protein2genome.cpp and has justbeen integrated in my experimental package: varkit
Example:
$ echo -e "#Pep\tpepStart\tpepEnd\tDomain\nZC3H7B\t82\t115\tTPR\nNOTC2_HUMAN\t26\t63\tEGF_DOMAIN" |\
protein2genome |\
verticalize
>>> 2
$1 #Pep ZC3H7B
$2 pepStart 82
$3 pepEnd 115
$4 Domain TPR
$5 knownGene.name uc003azw.2
$6 knownGene.chrom chr22
$7 knownGene.strand +
$8 knownGene.exon Exon 4
$9 domain.start 41721879
$10 domain.end 41721922
<<< 2
>>> 3
$1 #Pep ZC3H7B
$2 pepStart 82
$3 pepEnd 115
$4 Domain TPR
$5 knownGene.name uc003azw.2
$6 knownGene.chrom chr22
$7 knownGene.strand +
$8 knownGene.exon Exon 5
$9 domain.start 41723209
$10 domain.end 41723268
<<< 3
>>> 4
$1 #Pep NOTC2_HUMAN
$2 pepStart 26
$3 pepEnd 63
$4 Domain EGF_DOMAIN
$5 knownGene.name uc001eik.2
$6 knownGene.chrom chr1
$7 knownGene.strand -
$8 knownGene.exon Exon 2
$9 domain.start 120572528
$10 domain.end 120572609
<<< 4
>>> 5
$1 #Pep NOTC2_HUMAN
$2 pepStart 26
$3 pepEnd 63
$4 Domain EGF_DOMAIN
$5 knownGene.name uc001eik.2
$6 knownGene.chrom chr1
$7 knownGene.strand -
$8 knownGene.exon Exon 3
$9 domain.start 120548178
$10 domain.end 120548211
<<< 5
Dear Pierre, how would I use your tool if I had EnsEMBL sequences instead of UCSC known genomes
there is a table mapping the UCSC known Genes to ENSEMBL: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownToEnsembl.txt.gz
Pierre,
Would this also work for cases where one protein domain is encoded by i.e. split across > 1 exon?
I want to only report the number of exons that encode a certain protein domain? How would I modify use of your protein2genome for my purposes?
Dear Pierre,
I am also stuck with the same problem.
i have proteins domain start and end position as well as the .gff3 files for several plants can you help meto get position of domain start and end on cds? i have seen your page on "backlocate" but it works only on UCSC but my data is not available at UCSC or ncbi. thanks in advance !! i hope u will help me out
In the future, please add a note when you cross-post to other forums like the bioconductor email list.