Question: Mapping Protein Domains To Exons
6
gravatar for Anupam Sinha
7.3 years ago by
Anupam Sinha60
Anupam Sinha60 wrote:

I have a set of protein sequences (from Biomart ENSEMBL Release 62) of Homo sapiens. On these sequences I have run PFAM-HMMER to find domains on these sequences. I also have coordinates (coding sequence as well as genomic)for the exons that make up these proteins. I want to map the protein domains to these exons. How do I accomplish this ? Solutions in any of these- PERL, R, Bioconductor,Python, Biomart- will be very helpful. Thanks in advance.

R exon protein mapping • 7.1k views
ADD COMMENTlink written 7.3 years ago by Anupam Sinha60
1

In the future, please add a note when you cross-post to other forums like the bioconductor email list.

ADD REPLYlink written 7.3 years ago by Sean Davis25k
2
gravatar for Sean Davis
7.3 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

I have already answered this on the bioconductor email list, but to summarize:

  1. Use the GenomicFeatures bioconductor package to get the locations of the exons.
  2. Import your PFAM information as a GRanges object (from the GenomicRanges bioconductor package).
  3. Use findOverlaps() method to find overlaps between the two interval sets.
ADD COMMENTlink written 7.3 years ago by Sean Davis25k

Thanks a lot Sean

ADD REPLYlink written 7.3 years ago by Anupam Sinha60

But the pfam information is a protein sequence, i.e aminoacids, so how doyou import this with Genomic ranges and then overlap with the exons? Because, what you are trying to get is the genomic coordinates of this domains

ADD REPLYlink written 7.1 years ago by Tonig430
1
gravatar for Larry_Parnell
7.3 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

I don't have a solution but wish to point out a pitfall. Small exons or exons that encode only the extreme amino- or carboxy-terminal portion of a Pfam domain may score too low to be retained for your downstream analysis. One solution to this is to map your Pfam hits against the model and note where there are missing residues. For example, an exon of 60 nt (20 aa) may encode Pfam model residues 1 to 20, but you don't see that hit (high E-value, low score, etc). At the same time, your first real, high-scoring hit begins with Pfam residue 21. Hmm, where are those first 20 aa? Are they in the upstream exon?

ADD COMMENTlink written 7.3 years ago by Larry_Parnell16k
1
gravatar for Pierre Lindenbaum
7.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

I've modified one of my program to create protein2genome, a tool mapping a peptide/start/end to the human genome using the ucsc knowngene database.

The C++ source is available here: http://code.google.com/p/variationtoolkit/source/browse/trunk/src/protein2genome.cpp and has justbeen integrated in my experimental package: varkit

Example:

$ echo -e "#Pep\tpepStart\tpepEnd\tDomain\nZC3H7B\t82\t115\tTPR\nNOTC2_HUMAN\t26\t63\tEGF_DOMAIN" |\
  protein2genome |\
  verticalize

>>> 2
$1  #Pep                ZC3H7B
$2  pepStart            82
$3  pepEnd              115
$4  Domain              TPR
$5  knownGene.name      uc003azw.2
$6  knownGene.chrom     chr22
$7  knownGene.strand    +
$8  knownGene.exon      Exon 4
$9  domain.start        41721879
$10 domain.end          41721922
<<< 2

>>> 3
$1  #Pep                ZC3H7B
$2  pepStart            82
$3  pepEnd              115
$4  Domain              TPR
$5  knownGene.name      uc003azw.2
$6  knownGene.chrom     chr22
$7  knownGene.strand    +
$8  knownGene.exon      Exon 5
$9  domain.start        41723209
$10 domain.end          41723268
<<< 3

>>> 4
$1  #Pep                NOTC2_HUMAN
$2  pepStart            26
$3  pepEnd              63
$4  Domain              EGF_DOMAIN
$5  knownGene.name      uc001eik.2
$6  knownGene.chrom     chr1
$7  knownGene.strand    -
$8  knownGene.exon      Exon 2
$9  domain.start        120572528
$10 domain.end          120572609
<<< 4

>>> 5
$1  #Pep                NOTC2_HUMAN
$2  pepStart            26
$3  pepEnd              63
$4  Domain              EGF_DOMAIN
$5  knownGene.name      uc001eik.2
$6  knownGene.chrom     chr1
$7  knownGene.strand    -
$8  knownGene.exon      Exon 3
$9  domain.start        120548178
$10 domain.end          120548211
<<< 5
ADD COMMENTlink written 7.3 years ago by Pierre Lindenbaum119k

Thanks a lot Pierre

ADD REPLYlink written 7.3 years ago by Anupam Sinha60

Dear Pierre, how would I use your tool if I had EnsEMBL sequences instead of UCSC known genomes

ADD REPLYlink written 6.9 years ago by Anupam40
1

there is a table mapping the UCSC known Genes to ENSEMBL: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownToEnsembl.txt.gz

ADD REPLYlink written 6.9 years ago by Pierre Lindenbaum119k

Pierre,

Would this also work for cases where one protein domain is encoded by i.e. split across > 1 exon?

I want to only report the number of exons that encode a certain protein domain? How would I modify use of your protein2genome for my purposes?

ADD REPLYlink written 3.6 years ago by Anand Rao210

Dear Pierre,

I am also stuck with the same problem.

i have proteins domain start and end position as well as the .gff3 files for several plants can you help meto get position of domain start and end on cds? i have seen your page on "backlocate" but it works only on UCSC but my data is not available at UCSC or ncbi. thanks in advance !! i hope u will help me out

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by amish0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour