Question: Mapping DEXseq exons back to genomic DNA sequences
gravatar for htyu1lmzta
3.5 years ago by
htyu1lmzta30 wrote:

I have run a series of experiments using DEXSeq to identify differentially expressed exons in my RNA-seq data, and have produced a list of exons of interest:

> exons
                  feature chr     start      stop strand
1 ENSMUSG00000009927:E001   9  44407139  44407659      +
2 ENSMUSG00000022836:E034  16  35000273  35002420      +
3 ENSMUSG00000025745:E001   5  30118304  30119425      -
4 ENSMUSG00000028034:E001   3 152210422 152210472      +
5 ENSMUSG00000028656:E001   4 122859047 122860217      -

However, because DEXSeq requires a custom GFF file as input (prepared using the bundled script, dexseq_prepare_, differentially expressed exons are not identified using Ensembl identifiers (e.g., ENSMUSE00001401988) but rather using DEXSeq's internal numbering (e.g., E001, E034). Additionally, the exons output by DEXSeq do not always appear to correspond exactly to DEXSeq exons. Sometimes they represent only a portion of the full exon as annotated in Ensembl.

I am interested in analyzing some of the features of the protein sequences of these exons, e.g. InterPro domain content. I therefore matched DEXSeq gene/exon combinations (in the column feature in the example table above) to entries in the DEXSeq GFF file to identify the coordinates of each DEXSeq exon. However, I am having a very difficult time figuring out how to retrieve the protein sequences that correspond to these coordinates.

I tried making a TxDb object (from the GenomicFeatures package) to match DEXSeq GFF exons to the original Ensembl GTF exons, but this led to the discovery that DEXSeq exons do not always correspond exactly to complete Ensembl exons. I next tried making a GRanges object to use getSeq, but this appears to require a BSgenome object and there are no BSgenome packages for Ensembl. Finally, I tried retrieving the cDNA sequence from biomaRt using the following code:

for (i in seq_len(nrow(exons))) {
  seq <- getSequence(chromosome = exons$chr[i],
              start = exons$start[i],
              end = exons$stop[i],
              seqType = "cdna",
              type = "ensembl_exon_id",
              mart = ensembl) 

However, each call to getSequence returns more than one sequence and each sequence appears to be associated with multiple exons. Additionally, it requires that biomaRt use the same Ensembl version as my DEXSeq analysis which seems like it will hinder reproducibility.

There must be a simple way to retrieve the DNA/protein sequences corresponding to the genomic coordinates of the DEXSeq exons in Ensembl 89. What am I missing?

sequencing rna-seq R • 1.4k views
ADD COMMENTlink modified 3.5 years ago by geek_y11k • written 3.5 years ago by htyu1lmzta30
gravatar for geek_y
3.5 years ago by
geek_y11k wrote:

The output of dexseq_prepare_ will have the transcript and gene information, to which each exon belongs to. So you can pull the gene name or transcript name of each differential exons and get their fasta sequences. Infact, in your output, you have the gene information. If you want to know the transctipt information, you need to check the flattened file from dexseq_prepare_

ADD COMMENTlink written 3.5 years ago by geek_y11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2023 users visited in the last hour