Question: Mapping DEXseq exons back to genomic DNA sequences
0
gravatar for htyu1lmzta
23 months ago by
htyu1lmzta0
htyu1lmzta0 wrote:

I have run a series of experiments using DEXSeq to identify differentially expressed exons in my RNA-seq data, and have produced a list of exons of interest:

> exons
                  feature chr     start      stop strand
1 ENSMUSG00000009927:E001   9  44407139  44407659      +
2 ENSMUSG00000022836:E034  16  35000273  35002420      +
3 ENSMUSG00000025745:E001   5  30118304  30119425      -
4 ENSMUSG00000028034:E001   3 152210422 152210472      +
5 ENSMUSG00000028656:E001   4 122859047 122860217      -
...

However, because DEXSeq requires a custom GFF file as input (prepared using the bundled script, dexseq_prepare_ annotation.py), differentially expressed exons are not identified using Ensembl identifiers (e.g., ENSMUSE00001401988) but rather using DEXSeq's internal numbering (e.g., E001, E034). Additionally, the exons output by DEXSeq do not always appear to correspond exactly to DEXSeq exons. Sometimes they represent only a portion of the full exon as annotated in Ensembl.

I am interested in analyzing some of the features of the protein sequences of these exons, e.g. InterPro domain content. I therefore matched DEXSeq gene/exon combinations (in the column feature in the example table above) to entries in the DEXSeq GFF file to identify the coordinates of each DEXSeq exon. However, I am having a very difficult time figuring out how to retrieve the protein sequences that correspond to these coordinates.

I tried making a TxDb object (from the GenomicFeatures package) to match DEXSeq GFF exons to the original Ensembl GTF exons, but this led to the discovery that DEXSeq exons do not always correspond exactly to complete Ensembl exons. I next tried making a GRanges object to use getSeq, but this appears to require a BSgenome object and there are no BSgenome packages for Ensembl. Finally, I tried retrieving the cDNA sequence from biomaRt using the following code:

for (i in seq_len(nrow(exons))) {
  seq <- getSequence(chromosome = exons$chr[i],
              start = exons$start[i],
              end = exons$stop[i],
              seqType = "cdna",
              type = "ensembl_exon_id",
              mart = ensembl) 
}

However, each call to getSequence returns more than one sequence and each sequence appears to be associated with multiple exons. Additionally, it requires that biomaRt use the same Ensembl version as my DEXSeq analysis which seems like it will hinder reproducibility.

There must be a simple way to retrieve the DNA/protein sequences corresponding to the genomic coordinates of the DEXSeq exons in Ensembl 89. What am I missing?

sequencing rna-seq R • 888 views
ADD COMMENTlink modified 22 months ago by geek_y9.7k • written 23 months ago by htyu1lmzta0
0
gravatar for geek_y
22 months ago by
geek_y9.7k
Barcelona/CRG/London/Imperial
geek_y9.7k wrote:

The output of dexseq_prepare_ annotation.py will have the transcript and gene information, to which each exon belongs to. So you can pull the gene name or transcript name of each differential exons and get their fasta sequences. Infact, in your output, you have the gene information. If you want to know the transctipt information, you need to check the flattened file from dexseq_prepare_ annotation.py

ADD COMMENTlink written 22 months ago by geek_y9.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1022 users visited in the last hour