How To Best Manipulate Deseq Output?
1
1
Entering edit mode
9.6 years ago
derekc10741 ▴ 20

Hi all,

I was wondering whats the best way to extract the identifier of each differentially expressed gene? I've been told it's possible to compare it to a GFF file to obtain protein identifiers but I don't know how best to go about it. The output that I'm referring to is the output from the nbinomTest as performed on DEseq data, which has then been altered so only those with significant p values remain.

Thanks

deseq r • 3.1k views
0
Entering edit mode

what are the identifiers in your counts table? For example, i make my counts table based on ensembl transcripts so i have an ENSMUST for each rowname.

0
Entering edit mode

This is an example of the identifer I'm using gi|222109225|ref|NC_011992.1|:c1488979-1488104 I'm a novice at this so I'm not quite sure what thats for

1
Entering edit mode
9.6 years ago
Dave Bridges ★ 1.4k

This is generally what I use to generate the counts table, its also available here as a R markdown file. You will need to manually set the location of your alignment files and your sample ids.

This will generate both exon and transcript counts tables with the rownames being the ensembl transcript id or ensembl exon ids. These tables can be used in DESeq or DEXseq if you use

data <- read.csv("", rownames="X)


You can use the transcript or exon identifiers directly, but if you need to get the gene names I would use bioMart (or the R package biomaRt) to get those.

1
Entering edit mode

Just an FYI: while you "can" use the exon table as you describe for DEXSeq -- as in, you can plug it into the framework -- this is not how the authors suggest you do so. Your exons object will have many overlapping exons from different transcripts of the same gene, as well as from different genes that happen to share the same transcription locus. The approach they present in their paper "flattens" gene structure(s) into "counting bins," which in many cases is a 1:1 relationship with exons of a gene, but is something very different in regions of overlapping exons and genes.

0
Entering edit mode

Maybe I don't quite understand here. I thought the way described above generates a row for each exon. If an exon is in two transcripts from different, there would be two different ensemblexonids. I thought that this flattened each exon into its own counting bin and that was ok.

0
Entering edit mode

Look at the bottom of Figure 1 of their paper. That's the exon structure you want -- there is a disjoint set of "counting bins" there. What you propose doesn't do this. For instance, I have refseq db built for hg19 sometime back in May of this year, when you do as you suggest:

e <- exons(txdb, ...)
isDisjoint(e) ## FALSE (<- should be TRUE)
r <- reduce(e) ## Collapses overlapping exons, now:
isDisjoint(r) ## TRUE


Using your call to exons, you do get one row per exon, as you say. I was just point out that this isn't the structure that they talk about in the paper, is all.

Updating this comment before a follow up reply: the reduce operation doesn't get you exactly there either, as it doesn't split overlapping exons, it just mashes them together -- I just used it to show an instance of a disjoint set of GRanges