How To Best Manipulate Deseq Output?

1

Entering edit mode

12.6 years ago

derekc10741 ▴ 20

Hi all,

I was wondering whats the best way to extract the identifier of each differentially expressed gene? I've been told it's possible to compare it to a GFF file to obtain protein identifiers but I don't know how best to go about it. The output that I'm referring to is the output from the nbinomTest as performed on DEseq data, which has then been altered so only those with significant p values remain.

Thanks

deseq r • 4.1k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 12.6 years ago by derekc10741 ▴ 20

0

Entering edit mode

what are the identifiers in your counts table? For example, i make my counts table based on ensembl transcripts so i have an ENSMUST for each rowname.

ADD REPLY • link 12.6 years ago by Dave Bridges ★ 1.4k

0

Entering edit mode

This is an example of the identifer I'm using gi|222109225|ref|NC_011992.1|:c1488979-1488104 I'm a novice at this so I'm not quite sure what thats for

ADD REPLY • link 12.6 years ago by derekc10741 ▴ 20

1

Entering edit mode

12.6 years ago

Dave Bridges ★ 1.4k

This is generally what I use to generate the counts table, its also available here as a R markdown file. You will need to manually set the location of your alignment files and your sample ids.

	require(GenomicFeatures)
	require(biomaRt)
	#set this to where your tophat_out directory is.
	working_directory = "/some_directory/"
	setwd(working_directory)
	#make a database of transcripts from the ensembl assembly
	#first get the current release 69 from ensembl, change this to your species if not using mice
	txdb <- makeTranscriptDbFromBiomart(biomart="ensembl",dataset = 'mmusculus_gene_ensembl')
	#make exon and transcript annotation objects
	save(txdb, file="txdb.Robject")
	exons <- exons(txdb, columns=c('gene_id', 'tx_id', 'tx_name', 'tx_chrom', 'tx_strand', 'exon_id', 'exon_name', 'cds_id', 'cds_name', 'cds_chrom', 'cds_strand', 'exon_rank'))
	save(exons, file="exons.Robject")
	transcripts <- transcripts(txdb, columns=c('gene_id', 'tx_id', 'tx_name', 'exon_id', 'exon_name', 'exon_chrom', 'exon_strand', 'cds_id', 'cds_name', 'cds_chrom', 'cds_strand', 'exon_rank'))


	require(GenomicRanges)
	require(Rsamtools)

	#set list of sample ids as a vector
	sample_ids = seq(12849,12858)

	transcript.countsTable <- data.frame(
	row.names = as.vector(unlist(elementMetadata(transcripts)['tx_name'])))
	exon.countsTable <- data.frame(
	row.names = as.vector(unlist(elementMetadata(exons)['exon_name'])))

	#this forloop iterates over the sample_ids and generates exon and transcript counts for each sample_id
	for(sample_id in sample_ids) {
	#read alignment
	align <- readBamGappedAlignments(sprintf("tophat_out/Sample_%s/accepted_hits.bam", sample_id))
	#count the overlapping reads for the transcripts
	transcript.counts <- countOverlaps(transcripts, align)
	#reassign to a specific transcript.counts object.
	assign(sprintf("transcript.counts.%s", sample_id), transcript.counts)
	#add this column to the countsTable
	transcript.countsTable <- cbind(transcript.countsTable,transcript.counts)
	remove(transcript.counts)
	#count the overlapping reads for the exons
	exon.counts <- countOverlaps(exons, align)
	#reassign to a specific transcript.counts object.
	assign(sprintf("exon.counts.%s", sample_id), exon.counts)
	#add this column to the countsTable
	exon.countsTable <- cbind(exon.countsTable, exon.counts)
	#remove the align, transcript.counts and exon.counts objects for the next loop
	remove(align)
	remove(transcript.counts)
	remove(exon.counts)
	}

	summary(transcriptCountsTable)
	summary(exonCountsTable)

	#write these two counts tables to csv files.
	write.csv(transcriptCountsTable, "transcript_counts_table.csv")
	write.csv(exonCountsTable, "exon_counts_table.csv")

view raw counts_table.R hosted with ❤ by GitHub

This will generate both exon and transcript counts tables with the rownames being the ensembl transcript id or ensembl exon ids. These tables can be used in DESeq or DEXseq if you use

data <- read.csv("", rownames="X)

You can use the transcript or exon identifiers directly, but if you need to get the gene names I would use bioMart (or the R package biomaRt) to get those.

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 12.6 years ago by Dave Bridges ★ 1.4k

1

Entering edit mode

Just an FYI: while you "can" use the exon table as you describe for DEXSeq -- as in, you can plug it into the framework -- this is not how the authors suggest you do so. Your exons object will have many overlapping exons from different transcripts of the same gene, as well as from different genes that happen to share the same transcription locus. The approach they present in their paper "flattens" gene structure(s) into "counting bins," which in many cases is a 1:1 relationship with exons of a gene, but is something very different in regions of overlapping exons and genes.

ADD REPLY • link 12.6 years ago by Steve Lianoglou 5.2k

0

Entering edit mode

Maybe I don't quite understand here. I thought the way described above generates a row for each exon. If an exon is in two transcripts from different, there would be two different ensemblexonids. I thought that this flattened each exon into its own counting bin and that was ok.

ADD REPLY • link 12.6 years ago by Dave Bridges ★ 1.4k

0

Entering edit mode

Look at the bottom of Figure 1 of their paper. That's the exon structure you want -- there is a disjoint set of "counting bins" there. What you propose doesn't do this. For instance, I have refseq db built for hg19 sometime back in May of this year, when you do as you suggest:

e <- exons(txdb, ...)
isDisjoint(e) ## FALSE (<- should be TRUE)
r <- reduce(e) ## Collapses overlapping exons, now:
isDisjoint(r) ## TRUE

Using your call to exons, you do get one row per exon, as you say. I was just point out that this isn't the structure that they talk about in the paper, is all.

Updating this comment before a follow up reply: the reduce operation doesn't get you exactly there either, as it doesn't split overlapping exons, it just mashes them together -- I just used it to show an instance of a disjoint set of GRanges

ADD REPLY • link 12.6 years ago by Steve Lianoglou 5.2k

Login before adding your answer.