Question: Gene Expression Experiment Using NGS Data
3
9.6 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Hi,

I am toying with a new next-generation sequencing dataset in which each sequence is tagged according to the individual from which it was extracted. In this 454 experiment, we received about 1.8 million sequences in total. cDNA was the starting material, so in each contig (or gene) the number of reads from an individual is correlated with the level of expression of that gene in that individual.

What normalization steps should be applied to the per-individual sequence counts so that these measures can be used as a 'level of expression'?

The two that come to mind immediately are:

  1. Divide by the total number of sequences in each experimental group
  2. Divide by the number of sequences in each individual
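
The second option above (dividing by each individual's total) can be sketched in a few lines. This is only an illustration in Python, not a method from the thread: the function name `counts_per_million`, the scaling to reads per million, and the toy counts are all assumptions made for the example.

```python
def counts_per_million(counts):
    """Scale each individual's read counts by that individual's total reads.

    counts: dict mapping individual -> dict mapping contig -> raw read count.
    Returns the same structure with values expressed in reads per million.
    """
    normalized = {}
    for individual, per_contig in counts.items():
        total = sum(per_contig.values())
        if total == 0:
            # No reads for this individual: nothing to scale.
            normalized[individual] = {contig: 0.0 for contig in per_contig}
            continue
        normalized[individual] = {
            contig: n * 1_000_000 / total for contig, n in per_contig.items()
        }
    return normalized


# Hypothetical toy data: two individuals, three contigs.
raw = {
    "ind_A": {"contig1": 10, "contig2": 30, "contig3": 60},
    "ind_B": {"contig1": 40, "contig2": 120, "contig3": 240},
}
cpm = counts_per_million(raw)
```

In this toy case both individuals end up with identical normalized values, because ind_B simply has four times the sequencing depth of ind_A.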

What else do you think should be done?

Thanks!

gene data next-gen sequencing • 3.6k views
modified 8.8 years ago • written 9.6 years ago by Eric Normandeau10k

Is this a SAGE experiment (Serial analysis of gene expression)?

written 9.6 years ago by Paulo Nuin3.7k

sounds like DGE

written 9.6 years ago by Jeremy Leipzig18k

This is a 454 experiment. We are doing a few things with these data, including this exploration of gene expression differences between the study groups. I added a clarification about the NGS method to the question.

written 9.6 years ago by Eric Normandeau10k

The technique you are using is one thing; what your data represent is another. As Jeremy pointed out, it looks like DGE, which is very similar to SAGE. My answer is below.

written 9.6 years ago by Paulo Nuin3.7k
6
9.6 years ago by
Paulo Nuin3.7k
Canada
Paulo Nuin3.7k wrote:

I think your questions are very broad and there is no simple answer, especially because they involve more statistics than computing. SAGE/DGE data are analysed very differently from microarray data, and the straightforward methods used in microarray (MA) analysis sometimes cannot be applied here.

For this type of data, the best option I found was edgeR, an R/Bioconductor package. Be sure to read the docs and the extra information that comes with the package.

http://www.bioconductor.org/packages/bioc/html/edgeR.html

written 9.6 years ago by Paulo Nuin3.7k

Thank you @nuin. What were the other options that you surveyed? Why was edgeR the best? Cheers

written 9.6 years ago by Eric Normandeau10k

If I remember correctly, I tried another Bioconductor package that is no longer available (or no longer updated). edgeR is very simple to use, the manual is well written, and it gives you good results, including nice graphs.

written 9.6 years ago by Paulo Nuin3.7k

@nuin, the link seems to be broken. The following link seems to be the right one now: http://bioconductor.org/packages/2.6/bioc/html/edgeR.html

written 9.6 years ago by Eric Normandeau10k
2
9.6 years ago by
Istvan Albert ♦♦ 81k
University Park, USA
Istvan Albert ♦♦ 81k wrote:

Some options that come to mind:

  • use housekeeping genes - genes with stable and unchanged expression levels - to estimate variability
  • as in cross-slide microarray normalization methods, you may want to assume that the average expression level is the same for each individual
  • use spiked controls - maybe a little late for that

Definitely look for artifacts introduced by PCR amplification.
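
The housekeeping-gene suggestion could look roughly like the following Python sketch. It assumes you already have a trusted list of stably expressed genes; the function names, gene names and counts are hypothetical, and scaling to the mean housekeeping total is just one possible choice of reference.

```python
def housekeeping_scale_factors(counts, housekeeping):
    """Return one scale factor per individual from housekeeping-gene totals.

    counts: dict mapping individual -> dict mapping gene -> raw read count.
    housekeeping: iterable of gene names assumed to be stably expressed.
    """
    housekeeping = list(housekeeping)
    totals = {
        ind: sum(genes.get(g, 0) for g in housekeeping)
        for ind, genes in counts.items()
    }
    if any(t == 0 for t in totals.values()):
        raise ValueError("an individual has no reads on the housekeeping genes")
    # Reference level: mean housekeeping total across individuals (an assumption).
    reference = sum(totals.values()) / len(totals)
    return {ind: reference / t for ind, t in totals.items()}


def apply_scale(counts, factors):
    """Multiply every count of an individual by that individual's factor."""
    return {
        ind: {gene: n * factors[ind] for gene, n in genes.items()}
        for ind, genes in counts.items()
    }


# Hypothetical toy data: hk1 and hk2 play the role of housekeeping genes.
raw = {
    "ind_A": {"hk1": 50, "hk2": 50, "geneX": 20},
    "ind_B": {"hk1": 100, "hk2": 100, "geneX": 40},
}
factors = housekeeping_scale_factors(raw, ["hk1", "hk2"])
scaled = apply_scale(raw, factors)
```

Here geneX comes out equal in both individuals after scaling, since its raw difference is fully explained by the difference in housekeeping totals.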

written 9.6 years ago by Istvan Albert ♦♦ 81k

Nice suggestions. Points 1 and 2 will be done. It is too late, as you say, for spiked controls, but I will keep the idea in mind. How would you look for artifacts introduced by PCR amplification?

written 9.6 years ago by Eric Normandeau10k

Unusually high counts are one indication. Basically, look for neighbouring or overlapping regions that have wildly different read coverage.
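
One rough way to screen for such jumps is to compare mean coverage in adjacent windows along a contig. The sketch below is in Python; the window size, the fold threshold, the pseudocount and the toy coverage values are arbitrary assumptions, and a flagged jump is only a hint worth inspecting, not proof of a PCR artifact.

```python
def flag_coverage_jumps(coverage, window=50, fold=10.0, pseudocount=1.0):
    """Flag adjacent windows whose mean coverage differs by at least `fold`.

    coverage: list of per-base read depths along one contig.
    Returns a list of (left_window_start, right_window_start, ratio).
    """
    # Mean depth of each non-overlapping window.
    means = []
    for start in range(0, len(coverage), window):
        chunk = coverage[start:start + window]
        means.append((start, sum(chunk) / len(chunk)))

    flagged = []
    for (s1, m1), (s2, m2) in zip(means, means[1:]):
        high, low = max(m1, m2), min(m1, m2)
        # The pseudocount avoids division by zero in uncovered windows.
        ratio = (high + pseudocount) / (low + pseudocount)
        if ratio >= fold:
            flagged.append((s1, s2, ratio))
    return flagged


# Hypothetical contig: a 50 bp pile-up in the middle of modest coverage.
depth = [5] * 100 + [400] * 50 + [6] * 100
jumps = flag_coverage_jumps(depth)
```

With these toy values, the two window boundaries on either side of the pile-up are flagged and the rest of the contig is not.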

written 9.6 years ago by Istvan Albert ♦♦ 81k
2
9.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

I would not reinvent the wheel here since DGE has been around for 3 years or so. First do a literature search starting with Avi Mortazavi's articles.

written 9.6 years ago by Jeremy Leipzig18k

Yeah, I do recommend "Mapping and quantifying mammalian transcriptomes by RNA-Seq" in Nature Methods. It deals with this kind of data and with all the possible problems (segmental duplications, gene duplications, etc.). But, as the samples are from different subjects without known genomes, these problems will be considerably amplified. Carefully chosen reference sequences are the first priority.

written 9.6 years ago by Jarretinha3.3k

Thanks for the references guys.

written 9.6 years ago by Eric Normandeau10k

Thanks for the references both of you!

written 9.6 years ago by Eric Normandeau10k
2
9.6 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Following the readings suggested in your replies, I have stumbled upon a very recent R package called DESeq, which seems tailored to my application. Specifically, as they mention in the documentation, DESeq:

provides a powerful tool to estimate the variance in such data [RNA-seq and others] and test for differential expression.

Starting from a table of sequence counts (one line per gene, one column per sample, including proper treatment of replicates), it outputs (among many things) a list of p-values for the differential expression of genes between samples, taken pairwise. The documentation is fairly complete and easy to follow.
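
For orientation, the DESeq companion paper normalizes such a count table with per-sample "size factors": for each sample, the median over genes of the ratio between that gene's count and its geometric mean across samples. The Python sketch below illustrates the idea on a toy table; it is not DESeq's code, the function name and data are made up, and skipping genes with a zero count is a simplification assumed here.

```python
import math
from statistics import median


def size_factors(table):
    """Median-of-ratios size factors for a gene-by-sample count table.

    table: dict mapping gene -> list of raw counts, one per sample,
           with samples in the same order for every gene.
    Returns a list with one size factor per sample.
    """
    n_samples = len(next(iter(table.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in table.values():
        if any(c <= 0 for c in row):
            # The geometric mean is zero here, so the ratio is undefined.
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical table: sample 2 was sequenced twice as deeply as sample 1.
toy = {
    "gene1": [10, 20],
    "gene2": [100, 200],
    "gene3": [5, 10],
    "gene4": [0, 3],  # skipped because of the zero count
}
factors = size_factors(toy)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in toy.items()}
```

Dividing each column by its size factor puts the samples on a common scale; the variance estimation and testing that DESeq performs afterwards are not reproduced here.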

Just wanted to share! Here are the links to the DESeq package download and information pages and the 'companion paper':

DESeq package

DESeq information page

Companion paper

Cheers!

written 9.6 years ago by Eric Normandeau10k