Gene Expression Experiment Using NGS Data
14.6 years ago

Hi,

I am toying with a new Next Gen Sequencing dataset in which each sequence is tagged according to the individual from which it was extracted. In this 454 experiment, we received about 1.8 million sequences in total. cDNA was the starting material for this experiment so, in each contig (or gene), the number of reads from an individual is correlated to the level of expression of that gene in that individual.

What normalization steps should be applied to the sequence counts per individual so that they can be used as a 'level of expression'?

The two that come to mind immediately are:

  1. Divide by the total number of sequences in each experimental group
  2. Divide by the number of sequences in each individual
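Option 2 above (dividing by the number of sequences in each individual) could be sketched roughly as follows. This is only an illustration: the nested-dict layout, the function name `counts_per_million`, and the scaling to "per million reads" are assumptions made for the example, not something from the dataset itself.

```python
def counts_per_million(counts):
    """Scale raw read counts by each individual's total number of reads.

    counts: {individual: {contig: raw_read_count}}  (hypothetical layout)
    Returns the same structure with counts expressed per million reads.
    """
    normalized = {}
    for individual, per_contig in counts.items():
        library_size = sum(per_contig.values())
        if library_size == 0:
            raise ValueError(f"no reads for individual {individual!r}")
        normalized[individual] = {
            contig: n * 1e6 / library_size
            for contig, n in per_contig.items()
        }
    return normalized


# Toy data: ind_B was sequenced ten times deeper than ind_A,
# but the relative expression of the two contigs is identical.
raw = {
    "ind_A": {"contig1": 30, "contig2": 70},
    "ind_B": {"contig1": 300, "contig2": 700},
}
cpm = counts_per_million(raw)
```

After scaling, `contig1` has the same value in both individuals, which is the point of dividing by the per-individual total. Option 1 (per experimental group) would work the same way after summing the reads of all individuals in a group.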

What else do you think should be done?

Thanks!


Is this a SAGE experiment (Serial analysis of gene expression)?


sounds like DGE


This is a 454 experiment. We are doing a few things with these data, including this exploration of gene expression differences between the study groups. I edited the question to specify the NGS method.


One thing is the technique you're using, the other thing is what your data represents. As Jeremy pointed out it looks like DGE, which is very similar to SAGE. My answer is below.

14.6 years ago
Paulo Nuin ★ 3.7k

I think your questions are very broad and there's no simple answer, especially because they involve a lot of statistics and less of a computational approach. SAGE/DGE data are analyzed very differently from microarray data, and some of the straightforward methods used in microarray analysis cannot be applied here.

For this type of data, the best option that I found was edgeR, an R/Bioconductor package. Be sure to read the docs and the extra information that comes with the package.

http://www.bioconductor.org/packages/bioc/html/edgeR.html


Thank you @nuin. What were the other options that you surveyed? Why was edgeR the best? Cheers


If I'm not wrong I tried another Bioconductor package that is not available anymore (or not updated). edgeR is very simple to use, the manual is well written and it gives you good results, including nice graphs.


@nuin, the link seems to be broken. The following link seems to be the right one now: http://bioconductor.org/packages/2.6/bioc/html/edgeR.html

14.6 years ago

Some options that come to mind:

  • use housekeeping genes - genes with stable and unchanged expression levels - to estimate variability
  • as in cross-slide microarray normalization methods, you may want to assume that the average expression level is the same for each individual
  • use spiked controls - maybe a little late for that

Definitely look for artifacts introduced by PCR amplification.
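The first two options in the list above could be sketched like this. It is a minimal illustration under stated assumptions: the data layout, the function names `scale_by_housekeeping` and `scale_to_common_mean`, and the contig names are all hypothetical, and which contigs count as housekeeping genes has to be decided separately.

```python
from statistics import mean


def scale_by_housekeeping(counts, housekeeping):
    """Divide each individual's counts by its mean count over the
    contigs assumed to be housekeeping genes."""
    scaled = {}
    for individual, per_contig in counts.items():
        reference = mean(per_contig[g] for g in housekeeping)
        if reference == 0:
            raise ValueError(f"no housekeeping reads for {individual!r}")
        scaled[individual] = {c: n / reference for c, n in per_contig.items()}
    return scaled


def scale_to_common_mean(counts):
    """Rescale so that every individual ends up with the same mean count
    (the mean of the per-individual means)."""
    means = {ind: mean(pc.values()) for ind, pc in counts.items()}
    if min(means.values()) == 0:
        raise ValueError("at least one individual has no reads")
    grand_mean = mean(means.values())
    return {
        ind: {c: n * grand_mean / means[ind] for c, n in pc.items()}
        for ind, pc in counts.items()
    }


# Toy data: hk1 and hk2 are doubled in ind_B (deeper sequencing),
# while geneX has the same raw count in both individuals.
raw = {
    "ind_A": {"hk1": 10, "hk2": 30, "geneX": 40},
    "ind_B": {"hk1": 20, "hk2": 60, "geneX": 40},
}
by_hk = scale_by_housekeeping(raw, ["hk1", "hk2"])
by_mean = scale_to_common_mean(raw)
```

With the housekeeping scaling, `geneX` comes out twice as high in `ind_A` as in `ind_B`, because the same raw count was obtained from half the sequencing depth.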


Nice suggestions. Points 1 and 2 will be done. Too late, as you say, for spiked controls, but I keep the idea. How would you look for artifacts introduced by PCR amplification?


Unusually high counts are one indication. Basically, look for neighbouring or overlapping regions that have wildly different read coverage.
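A rough sketch of that check, assuming coverage has already been summarized as read counts in consecutive windows along a contig. The function name `flag_coverage_jumps`, the 10-fold threshold, and the pseudocount are arbitrary choices for illustration and would need tuning on real data.

```python
def flag_coverage_jumps(window_coverage, max_fold=10.0, pseudocount=1.0):
    """Return (left_index, right_index, fold_change) for every pair of
    adjacent windows whose coverage differs by more than max_fold.

    The pseudocount avoids division by zero for empty windows.
    """
    flagged = []
    for i in range(len(window_coverage) - 1):
        left = window_coverage[i] + pseudocount
        right = window_coverage[i + 1] + pseudocount
        fold = max(left, right) / min(left, right)
        if fold > max_fold:
            flagged.append((i, i + 1, fold))
    return flagged


# Toy coverage profile: window 3 is far deeper than its neighbours.
coverage = [12, 15, 14, 400, 13, 11]
jumps = flag_coverage_jumps(coverage)
```

Here the spike in window 3 is flagged against both of its neighbours. A flagged region is only a candidate for closer inspection; a genuinely highly expressed gene next to a weakly expressed one would look the same.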

14.6 years ago

I would not reinvent the wheel here since DGE has been around for 3 years or so. First do a literature search starting with Avi Mortazavi's articles.


Yeah, I do recommend "Mapping and quantifying mammalian transcriptomes by RNA-Seq" in Nature Methods. It deals with this kind of data and with its possible problems (segmental duplications, gene duplications, etc.). But, as the samples are from different subjects without known genomes, these problems will be considerably amplified. Carefully chosen reference sequences are priority one.
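For reference, the expression measure used in that paper is RPKM (reads per kilobase of exon model per million mapped reads). A minimal sketch follows; using the contig length in place of an exon model is an assumption made here for a dataset without a reference genome, and the function name is hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads.

    read_count:         reads assigned to the contig in one individual
    length_bp:          contig length in base pairs (stand-in for the
                        exon model length, an assumption of this sketch)
    total_mapped_reads: all mapped reads for that individual
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# 200 reads on a 2 kb contig, out of one million mapped reads.
value = rpkm(200, 2000, 1_000_000)
```

Compared with dividing by the number of sequences alone, this also corrects for the fact that longer transcripts collect more reads at the same expression level.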


Thanks for the references guys.


Thanks for the references both of you!

14.6 years ago

Following suggested readings in your replies, I have stumbled upon a very recent R package, called DESeq, which seems tailored for my application. Specifically, as they mention in the documentation, DESeq:

provides a powerful tool to estimate the variance in such data [RNA-seq and others] and test for differential expression.

Starting from a table of sequence counts (one line per gene, one column per sample, including proper treatment of replicates), it outputs (among many things) a list of p-values for the differential expression of genes between samples, compared pairwise. The documentation is fairly complete and easy to follow.
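To give an idea of the input and of the normalization step, here is a small sketch of such a count table and of a median-of-ratios size factor per sample, which is how I understand the normalization described in the companion paper. It is a plain re-implementation for illustration, not the package's code; the function name `size_factors` and the toy gene and sample names are made up, and the variance estimation and testing are left to the package.

```python
import math
from statistics import median


def size_factors(table, samples):
    """Median-of-ratios size factors.

    table: {gene: {sample: raw_count}}, one entry per gene,
           one count per sample (hypothetical layout).
    For each gene, take the ratio of each sample's count to the
    geometric mean across samples; a sample's size factor is the
    median of its ratios. Genes with a zero count are skipped.
    """
    log_ratios = {s: [] for s in samples}
    for row in table.values():
        counts = [row[s] for s in samples]
        if min(counts) <= 0:
            continue  # geometric mean would be zero
        log_geo_mean = sum(math.log(c) for c in counts) / len(counts)
        for s in samples:
            log_ratios[s].append(math.log(row[s]) - log_geo_mean)
    if not all(log_ratios.values()):
        raise ValueError("no gene has non-zero counts in every sample")
    return {s: math.exp(median(v)) for s, v in log_ratios.items()}


# Toy table: sample s2 was sequenced twice as deeply as s1.
table = {
    "gene1": {"s1": 10, "s2": 20},
    "gene2": {"s1": 100, "s2": 200},
    "gene3": {"s1": 5, "s2": 10},
    "gene4": {"s1": 0, "s2": 3},
}
factors = size_factors(table, ["s1", "s2"])
normalized = {
    gene: {s: row[s] / factors[s] for s in row}
    for gene, row in table.items()
}
```

Dividing each column by its size factor puts the samples on a common scale; in the toy table the factor for `s2` is twice that for `s1`, so `gene1` to `gene3` end up with equal normalized counts in both samples.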

Just wanted to share! Here are the links to the DESeq package download and information pages and the 'companion paper':

Cheers!
