Working out differentially expressed genes
8.9 years ago

I have approximately 2500 lncRNAs and want to find out which of them are differentially expressed. I fetched the data for the lncRNAs from gene_exp.diff. Some of the FPKM values are 0 in both control and stress. I read in one paper that you should first adjust the FPKM values by adding 0.0001, then calculate the fold change, and call differentially expressed genes as follows:

upregulated: fold change >= 2 and p value <= 0.05

downregulated: fold change <= 0.5 and p value <= 0.05

Yet in another paper I read that you should first filter for FPKM >= 0.1 in any tissue.

After filtering, you add 0.0001 to the FPKM values and then call upregulated and downregulated genes as above.

My question: which approach should I use, and what is the difference between the two?
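The two procedures described above can be sketched side by side in Python. This is only an illustration under stated assumptions: the gene names and numbers are made up, `classify` and `min_fpkm` are hypothetical names, and the filter is read as "FPKM >= 0.1 in at least one condition".

```python
PSEUDOCOUNT = 0.0001  # the value quoted in the question


def fold_change(control, stress, pseudocount=PSEUDOCOUNT):
    """Fold change stress/control after adding a pseudocount to both FPKMs."""
    return (stress + pseudocount) / (control + pseudocount)


def classify(control, stress, p_value, min_fpkm=None):
    """Return 'up', 'down', 'unchanged', or 'filtered'.

    min_fpkm=None is the first procedure (no expression filter);
    min_fpkm=0.1 is the second (assumed: keep a gene only if it reaches
    the threshold in at least one condition).
    """
    if min_fpkm is not None and max(control, stress) < min_fpkm:
        return "filtered"
    fc = fold_change(control, stress)
    if p_value <= 0.05 and fc >= 2:
        return "up"
    if p_value <= 0.05 and fc <= 0.5:
        return "down"
    return "unchanged"


# Hypothetical rows: (gene, control FPKM, stress FPKM, p value)
rows = [
    ("lnc1", 0.0, 0.05, 0.01),
    ("lnc2", 5.0, 12.0, 0.01),
    ("lnc3", 8.0, 2.0, 0.03),
    ("lnc4", 3.0, 3.5, 0.40),
]

for gene, c, s, p in rows:
    print(gene, classify(c, s, p), classify(c, s, p, min_fpkm=0.1))
```

In this toy data the only gene the two procedures disagree on is lnc1: with a pseudocount of 0.0001, a change from 0 to 0.05 FPKM gives a fold change of about 500, so it is called upregulated without the filter and dropped with it.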

next-gen • 2.3k views

I can't really tell from the information you have given. I am guessing you used Cufflinks. However, FPKM, RPKM and similar measures should always be taken with a pinch of salt. You need to know which tools were used to align the reads, and how the counting was done. It would also help if you could post what samples you have, and what conditions you were testing (different tissues, time series, different treatments?).

The logic is that a few reads aligned to a gene don't really mean anything (it is the law of large numbers: better coverage/sequencing depth gives a better approximation of the 'real' expression).
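To put a rough number on that, here is a small Python sketch. It assumes read counts follow a Poisson distribution, which is a simplification (real RNA-seq counts are usually noisier than that); `relative_error` is a name chosen for the illustration.

```python
import math


def relative_error(expected_reads):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1.0 / math.sqrt(expected_reads)


for n in (4, 100, 10000):
    print(n, "reads ->", f"{relative_error(n):.0%}", "relative sampling error")
```

Under this assumption a gene covered by 4 reads has a sampling error of about 50% of its value, while one covered by 10000 reads is down to about 1%.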

Typically, differentially expressed genes are shown on an MA plot: the expression level vs. the fold change. If a gene is well expressed and changes a lot, it is a good candidate. Otherwise, you can't conclude anything.
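The two coordinates of an MA plot can be computed as below. This is a minimal sketch: the function name, the pseudocount of 1 and the example values are assumptions made for the illustration, and plotting is left out.

```python
import math


def ma_values(control, stress, pseudocount=1.0):
    """Return (M, A) for one gene.

    M = log2 fold change (stress vs control), the y axis.
    A = mean log2 expression over the two conditions, the x axis.
    The pseudocount (an assumed value) avoids log2(0).
    """
    x = math.log2(control + pseudocount)
    y = math.log2(stress + pseudocount)
    return y - x, (x + y) / 2


print(ma_values(3, 15))  # well expressed and changing: large A, large M
print(ma_values(0, 1))   # barely expressed: A close to 0
```

Genes with a small A sit at the left edge of the plot, where a large M can come from a handful of reads.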


Yes, Cufflinks has been used. The samples are 3 rice cultivars under the conditions control, desiccation and salinity. So how can I proceed in such a case?

8.9 years ago
cyril-cros ▴ 950

I mainly use R for this task. Download an annotation of the rice gene CDS in GTF/GFF format (though I guess you already have one). You can then use:

  • CummeRbund: the follow-up tool in the Cufflinks workflow. It turns your Cufflinks output into a database and allows powerful statistical analysis. See the tools section of the Cufflinks website.
  • The I-don't-trust-Cufflinks-a-lot route:

The idea is that you choose yourself how reads are counted and how the counts are normalized, using DESeq.
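To give an idea of what count normalization means here, below is a plain-Python sketch of the median-of-ratios idea that DESeq is known for. It is not DESeq's code or API: `size_factors` is a hypothetical helper and the count matrix is made up.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of per-gene lists, one raw read count per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts:
        if min(gene_counts) <= 0:
            continue
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 7],  # skipped when estimating the factors
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
print(factors)
print(normalized[0])
```

This relies on the assumption that most genes do not change between samples, so the median ratio reflects sequencing depth rather than biology.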

You can then use tools such as MultiExperiment Viewer (MeV) to cluster your most differentially expressed genes, and run a gene ontology/enrichment search (see Go Finder).

Those steps are not trivial; I suggest you find someone experienced in your lab to help you. As an anecdote, my teachers asked my class to find the top 10 most differentially expressed genes in a simple data set, and there were lots of differences between our answers. What matters most is that you understand the assumptions you make at each step (most genes have a stable expression across your conditions, you disregard reads which align to several places, etc.). Choosing to keep only genes with a minimal FPKM is one such assumption.

Also, do statistical tests and look at the p-values. You will always get a huge list of candidate genes, so you need to select only the most relevant ones. Use RT-qPCR for confirmation.
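One common way to tighten a long candidate list is to correct the p-values for multiple testing before applying a cutoff. The sketch below uses the Benjamini-Hochberg procedure, which is my choice for the illustration and not something named above; the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
kept = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(adjusted)
print(kept)
```

With these numbers three genes pass p <= 0.05 on the raw values, but only the first one is kept after adjustment.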


Yes, it varies a lot. Anyway, thank you for the suggestion, though there is no one here to help. Let me proceed. The problem is how to normalize the counts.
