Entering edit mode
5.5 years ago
alim.hcu
▴
20
Dear I am analysing a RNA sequencing data of rice samples. Sequenced were mapped using tophat and most of sample had more than 90% mapping rate (although there were 50% multimapped reads). When I used cufflink for quantification of genes, i found more than 90% genes showed fpkm value less than zero. So i am getting very less number of differentially expressed genes. What may be the reason behind this?
Are you sure its fpkm < 0? Could you put a screenshot of the cufflinks file?
Sorry..it was typo error. .less than one..
What are you samples specifically? If you're looking at a single tissue type, or a single cell type then you'll have a majority of genes turned off. Genes necessary in flowering won't be expressed if this is a leaf study, and most photosynthetic genes should be turned off in the roots. Intuitively I wouldn't be surprised if 50-70% of genes were turned off depending on the study design.
I'd also wonder about more specifics of your reads/libraries. What is your read depth? Can you generate a histogram of FPKM values? Better yet, if you generate a density plot does the distribution look bimodal? You can graph this in R with:
plot(density(log10(fpkmValues)))
Getting a better feel for the data will help you understand if this result is an error or if it actually makes biological sense.RNA was isolated from 10 day old seedling plant. We have average 50 million reads per sample.
Please show all commands that you have used. As a community full of volunteers, we do not have much time to be constantly asking new questions to extract information from you. You need to provide all information in your opening question.
Have you run a PCA analysis with your data from the samples? It can tell you if the control and the treated samples are distinct enough for further differential gene expression analysis. Alternatively, you can try running STAR or hisat2 alignment followed by DESEq2 analysis. Cufflinks normalizes the reads coming from the transcripts wrt the length, therefore long transcripts with not so many reads are often represented with lower fpkm, and therefore you might lose the information about how much of the actual number of reads are coming from the transcripts. DESeq2 works on actual number of reads without the normalization to length, hence you lose less information.
Thanks for your kind reply....
As you suggested that i should proceed with HIsat2. Here i got following output from hisat2. I think the problem is with multimapped reads. Due to 49% multimapped reads, only 35% reads aligned uniquely, so less number of reads for quantification (as htseq-count quantify only uniquely mapped reads). Hence genes showing less fpkm value.
hisat2 -q -p 10 -x Oryza_sativa -1 /home/jkt/Software/TrimGalore-0.5.0/S1_JD_R1_val_1.fq.gz -2 /home/jkt/Software/TrimGalore-0.5.0/S1_JD_R2_val_2.fq.gz -S JD1_Rice.sam
36772695 reads; of these: 36772695 (100.00%) were paired; of these: 5696023 (15.49%) aligned concordantly 0 times 12980109 (35.30%) aligned concordantly exactly 1 time 18096563 (49.21%) aligned concordantly >1 times
95.39% overall alignment rate
Cool - that seems like the problem (multi-mapped reads), as you imply.