Question: How do I normalize my RNA-seq data across different samples in different conditions?
23 months ago by
crzy_azn_sean wrote:

Hi, I am pretty new to this whole computational process. I have no experience with any R packages; I have only used commercial software (CLC Genomics Workbench). It would be awesome if you could give me detailed suggestions and links so I can practice.

Before I begin let me describe my samples.

Mapping results to reference genome

Method: cDNA library, 100 bp paired-end -> Illumina HiSeq -> raw data processed (TPM) with CLC Genomics Workbench

*GOAL:* I am trying to compare expression levels between samples: Treated vs. Not Treated (with replicates).

*Here are the questions I have:*

1. My treated sample had a small population when I isolated the total RNA for RNA-seq. I observed that the mapping % differs greatly between Treated and Non-treated (the non-treated control mapped around 17 times more than the treated sample). Is there any way to normalize for this difference? It's not like I can just multiply the treated samples by 17 to make up for the low mapping %, right?

2. When I normalize the raw data (FASTQ), which normalization process should be considered? I have used TPM to account for transcript length and sequencing depth, but I am new to RNA-seq and need some serious help. Any suggestions?

I don't have much experience with computational processing. I would be grateful if you could give me detailed suggestions or methods!

23 months ago by
Kevin Blighe wrote:

My advice would be to avoid FPKM and TPM. In addition, people should cease the use of Cufflinks (unless they are bound by some legacy data produced by Cufflinks) and move toward HISAT2 / StringTie, which are major upgrades of TopHat2 / Cufflinks.

One worry I have is that you allude to your lack of experience in R. However, the pipeline that I'm about to describe below is well documented and there is virtually an entire tutorial for you to follow (to which I link at the end).

Thus, a simpler workflow for you:


1, Quantify read count abundances directly from FASTQ

From your FASTQ files, quantify read count abundances per sample using Kallisto or Salmon. As your reference transcriptome (against which reads will be quantified), you can use the GENCODE reference FASTA files, either just the protein coding RNAs (~21,000 transcripts) or the 'comprehensive' reference FASTA (~200,000 transcripts and isoforms), which includes protein coding RNA, all known non-coding RNA, nonsense-mediated decay transcripts, and both processed and unprocessed pseudogenes.

These, and others including GTF files, are available from the GENCODE website.

2, Import the counts into R using tximport
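To give a feel for what this import step involves, here is a rough Python sketch of collapsing transcript-level estimated counts to gene level. It is only an illustration of the summing idea, not tximport itself (which is an R/Bioconductor package and also tracks abundances and transcript lengths). The function name `summarize_to_gene`, the transcript IDs, and the `tx2gene` mapping are all hypothetical.

```python
from collections import defaultdict

def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts.

    tx_counts: dict of transcript ID -> estimated count (may be fractional)
    tx2gene:   dict of transcript ID -> gene ID (hypothetical mapping)
    """
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)

# Toy values, invented for illustration only
tx_counts = {"tx1": 100.0, "tx2": 50.5, "tx3": 30.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarize_to_gene(tx_counts, tx2gene)
```

In the real workflow the transcript-to-gene table would typically be derived from the same annotation used to build the quantification index.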

3, Normalise and conduct differential expression analysis in DESeq2

DESeq2's normalisation method calculates a size factor for each sample: for every gene, the sample's count is divided by that gene's geometric mean across all samples in your dataset, and the median of these ratios is the sample's size factor. This deals very well with differences in library size and also with low count transcripts.
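To make the idea concrete, here is a minimal Python sketch of median-of-ratios size factors, the kind of scaling DESeq2 is generally described as using. It is not DESeq2's code; the function name `size_factors` and the toy counts are made up for illustration. The toy control library is about 17 times deeper than the treated one, and the last gene is assumed to be genuinely twice as high in the treated sample.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict of sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean of each gene across all samples
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median ratio of a sample's counts to the geometric means
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Toy raw counts, invented for illustration only
counts = {
    "control": [1700, 3400, 170, 0, 850],
    "treated": [100, 200, 10, 5, 100],
}

factors = size_factors(counts)
normalized = {s: [c / factors[s] for c in counts[s]] for s in counts}
```

Dividing each sample's raw counts by its size factor puts the first three genes on the same scale in both samples, while the last gene keeps its two-fold difference. This relies on the assumption that most genes are not differentially expressed, so it adjusts for sequencing depth rather than rescuing a poor library.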

The best tutorial that you could follow is this one by the developers of DESeq2, recently updated in time for Christmas (a few days ago): Analyzing RNA-seq data with DESeq2. In the tutorial, they allude to the use of Kallisto and tximport.


I'm sure that you'll have further queries, in which case ask them here or open a new question.

For further reading on RNA-seq normalisation, read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.



First of all, thanks for taking the time to respond in detail! I will definitely read the suggestions you have made. I'll get back to you if I have any further problems. Thanks again!

written 23 months ago by crzy_azn_sean
23 months ago by
Hussain Ather (National Institutes of Health, Bethesda, MD) wrote:

Read counts can be normalized as fragments per kilobase of transcript per million mapped reads (FPKM). You can use Cufflinks to compare samples and find differentially expressed genes or transcripts using FPKM.
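As a rough illustration of what the unit means, here is a small Python sketch of FPKM, with TPM (the unit mentioned in the question) alongside for comparison. This is not Cufflinks' implementation. The function names and toy numbers are invented, and the sum of the supplied counts stands in for the total number of mapped fragments.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts) / 1e6  # assumption: total mapped = sum of counts
    return [c / (l / 1e3) / total_millions for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to 1e6."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

# Toy values, invented for illustration only
counts = [500, 1000, 500]        # fragments assigned to each transcript
lengths_bp = [1000, 2000, 500]   # transcript lengths in base pairs

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

Both units rescale each sample by its own total, so TPM values always sum to one million within a sample regardless of how many reads were mapped.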


Does this method account for the low mapping % of the treated samples?

written 23 months ago by crzy_azn_sean