Question: What are SizeFactors
2405592M20 wrote (4 weeks ago):

Hi guys, I'm new to RNA-seq and R programming, so forgive my ignorance in advance! I'm currently using a programme/script to help me map tRNAs (the supplied notes don't explain it in much detail), and after tRNA counts are generated for all my conditions, it uses SizeFactors to normalize the dataset (in DESeq2). I've tried to read up on what exactly SizeFactors are, but I don't understand them. Could anyone give me an easy-to-understand definition of what size factors are and why they're used to normalize the data?

normalization rna-seq R • 163 views
modified 4 weeks ago • written 4 weeks ago by 2405592M20

Hi i.sudbery! Thank you so much for your response. That's the best explanation I've found so far! Cheers pal!

written 4 weeks ago by 2405592M20

2405592M : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 4 weeks ago by genomax62k
i.sudbery3.8k (Sheffield, UK) wrote (4 weeks ago):

A size factor relates to how many reads there are in each library. Imagine you had two samples in which 10% of the reads came from gene A, but 1M reads had been sequenced in sample 1 and 2M reads in sample 2. There would then be a two-fold increase in the counts from gene A in sample 2 compared to sample 1, even though the actual expression levels were the same.
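As a rough illustration of that arithmetic, here is a minimal Python sketch (not DESeq2 code; the function name and the depth factors are made up for the example). It only shows that identical composition at different sequencing depth produces a spurious fold change, which dividing by relative depth removes.

```python
# Toy illustration: same composition, different sequencing depth.
def expected_counts(total_reads, fraction):
    """Expected reads for a gene that makes up `fraction` of the library."""
    return round(total_reads * fraction)

fraction_gene_a = 0.10                       # gene A is 10% of reads in both samples
sample1 = expected_counts(1_000_000, fraction_gene_a)
sample2 = expected_counts(2_000_000, fraction_gene_a)

# Raw counts suggest a change that is purely due to depth.
raw_fold_change = sample2 / sample1

# Hypothetical depth factors: sample 2 was sequenced twice as deeply.
depth_factors = [1.0, 2.0]
scaled = [sample1 / depth_factors[0], sample2 / depth_factors[1]]
scaled_fold_change = scaled[1] / scaled[0]

print(raw_fold_change, scaled_fold_change)
```

The division here is only for illustration; as explained below, in the actual model the factor enters as an offset rather than being applied to the counts.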

Early RNAseq analysis divided counts by the total number of reads in each library, but this is poor practice for two reasons.

  1. Using division means that you lose the discrete nature of the gene counts, and thus negative binomial statistics no longer apply. Normalising factors are therefore used as offsets in the linear model, rather than as divisors.
  2. In most RNAseq samples the most highly expressed genes take up the majority of the reads. Thus in a 1M read sample, if 300k reads came from gene A (leaving 700k for all the others), and that gene's share doubled to 600k reads (leaving 400k for all the others), the expression of the other genes would appear to drop by almost half (700k to 400k), even though it has stayed the same.
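Effect 2 can be checked with the numbers from the example. The sketch below is plain Python with made-up gene names (B, C and D are hypothetical genes that split the remaining reads); it scales by total library size and shows that the unchanged genes still appear to drop.

```python
# Read counts in two 1M-read libraries. Only gene A truly changes;
# B, C and D are hypothetical genes sharing the remaining reads.
before = {"A": 300_000, "B": 350_000, "C": 210_000, "D": 140_000}
after = {"A": 600_000, "B": 200_000, "C": 120_000, "D": 80_000}

def counts_per_million(counts):
    """Scale counts by total library size (the 'early RNAseq' approach)."""
    total = sum(counts.values())
    return {gene: c / total * 1_000_000 for gene, c in counts.items()}

cpm_before = counts_per_million(before)
cpm_after = counts_per_million(after)

# Apparent fold change per gene after total-count normalisation.
apparent_fold_change = {g: cpm_after[g] / cpm_before[g] for g in before}

for gene, fc in apparent_fold_change.items():
    print(gene, round(fc, 3))
```

Both libraries have the same total, so total-count scaling changes nothing, and B, C and D each appear to fall to 4/7 of their previous level purely because gene A takes a larger share of the reads.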

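Thus sizeFactors are related to the library size (total number of reads in the library), but are calculated in such a way as to compensate for effect 2 above. One common method (upper-quartile normalisation) is to find the 75th centile of the distribution of read counts for each sample, and then calculate a normalisation factor such that the 75th centile is the same across all samples. DESeq2's default is a related approach, the median-of-ratios method: for each gene, take the geometric mean of its counts across all samples, divide each sample's counts by these per-gene means, and use the median of a sample's ratios as its size factor. Because a median is used, a handful of highly expressed genes that change a lot has little influence on the factor.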
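The 75th-centile approach described above can be sketched in a few lines of Python (this is not DESeq2's own R code). For comparison, the sketch also includes a median-of-ratios calculation, which is what the DESeq2 documentation for `estimateSizeFactors` describes as its default. The toy counts, the function names, and the choice to rescale the upper-quartile factors so that their geometric mean is 1 are assumptions made for the example.

```python
import math
import statistics

# Toy counts: gene -> [sample 1, sample 2]. Sample 2 is sequenced twice as
# deeply; gene A additionally doubles in true expression (4x in raw counts).
counts = {
    "A": [300, 1200],
    "B": [100, 200],
    "C": [200, 400],
    "D": [150, 300],
    "E": [250, 500],
}
n_samples = 2

def upper_quartile_factors(counts):
    """One factor per sample, proportional to the 75th centile of its non-zero counts."""
    quartiles = []
    for j in range(n_samples):
        column = [c[j] for c in counts.values() if c[j] > 0]
        quartiles.append(statistics.quantiles(column, n=4, method="inclusive")[2])
    # Assumed convention: rescale so the factors have geometric mean 1.
    log_mean = sum(math.log(q) for q in quartiles) / n_samples
    return [q / math.exp(log_mean) for q in quartiles]

def median_of_ratios_factors(counts):
    """Median, per sample, of each gene's count over its geometric mean across samples."""
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if min(gene_counts) <= 0:
            continue  # genes with a zero count have no usable geometric mean
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(statistics.median(r)) for r in log_ratios]

totals = [sum(c[j] for c in counts.values()) for j in range(n_samples)]
uq = upper_quartile_factors(counts)
mor = median_of_ratios_factors(counts)

print("total-count ratio:", totals[1] / totals[0])
print("upper-quartile ratio:", uq[1] / uq[0])
print("median-of-ratios ratio:", mor[1] / mor[0])
```

In this toy data the total read counts differ by more than the true two-fold depth difference because gene A inflates the second library, whereas both the upper-quartile and the median-of-ratios factors depend on the bulk of the genes and so are not pulled by gene A.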

written 4 weeks ago by i.sudbery3.8k