Question

Truncate RNASeq data to get similar abundance

0

7.7 years ago

camelbbs ▴ 710

Hi,

We ran RNA sequencing on 6 samples, and the total number of reads differs a lot between them. For example:

sampleA1  sampleA2  sampleA3  sampleB1  sampleB2  sampleB3
150000    160000    180000    250000    260000    250000

Do we need to truncate every sample to the same number of reads before further analysis? For example:

sampleA1  sampleA2  sampleA3  sampleB1  sampleB2  sampleB3
150000    150000    150000    150000    150000    150000

This is the sum of RPKM for each sample:

sample      s-655605     s-664561     s-665905     ZC1          ZC2          ZC3
total_rpkm  2336029.676  1846496.591  2262622.929  554911.8613  774240.5722  636009.5591

The sums are very different between samples: in the S group they are roughly 3 to 4 times those in the ZC group. Does anyone know the reason? We used total RNA and rRNA-depleted libraries.

Thanks. Cam

rnaseq • 1.5k views

7.7 years ago by camelbbs ▴ 710

2

You should not correct them manually. If you are doing differential expression (DE) analysis, the DE tools take care of this themselves, in a step called normalisation.

Why is the total number of reads so low?

7.7 years ago by GouthamAtla 12k

0

Thanks, the numbers I wrote were just an example. I do know about normalization methods such as DESeq. I am asking because we found that the sum of RPKM is very different in each sample. We speculate that the reason is that the amount of sequencing differs between samples.

7.7 years ago by camelbbs ▴ 710

0

Two samples sequenced independently will never have the same depth. That's why we have to normalise for library size (total reads sequenced). But RPKM already normalises for sequencing depth.
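As a rough illustration of library-size normalisation, here is a minimal Python sketch that scales raw counts to reads per million. The library sizes are the example totals from the question; the gene counts and the function name `counts_per_million` are made up for illustration and do not come from any package.

```python
# Total reads per sample, taken from the example table in the question.
library_sizes = {
    "sampleA1": 150_000, "sampleA2": 160_000, "sampleA3": 180_000,
    "sampleB1": 250_000, "sampleB2": 260_000, "sampleB3": 250_000,
}


def counts_per_million(count, library_size):
    """Scale a raw read count by the sample's library size (reads per million)."""
    return count * 1_000_000 / library_size


# Hypothetical gene that always receives 1 read in every 1000 sequenced,
# so its raw count simply follows sequencing depth.
raw_counts = {name: size // 1000 for name, size in library_sizes.items()}

cpm = {
    name: counts_per_million(raw_counts[name], library_sizes[name])
    for name in library_sizes
}

for name in library_sizes:
    print(name, raw_counts[name], cpm[name])
```

The raw counts differ between samples only because of depth, and the scaled values come out the same, which is why there is no need to throw reads away.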

7.7 years ago by GouthamAtla 12k

0

I modified my question. Could you take another look? Thanks.

7.7 years ago by camelbbs ▴ 710

0

First question: No, do not alter your samples by removing reads. Abundances are not measured linearly. For example, highly expressed genes tend to be sequenced more than their true abundance would suggest, and lowly expressed genes less. My advice is to use normalized values. Do not use RPKM (many detailed publications explain why); try specialised packages like DESeq or edgeR, which will better handle your samples and their differences.
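To give an idea of what such packages do instead of discarding reads, here is a simplified Python sketch of the median-of-ratios idea used by DESeq to estimate one scaling factor per sample. It is not the package's code: the function name `size_factors` and the toy counts are assumptions for illustration, and the real tools handle many more details.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per sample by median of ratios.

    counts: dict mapping sample name -> list of raw counts,
            with genes in the same order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero count in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # For each sample, take the median ratio of its counts to the gene-wise means.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy data (hypothetical): sample B is sequenced twice as deep as sample A.
toy = {"A": [10, 20, 30, 0], "B": [20, 40, 60, 5]}
sf = size_factors(toy)
print(sf)
```

Dividing each sample's counts by its factor puts the samples on a common scale while keeping every read.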

Second question: RPKM divides the number of reads mapped to a gene by the length of that gene (in kb), and by the total number of mapped reads (in millions). For example, if a sample has many short genes (under 1 kb) that are expressed, their read counts are divided by a length <1, which inflates their RPKM and may explain the differences.
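To make the RPKM point concrete, here is a small Python sketch using the usual definition (reads on a gene, divided by gene length in kb and by total mapped reads in millions). The two samples, the gene lengths and the read counts are invented for illustration; they are not the data from the question.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Two hypothetical genes: one short (500 bp), one long (5000 bp).
gene_lengths = {"short_gene": 500, "long_gene": 5000}

# Two hypothetical samples with the SAME library size (1,000,000 reads),
# but with the reads distributed differently between the two genes.
samples = {
    "X": {"short_gene": 100_000, "long_gene": 900_000},
    "Y": {"short_gene": 900_000, "long_gene": 100_000},
}

sum_rpkm = {}
for name, counts in samples.items():
    total = sum(counts.values())
    sum_rpkm[name] = sum(
        rpkm(counts[gene], gene_lengths[gene], total) for gene in counts
    )

print(sum_rpkm)
```

Even with identical sequencing depth, the sample whose reads fall mostly on the short gene ends up with a much larger RPKM sum, so a difference in total RPKM between groups can reflect which genes are expressed, not how deeply each sample was sequenced.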

7.7 years ago by Nibua ▴ 70