Question

How to normalise read count per gene

0

Entering edit mode

8.3 years ago

ChIP ▴ 600

Hi,

I am curious to know, how can I normalise number of reads per gene/gene_id generated by the STAR aligner (ReadsPerGene file).

In case it is non stranded sequencing and my data looks like this (first few lines)

N_unmapped    663817    663817    663817
N_multimapping    1232697    1232697    1232697
N_noFeature    5408215    8176752    5438049
N_ambiguous    108068    315    106321
NM_214429    0    0    0
NM_214220    0    0    0
NM_001143697    51    0    51
NM_001164649    1    0    1
NM_001044603    2    0    2
NM_001244242    479    1    478
NM_001244241    286    1    285
NM_001177907    428    0    428
NM_001171752    79    0    79
NM_001123219    13    0    13
NM_001256148    198    2    196
NM_001001771    9068    17    9051
NM_001123181    132    0    132
NM_001258303    2678    7    2671
NM_001244322    282    0    282
NM_001317080    1445    3    1442
NM_001195372    0    0    0
NM_214313    3752    2    3750
NM_001253922    0    0    0
NM_001007195    717    2    715
NM_001244473    1930    1    1929
NM_001044590    0    0    0
NM_001001538    0    0    0
NM_001128489    0    0    0
NM_213885    125    0    125
NM_001114280    181    0    181

Thank you in advance

RNA-Seq • 8.9k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.3 years ago by ChIP ▴ 600

Ram · Answer 1 · 2016-01-05

2

Entering edit mode

8.3 years ago

EagleEye 7.5k

You can use this script for RPKM normalization, if you have length of each corresponding genes in the last column. The column names must start with '#' (tab separated)

#Gene Sample1 Sample2 Sample3 Length_of_the_gene
Gene1 10 20 40 1200
Gene2 23 45 60 800

https://github.com/santhilalsubhash/rpkm_rnaseq_count

If you prefer R. Use this code. Prepare your input file in the above mentioned format.

p <- read.table("input_table.txt",sep="\t",header=T)
len <- ncol(p)
rownames(p) <- p[,1]

for(i in 2:ncol(p)-1)
{
    d <-p[,i]
    l <- p[,len] #accessing the length column
    cS <- sum(as.numeric(p[,i])) #Total mapped reads per sample
    rpkm[[i]] <- (10^9)*(as.numeric(p[,i]))/(as.numeric(l)*cS)
    rpkm[[1]] <- p[[1]]
}

write.table(rpkm,"output_table_rpkm.txt",sep="\t",quote=F,row.names=F)

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by EagleEye 7.5k

0

Entering edit mode

Hi,

Thank you. This is strand specific RNA-Seq. Where column 1 is gene name, column 2 total number of reads, column 3 reads on strand 1 and column 4 reads on 2 strand.

Will your R script still work?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by ChIP ▴ 600

0

Entering edit mode

I am sure that this will work. Because even if it strand specific the gene will normalize in strand specific manner. you will get normalization values for each strand based on the libraries of corresponding strands.

ADD REPLY • link 8.3 years ago by EagleEye 7.5k

0

Entering edit mode

As stated by zhangchao3, use edgeR. It includes a function to normalize to RPKMs directly, rpkm(), so there's no need to reinvent the wheel. Having said that, it's much better to normalize to TPMs (transcripts per million); it's the only one that holds between samples (look for this in PubMed), as compared to RPKMs and RPKs.

ADD REPLY • link 8.3 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

I do not think there will be big difference in this case since all conditions will maintain the same gene length. Choosing the method is also depends on your goal. Whether you are looking for within sample variation (Ranking or finding enriched genes within samples) or you want to compare between different samples (differential expression between conditions). Here is the blog which clearly explains difference between RPKM, FPKM and TPM,

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

You can modify the above code to calculate RPKM,FPKM or TPM. There will be just slight difference.

Example: To make the code work for TPM

First normalize the count for 'count/gene_length' and then by normalized lib_size (sum(count/gene_length)/1,000,000) (opposite to RPKM). Finally you divide it by 1,000,000 = [(sum(count/gene_length)/1,000,000) / 1,000,000]

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by EagleEye 7.5k

Ram · Answer 2 · 2016-01-05

1

Entering edit mode

8.3 years ago

Czh3 ▴ 190

Try DESeq or edgeR.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Czh3 ▴ 190