Entering edit mode
                    9.8 years ago
        ChIP
        
    
        ▴
    
    600
    Hi,
I am curious to know, how can I normalise number of reads per gene/gene_id generated by the STAR aligner (ReadsPerGene file).
In case it is non stranded sequencing and my data looks like this (first few lines)
N_unmapped    663817    663817    663817
N_multimapping    1232697    1232697    1232697
N_noFeature    5408215    8176752    5438049
N_ambiguous    108068    315    106321
NM_214429    0    0    0
NM_214220    0    0    0
NM_001143697    51    0    51
NM_001164649    1    0    1
NM_001044603    2    0    2
NM_001244242    479    1    478
NM_001244241    286    1    285
NM_001177907    428    0    428
NM_001171752    79    0    79
NM_001123219    13    0    13
NM_001256148    198    2    196
NM_001001771    9068    17    9051
NM_001123181    132    0    132
NM_001258303    2678    7    2671
NM_001244322    282    0    282
NM_001317080    1445    3    1442
NM_001195372    0    0    0
NM_214313    3752    2    3750
NM_001253922    0    0    0
NM_001007195    717    2    715
NM_001244473    1930    1    1929
NM_001044590    0    0    0
NM_001001538    0    0    0
NM_001128489    0    0    0
NM_213885    125    0    125
NM_001114280    181    0    181
Thank you in advance
Hi,
Thank you. This is strand specific RNA-Seq. Where column 1 is gene name, column 2 total number of reads, column 3 reads on strand 1 and column 4 reads on 2 strand.
Will your R script still work?
I am sure that this will work. Because even if it strand specific the gene will normalize in strand specific manner. you will get normalization values for each strand based on the libraries of corresponding strands.
As stated by zhangchao3, use edgeR. It includes a function to normalize to RPKMs directly, rpkm(), so there's no need to reinvent the wheel. Having said that, it's much better to normalize to TPMs (transcripts per million); it's the only one that holds between samples (look for this in PubMed), as compared to RPKMs and RPKs.
I do not think there will be big difference in this case since all conditions will maintain the same gene length. Choosing the method is also depends on your goal. Whether you are looking for within sample variation (Ranking or finding enriched genes within samples) or you want to compare between different samples (differential expression between conditions). Here is the blog which clearly explains difference between RPKM, FPKM and TPM,
http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
You can modify the above code to calculate RPKM,FPKM or TPM. There will be just slight difference.
Example: To make the code work for TPM
First normalize the count for 'count/gene_length' and then by normalized lib_size
(sum(count/gene_length)/1,000,000)(opposite to RPKM). Finally you divide it by 1,000,000 =[(sum(count/gene_length)/1,000,000) / 1,000,000]