I am toying with a new next-generation sequencing dataset in which each sequence is tagged according to the individual from which it was extracted. In this 454 experiment, we received about 1.8 million sequences in total. cDNA was the starting material for this experiment, so in each contig (or gene), the number of reads from an individual is correlated with the level of expression of that gene in that individual.
What normalization steps should be applied to the sequence counts per individual so that these measures can be used as a 'level of expression'?
The two that come to mind immediately are:
- Divide by the total number of sequences in each experimental group
- Divide by the number of sequences in each individual
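
To make the two options above concrete, here is a rough Python sketch of what I have in mind. The contig names, individual names, group labels, and counts are all made up for illustration (they are not from the real dataset), and the function names are my own.

```python
from collections import defaultdict

# Hypothetical toy data: reads per contig per individual (not real values).
counts = {
    "contig_1": {"ind_A": 30, "ind_B": 10, "ind_C": 20},
    "contig_2": {"ind_A": 70, "ind_B": 40, "ind_C": 30},
}
# Hypothetical assignment of individuals to experimental groups.
groups = {"ind_A": "control", "ind_B": "treated", "ind_C": "treated"}


def totals_per_individual(counts):
    """Sum the reads of each individual over all contigs."""
    totals = defaultdict(int)
    for per_ind in counts.values():
        for ind, n in per_ind.items():
            totals[ind] += n
    return dict(totals)


def normalize_by_individual(counts):
    """Divide each count by that individual's total number of reads."""
    totals = totals_per_individual(counts)
    return {
        contig: {ind: n / totals[ind] for ind, n in per_ind.items()}
        for contig, per_ind in counts.items()
    }


def normalize_by_group(counts, groups):
    """Divide each count by the total reads in the individual's group."""
    group_totals = defaultdict(int)
    for ind, total in totals_per_individual(counts).items():
        group_totals[groups[ind]] += total
    return {
        contig: {ind: n / group_totals[groups[ind]] for ind, n in per_ind.items()}
        for contig, per_ind in counts.items()
    }


by_individual = normalize_by_individual(counts)
by_group = normalize_by_group(counts, groups)
```

With the per-individual version, each individual's values sum to 1 across contigs, so they read as the fraction of that individual's reads falling in each contig. With the per-group version, the values sum to 1 across the whole group instead.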
What else do you think should be done?