Hi,
I am working with a simple NGS data set of 19 samples, each with a different number of reads; for example, the smallest library has 114793 reads and the largest has 242798 reads. Each library contains 53 amplicons (2 genes).
Do I have to normalize those reads? Can I use RPKM/FPKM? If yes, how can I do that?
The next step of my analysis will include calculating some stats, like coverage per amplicon for every sample, and I think the results won’t be correct without normalizing the reads first. Am I right?
I would appreciate any help.
Best regards,
Agata
We need to know the biological question you're trying to answer to provide useful feedback.
What do you mean by "biological question"? Is my post unclear?
Yes, I am dealing with DNA not RNA.
I would like to compute more complex stats, for example listing the amplicons covered at least 50X in all 19 samples.
For example: one amplicon in sample 1 has a mean coverage of 30X, and the total read count for this sample is 30 000; in sample 2, the same amplicon has 50X coverage with a total read count of 40 000. I think I cannot compare coverage between those two, right? But if I do some normalization so that the total read count is the same for all samples, the comparison will be OK.
But I don't have any idea how to do that.
Hope I made it clear.
Again, whether you need to normalize or not depends on the conclusions that you want to draw from these sorts of summary statistics. There's typically no need to normalize for the total number of reads per sample when calculating coverage, unless you need to do some differential comparisons (and given that this is amplicon data, it's highly questionable whether any sort of differential coverage comparison would even be meaningful).
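For reference, if you did want to put two samples on the same scale, a minimal sketch of simple linear scaling to a common total read count could look like the following. The function name `scale_coverage` and the choice of the largest library as the target are my own assumptions for illustration, not a standard method; the numbers are the ones from the example in the question.

```python
def scale_coverage(mean_coverage, total_reads, target_reads):
    """Rescale a coverage value to a common library size (simple linear scaling)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mean_coverage * target_reads / total_reads


# Numbers taken from the example in the question: (mean coverage, total reads).
samples = {
    "sample1": (30.0, 30_000),
    "sample2": (50.0, 40_000),
}

# Assumption: scale everything to the largest library in the set.
target_reads = max(total for _, total in samples.values())

scaled = {
    name: scale_coverage(cov, total, target_reads)
    for name, (cov, total) in samples.items()
}

for name, value in scaled.items():
    print(f"{name}: {value:.1f}X at {target_reads} reads")
```

With these numbers sample 1 goes from 30X to 40X and sample 2 stays at 50X, so the two samples still differ after scaling. Note that a scaled value like this is only a relative measure; a depth threshold such as 50X should still be checked against the raw coverage, since that is the depth actually available at the amplicon.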
Exactly. Unless you are trying to, say, detect copy number variation or something similar, you won't need to normalize. And I would strongly encourage you NOT to attempt anything like that with amplicon data. Coverage stats and simple things like that are essentially qualitative measures of your data, used for summarizing how well your experiment went and for informing you of any potential gaps in your sequencing (for instance, to see whether you may have false negatives in a given region when looking for variants because of insufficient coverage depth).
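As a rough illustration of that kind of qualitative check, here is a minimal sketch that lists the amplicons covered at least 50X in every sample, using raw (unnormalized) mean coverage. The data layout (a dict of sample name to per-amplicon mean coverage), the function names, and the toy values are all hypothetical; in practice the table would come from whatever tool you use to compute per-amplicon depth.

```python
def amplicons_passing_in_all(coverage, min_depth=50):
    """Return amplicons whose mean coverage is >= min_depth in every sample."""
    per_sample = [
        {amp for amp, depth in amps.items() if depth >= min_depth}
        for amps in coverage.values()
    ]
    # An amplicon absent from a sample's table counts as failing in that sample.
    return sorted(set.intersection(*per_sample)) if per_sample else []


def low_coverage_report(coverage, min_depth=50):
    """Map each failing amplicon to the samples where it falls below min_depth."""
    report = {}
    for sample, amps in coverage.items():
        for amp, depth in amps.items():
            if depth < min_depth:
                report.setdefault(amp, []).append(sample)
    return report


# Toy data (hypothetical): mean coverage per amplicon for three samples.
coverage = {
    "sample1": {"amp01": 30, "amp02": 120, "amp03": 75},
    "sample2": {"amp01": 50, "amp02": 95, "amp03": 60},
    "sample3": {"amp01": 80, "amp02": 51, "amp03": 49},
}

passing = amplicons_passing_in_all(coverage, min_depth=50)
gaps = low_coverage_report(coverage, min_depth=50)

print("Covered >= 50X in all samples:", passing)
print("Below 50X somewhere:", gaps)
```

In this toy table only `amp02` passes in all samples; `amp01` falls short in `sample1` and `amp03` in `sample3`, which are the places where a variant could be missed.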