Hi, I have run a Hiseq 2x150bp from metagenomics samples.
I have computed the gene coverage for a list of genes for a given sample. For each sample i take a subset of genes i´m interested in and mapped all reads to each of these genes generating a bam file that I use as input to bedtools as follows
bedtools genomecov -ibam subsetgenes_sample_A.bam bedtools genomecov -ibam subsetgenes_sample_B.bam bedtools genomecov -ibam subsetgenes_sample_C.bam ....
Here below the output for one sample. I have added column names for clarity.
V1 V2 V3 V4 V5 k141_25_1 1 64 165 0.387879 k141_25_1 2 100 165 0.606061 k141_25_1 3 1 165 0.00606061 k141_34_1 2 11 726 0.0151515 k141_34_1 3 55 726 0.0757576 k141_34_1 4 36 726 0.0495868 k141_34_1 5 101 726 0.139118 k141_34_1 6 123 726 0.169421 k141_34_1 7 195 726 0.268595
In order to compute gene coverage i run
coverage = (V2*V3)/V4.
The problem i have is that i have several samples so ideally i should normalize each sample by the total number of reads and multiply by 1M.
Would it make sense to normalize as follows
normalized_genes = coverage(from above formula)/totalReadspersample * 1000000
Thanks for comments.