Hi, I have analysed two metagenomes (whole-genome shotgun; 5 and 8 million 101 bp reads respectively) to look for antibiotic resistance genes, with the aim of comparing the results. I have the raw resistance gene read counts from each metagenome (e.g. 10 reads for gotA, 20 for fotG, etc.). Now I want to normalize the two data sets so I can compare them.
My plan was to normalize by the number of 16S rRNA gene counts in each metagenome. So I extracted the 16S rRNA sequences from the metagenomes and assigned them to taxonomic classes (based on the SILVA database). Then I realised that to normalise the resistance gene counts I would need full-length 16S rRNA gene counts, but without assembling the reads, the short 101 bp reads only tell me how many reads match 16S sequences (e.g. for metagenome 1, 40,000 of the 5 million reads are 16S; for metagenome 2, 45,000 of the 8 million reads are 16S).

That means I can't simply normalise the gene counts by the 16S read counts, because multiple short 101 bp reads can come from the same full-length 16S gene (~1582 bp). In other words, the 40,000 16S reads could correspond to only 10,000 actual 16S rRNA genes, and I don't have that full-length information. How do you people normalize in this situation? Do you take gene length into account as well?
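For what it's worth, here is the kind of length-corrected normalisation I was considering, as a minimal Python sketch: convert read counts to approximate gene "copies" as reads × read_length / gene_length, and then express each resistance gene as copies per 16S copy. The 900 bp resistance gene length and the 10-read count below are made-up placeholders, not my real data.

```python
# Sketch of length-normalised abundance: estimate gene copies from short reads,
# then express resistance genes as copies per 16S rRNA gene copy.
# Placeholder numbers only, not my actual results.

READ_LEN = 101     # bp, read length in both metagenomes
LEN_16S = 1582     # bp, approximate full-length 16S rRNA gene

def gene_copies(read_count, gene_len, read_len=READ_LEN):
    """Estimate gene copies as total sequenced bases hitting the gene
    divided by the gene length."""
    return read_count * read_len / gene_len

# metagenome 1: 40,000 reads classified as 16S -> ~2,554 16S-gene equivalents
copies_16s_m1 = gene_copies(40_000, LEN_16S)

# hypothetical resistance gene: 10 reads mapped to a ~900 bp gene
copies_arg_m1 = gene_copies(10, 900)

# abundance expressed as resistance-gene copies per 16S copy
norm_m1 = copies_arg_m1 / copies_16s_m1
print(f"{norm_m1:.5f} copies of the resistance gene per 16S copy")
```

Is this the kind of correction people actually use, or is there a more standard way to do it?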