Hi, I have analysed two metagenomes (whole-genome shotgun; 5 and 8 million reads respectively, 101 bp reads) for antibiotic resistance genes and want to compare the results. I have the raw resistance gene counts from each metagenome (e.g. 10 gotA, 20 fotG, etc.). Now I want to normalize the data sets so they can be compared.

I had planned to normalize by the number of 16S rRNA gene counts in each metagenome. So I extracted the 16S rRNA sequences from the metagenomes and assigned them to taxonomic classes (based on the SILVA database). Then I realised that to normalize the resistance gene counts for each metagenome, I need counts of full-length 16S rRNA genes. Unless I assemble the reads, the short 101 bp reads can only tell me how many reads belong to 16S sequences (e.g. for metagenome 1, 40000 of 5 million reads are 16S sequences, and for metagenome 2, 45000 of 8 million reads are 16S sequences). That means I can't normalize the gene counts by the 16S read counts directly, because multiple short 101 bp reads can come from the same full-length 16S gene (~1582 bp). For example, the 40000 16S reads could come from 10000 16S rRNA genes, but I don't have that full-length information. How do you people normalize? Do you consider gene length as well?

You may be able to improve the situation by looking at only the V4 region of 16S, which is much shorter; maybe 250bp, or a bit longer.  If your reads are paired, and the insert size is largely under 200bp, you can merge them into single longer reads (up to ~190bp for 2x101bp) with BBMerge, which will greatly increase specificity.

V4 is not as good as the full 16S, of course, but it is commonly used as a proxy when full-length information is unavailable.  For metagenomes, if you want full-length 16S, you have to go with PacBio rather than Illumina.  And for V4 on Illumina, it is MUCH better to go with longer read lengths of 2x150, 2x250, or ideally 2x300bp, which you can do on MiSeq.
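
On the gene-length question: one commonly used approach (not necessarily the only sensible one) is to divide each read count by the length of its reference gene, so that a long gene is not over-counted just because it collects more reads, and then take the ratio of resistance gene to 16S. Below is a minimal Python sketch of that arithmetic. It assumes 101 bp reads and a ~1582 bp 16S gene as in the question; the resistance gene lengths and the function names are made up for illustration, so substitute the reference lengths from whatever database you are using.

```python
READ_LEN = 101   # read length in bp, from the question
LEN_16S = 1582   # approximate full-length 16S rRNA gene in bp, from the question


def copy_equivalents(read_count, gene_len, read_len=READ_LEN):
    """Bases sequenced for a gene divided by its length.

    This is a rough 'how many full-length copies would these reads add up to'
    figure, not a count of distinct genes.
    """
    return read_count * read_len / gene_len


def args_per_16s(arg_reads, arg_lens, reads_16s, len_16s=LEN_16S, read_len=READ_LEN):
    """Length-normalized resistance gene abundance per 16S copy equivalent.

    The read length cancels in the ratio, so this equals
    (ARG reads / ARG length) / (16S reads / 16S length).
    """
    denom = copy_equivalents(reads_16s, len_16s, read_len)
    return {
        gene: copy_equivalents(count, arg_lens[gene], read_len) / denom
        for gene, count in arg_reads.items()
    }


# Read counts from the question; gene lengths are HYPOTHETICAL placeholders.
arg_reads = {"gotA": 10, "fotG": 20}
arg_lens = {"gotA": 1000, "fotG": 2000}

metagenome_1 = args_per_16s(arg_reads, arg_lens, reads_16s=40000)
for gene, value in metagenome_1.items():
    print(f"{gene}: {value:.3e} copies per 16S copy")
```

Run the same function on the second metagenome with its own counts (45000 16S reads) and compare the ratios rather than the raw counts. Note that with the placeholder lengths above, gotA and fotG come out equal even though fotG has twice the reads, which is exactly the length effect being corrected for.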

