The purpose of my research, is to look at the gene abundance in a metagenomic sample (DNAseq). So it is not an expression analysis, since I am looking at genomic DNA. Furthermore, I am looking at different metagenomic samples, taken from different environments (specifically looking at bacteria). I want to compare the gene abundance between the samples (but also the genes to each other, within the sample). Hence, I need a proper way to normalize my genes. I expect a lot of the genes to be similar, but the abundance to differ (due to reduced/increased amount of bacteria in the sample). My data is an alignment of the RAW fastq-reads mapped to my genes. I used featurecounts to count fragments (I have paired-end data) instead of reads.
My concern is that the genes will differ in length, so the number of fragments aligned to each gene, won't tell me much about its abundance (a very long gene, will have lots of fragments mapped, but that doesn't necessarily mean that it has a high abundance).
My count matrix is of the format:
GENE_ID - GENE_FUNCTION - FRAGMENTS_MAPPED - LENGTH_OF_GENE
My first thought was to apply the following:
(number of fragments mapped) x (average fragment length) / (length of gene)
However, that was just a thought. I believe there must be more accurate ways? I have heard about MUSiCC and DESeq2, however, are those tools suitable for my kind of reseach?
Thanks a lot in advance!