Question: Accounting for differences in FASTQ-file size, when comparing metagenomic gene abundance between samples
8 months ago, Hansen_869 wrote:

I have 8 count matrices (from bacterial metagenomic DNA sequencing) containing fragment counts, i.e. the number of fragments aligned to each gene. I obtained fragment counts, as opposed to read counts, using featureCounts. I have normalized for gene length (longer genes attract more reads) by dividing each fragment count by the gene length. However, because the FASTQ files differ in size, I wonder if I should normalize for that too somehow? My guess is that the bigger FASTQ files will map more reads to the contigs, inflating their counts relative to the samples with smaller FASTQ files. My final goal is to compare gene abundances BETWEEN the 8 samples, so relative numbers are fine.

All 8 samples were sequenced in the same way and come from the same environment, but at different time points. The FASTQ files still vary in size by a couple of hundred MB.

Tags: tpm, metagenomics, reads, gene
8 months ago, tshtatland (United States) wrote:

I suggest downsampling all FASTQ files to the same number of reads prior to the analysis. This method is commonly used in many other applications, such as RNA-seq and variant calling.
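A minimal sketch of such downsampling in Python, assuming uncompressed FASTQ with exactly 4 lines per record. The function names `read_fastq` and `downsample` are made up for illustration. It uses reservoir sampling, so each file is read once and the total read count need not be known in advance. Dedicated tools such as `seqtk sample` do the same job and are likely faster on real data.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield record


def downsample(records, n, seed=0):
    """Keep n records chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace a kept record with probability n / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < n:
                reservoir[j] = record
    return reservoir
```

For paired-end data, calling `downsample` on the R1 and R2 files with the same `n` and the same `seed` keeps mates together, assuming both files hold the same number of records in the same order. Choose `n` no larger than the read count of the smallest sample.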


Thanks for your response. I will look into that. Do you suggest I do any other form of normalisation? I read about TPM, RPKM and FPKM. Or do you think normalising for JUST gene length is sufficient in this type of study? In the mentioned techniques, READ length is taken into account, but since the read length is the same for all the samples, I suppose it's redundant?

Reply written 8 months ago by Hansen_869 (modified 8 months ago)

The rest of the normalization should follow the recommendations of the metagenomics package you use, which I assume differ from package to package. Additional normalization for gene length using TPM, for example, still makes sense, even if you first downsample to the same number of reads.

Reply written 8 months ago by tshtatland
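For reference, a small sketch of how TPM is usually computed from a fragment-count vector, in plain Python. The function name `tpm` and the example numbers are invented for illustration. Each count is first divided by its gene length (the step already done in the question), then the resulting rates are rescaled to sum to one million per sample, which is the part that removes the dependence on how many reads each FASTQ file contains.

```python
def tpm(counts, lengths):
    """Transcripts (here: genes) per million for one sample.

    counts  -- fragment counts per gene
    lengths -- gene lengths in bp, same order as counts
    """
    # Length-normalized rate per gene: fragments per base.
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    # Rescale so the values of one sample sum to 1e6.
    return [r / total * 1e6 for r in rates]


# Hypothetical example: sample_b was sequenced twice as deeply as sample_a.
gene_lengths = [1000, 2000, 500]
sample_a = tpm([100, 400, 50], gene_lengths)
sample_b = tpm([200, 800, 100], gene_lengths)
```

In this toy case both samples end up with the same TPM values although sample_b has twice the raw counts. Note that TPM values are relative to the genes included in the matrix, so a gene's TPM can shift when other genes change in abundance.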