Question

Accounting for differences in FASTQ file size when comparing metagenomic gene abundance between samples

1

4.3 years ago

Hansen_869 ▴ 80

I have 8 count matrices (from bacterial metagenomic DNA sequencing) containing fragment counts, i.e. the number of fragments aligned to each gene. I obtained fragment counts, as opposed to read counts, using featureCounts. I have normalized for gene length (longer genes attract more reads) by dividing the fragment count by the gene length. However, because the FASTQ files differ in size, I wonder whether I should normalize for that too. My guess is that the bigger FASTQ files will map more reads to the contigs, inflating their counts relative to the samples with smaller FASTQ files. My final goal is to compare gene abundances BETWEEN the 8 samples, so relative numbers are fine.

All 8 samples were sequenced in the same way and come from the same environment, but at different time points. Even so, the FASTQ files vary in size by a couple of hundred MB.

gene metagenomics TPM Reads • 1.2k views

score 0 · Answer 1 · 2020-01-16

0

4.3 years ago

tshtatland ▴ 190

I suggest downsampling all FASTQ files to the same number of reads prior to the analysis. This method is commonly used in many other applications, such as RNA-seq and variant calling.
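To illustrate what downsampling to a fixed read count can look like, here is a minimal Python sketch using reservoir sampling (Algorithm R). It is an illustration under stated assumptions and not a tested pipeline: the helper names `read_fastq` and `reservoir_sample` are made up for this example, records are assumed to be plain 4-line FASTQ entries, and the in-memory demo data stands in for real files. In practice a dedicated tool such as `seqtk sample` is likely the more convenient route.

```python
import io
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reservoir_sample(records, n, seed=0):
    """Return up to n records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Keep record i with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir


# Hypothetical demo input: five tiny records instead of a real FASTQ file.
demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5)))
sampled = reservoir_sample(read_fastq(demo), n=2, seed=42)
print(len(sampled))
```

For paired-end data, running the same function with the same seed on both mate files (which must hold the same number of records in the same order) keeps the pairs in sync, because the selection depends only on the record index.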

0

Thanks for your response. I will look into that. Do you suggest I do any other form of normalisation? I have read about TPM, RPKM and FPKM. Or do you think normalising for JUST gene length is sufficient in this type of study? In those techniques READ length is taken into account, but since the read length is the same for all samples, I suppose that is redundant?

written 4.3 years ago by Hansen_869 ▴ 80

0

The rest of the normalization should be done as recommended by the metagenomics package you use; I assume the details depend on the package. Additional normalization for gene length, for example using TPM, still makes sense even if you first downsample to the same number of reads.
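As a rough illustration of the TPM idea, the sketch below divides each gene's fragment count by its length and then rescales the resulting rates so they sum to one million within a sample. The function name `tpm` and all numbers are invented for the example; they are not taken from the data discussed in this thread.

```python
def tpm(fragment_counts, gene_lengths_bp):
    """Convert per-gene fragment counts for ONE sample into TPM values."""
    if len(fragment_counts) != len(gene_lengths_bp):
        raise ValueError("counts and lengths must have the same length")
    # Step 1: length-normalise (fragments per base of gene).
    rates = [c / l for c, l in zip(fragment_counts, gene_lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: rescale so the values sum to one million within the sample.
    return [r / total * 1e6 for r in rates]


# Made-up example: sample B has exactly twice the sequencing depth of sample A.
lengths = [1000, 2000, 500]
sample_a = tpm([100, 200, 300], lengths)
sample_b = tpm([200, 400, 600], lengths)
print(sample_a)
print(sample_b)
```

Both samples come out at roughly 125000, 125000 and 750000, which shows that the per-sample rescaling removes the depth difference. The values are relative abundances within each sample, so they answer "what share of the mapped signal does this gene take up" and not "how many copies are there in absolute terms".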

written 4.3 years ago by tshtatland ▴ 190