Question

Extremely high variance of coverage for metagenomic sample (variance-to-mean ratio ~2.5)

7 weeks ago

k-tarasov • 0

Hello, everyone!

I have 4 separate Illumina sequencing runs for 4 different samples of microbial biofilm from the same location. I assembled contigs from the reads of all 4 samples together (with SPAdes), then mapped the reads from each sample separately. Then I ran MetaBAT's jgi_summarize_bam_contig_depths and got variance-to-mean ratios of coverage depth around 2 (1.9538, 1.8905, 1.9579, 1.896 for the 4 samples, respectively). That looked quite strange to me, but OK, if we did not sequence deeply enough, it might be so. Here are the first 4 data lines of the depth file:

contigName                          contigLen  totalAvgDepth  al_1.bam  al_1.bam-var  al_2.bam  al_2.bam-var  al_3.bam  al_3.bam-var  al_4.bam  al_4.bam-var
NODE_1_length_881468_cov_46.839432  881468     93.7107        19.6243   38.7464       18.0613   33.7377       19.0095   36.5217       37.0157   76.8248
NODE_2_length_803954_cov_9.260631   803954     18.291         3.18857   5.75179       1.54228   2.79595       1.66402   3.13238       11.8961   22.8288
NODE_3_length_757581_cov_10.170569  757581     20.1714        6.60012   12.3694       5.41781   9.69118       6.27614   12.0979       1.87738   3.31236
NODE_4_length_652487_cov_11.289944  652487     22.0799        4.12398   7.45103       3.88173   6.96793       4.42189   8.52849       9.65228   18.577
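
For anyone who wants to reproduce these ratios per contig, here is a minimal Python sketch that reads the four rows above and divides each `-var` column by the matching mean column. The function name `vmr_table` is made up for illustration, and the sketch assumes the column layout shown in the header (three leading columns, then one mean/variance pair per BAM file). The ratios quoted above were presumably computed over the whole depth file, so the per-contig values need not match them exactly.

```python
# The four rows shown above, whitespace-separated as in the depth file.
DEPTH_TEXT = """\
contigName contigLen totalAvgDepth al_1.bam al_1.bam-var al_2.bam al_2.bam-var al_3.bam al_3.bam-var al_4.bam al_4.bam-var
NODE_1_length_881468_cov_46.839432 881468 93.7107 19.6243 38.7464 18.0613 33.7377 19.0095 36.5217 37.0157 76.8248
NODE_2_length_803954_cov_9.260631 803954 18.291 3.18857 5.75179 1.54228 2.79595 1.66402 3.13238 11.8961 22.8288
NODE_3_length_757581_cov_10.170569 757581 20.1714 6.60012 12.3694 5.41781 9.69118 6.27614 12.0979 1.87738 3.31236
NODE_4_length_652487_cov_11.289944 652487 22.0799 4.12398 7.45103 3.88173 6.96793 4.42189 8.52849 9.65228 18.577
"""


def vmr_table(text):
    """Return {contig: {sample: variance / mean}} for a depth table (hypothetical helper)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Assumed layout: columns 0-2 are contigName, contigLen, totalAvgDepth;
    # after that, each BAM file contributes a mean column and a "-var" column.
    samples = header[3::2]
    result = {}
    for row in rows:
        values = [float(x) for x in row[3:]]
        means, variances = values[0::2], values[1::2]
        result[row[0]] = {s: v / m for s, m, v in zip(samples, means, variances)}
    return result


table = vmr_table(DEPTH_TEXT)
for contig, ratios in table.items():
    print(contig, {sample: round(r, 4) for sample, r in ratios.items()})
```

The same function should work on a full depth file read with `open(path).read()`, as long as it follows the layout assumed above.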

I thought that if I merged the reads from all samples and mapped them again, I would get the same total mean coverage but a much lower variance-to-mean ratio, since the mapped reads originate from random regions. I did this and was even more puzzled, because the variance-to-mean ratio did not improve; it became even bigger (2.2065):

contigName                          contigLen  totalAvgDepth  al_merged.bam  al_merged.bam-var
NODE_1_length_881468_cov_46.839432  881468     93.7173        93.7173        217.989
NODE_2_length_803954_cov_9.260631   803954     18.2896        18.2896        34.3618
NODE_3_length_757581_cov_10.170569  757581     20.173         20.173         38.5595
NODE_4_length_652487_cov_11.289944  652487     22.0798        22.0798        43.0854
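
The expectation behind the merge can be checked with a little arithmetic. If the per-position depths of the four samples were independent, their variances would simply add, so the merged variance-to-mean ratio would be (sum of variances) / (sum of means), which is a depth-weighted average of the per-sample ratios rather than something smaller. Any variance beyond that sum would point to positive covariance between samples, i.e. the same positions being over- or under-covered in every sample. The sketch below does this for the four contigs shown; it assumes the `-var` columns are plain per-base variances, and `additivity_check` is a hypothetical helper name.

```python
# Per contig: (per-sample means, per-sample variances, merged mean, merged variance),
# copied from the two depth tables in this post.
ROWS = {
    "NODE_1": ([19.6243, 18.0613, 19.0095, 37.0157],
               [38.7464, 33.7377, 36.5217, 76.8248], 93.7173, 217.989),
    "NODE_2": ([3.18857, 1.54228, 1.66402, 11.8961],
               [5.75179, 2.79595, 3.13238, 22.8288], 18.2896, 34.3618),
    "NODE_3": ([6.60012, 5.41781, 6.27614, 1.87738],
               [12.3694, 9.69118, 12.0979, 3.31236], 20.173, 38.5595),
    "NODE_4": ([4.12398, 3.88173, 4.42189, 9.65228],
               [7.45103, 6.96793, 8.52849, 18.577], 22.0798, 43.0854),
}


def additivity_check(means, variances, merged_mean, merged_var):
    """Compare the merged variance with the sum expected under independence."""
    expected_var = sum(variances)  # Var(X1 + ... + X4) if all covariances are zero
    return {
        "vmr_if_independent": expected_var / sum(means),
        "vmr_merged": merged_var / merged_mean,
        # Equals twice the sum of pairwise covariances, if the assumption above holds.
        "excess_variance": merged_var - expected_var,
    }


for name, row in ROWS.items():
    report = additivity_check(*row)
    print(name, {key: round(value, 4) for key, value in report.items()})
```

Under independence the merged ratio would therefore stay near the per-sample values instead of dropping; what merging does reduce is the standard deviation relative to the mean.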

I thought this might be caused by non-specific mapping of reads that originate from a different contig to conserved regions, so I set the percentIdentity parameter to 100, which leaves no reads with even a single mismatch in the alignment. Coverage decreased by about one third (that's OK), but the variance-to-mean ratio became even bigger again (2.3438)!

contigName                          contigLen  totalAvgDepth  al_merged_100.bam  al_merged_100.bam-var
NODE_1_length_881468_cov_46.839432  881468     61.5177        61.5177            180.272
NODE_2_length_803954_cov_9.260631   803954     13.4113        13.4113            24.8087
NODE_3_length_757581_cov_10.170569  757581     14.8889        14.8889            28.014
NODE_4_length_652487_cov_11.289944  652487     15.8534        15.8534            29.7099
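
A quick way to check the "one third" figure and to see how the ratio moved contig by contig, using only the rows printed in the two merged tables. This is plain arithmetic on those values; `compare` is a hypothetical helper, and since the overall ratios quoted in the text (2.2065 and 2.3438) are presumably computed over all contigs, these four rows need not follow the same trend.

```python
# Per contig: (mean, variance) for the merged mapping and for percentIdentity = 100,
# copied from the two tables above.
MERGED = {
    "NODE_1": (93.7173, 217.989),
    "NODE_2": (18.2896, 34.3618),
    "NODE_3": (20.173, 38.5595),
    "NODE_4": (22.0798, 43.0854),
}
MERGED_100 = {
    "NODE_1": (61.5177, 180.272),
    "NODE_2": (13.4113, 24.8087),
    "NODE_3": (14.8889, 28.014),
    "NODE_4": (15.8534, 29.7099),
}


def compare(before, after):
    """Return the fraction of depth retained and the VMR before and after filtering."""
    mean_before, var_before = before
    mean_after, var_after = after
    return {
        "retained": mean_after / mean_before,
        "vmr_before": var_before / mean_before,
        "vmr_after": var_after / mean_after,
    }


for name in MERGED:
    report = compare(MERGED[name], MERGED_100[name])
    print(name, {key: round(value, 4) for key, value in report.items()})
```

Printing the three numbers side by side shows whether the increase is shared by all contigs or driven by a few of them.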

And now time for questions:
1) What are typical values of the variance-to-mean ratio for metagenomic samples? Is it normal that I got values around 2?
2) How would you explain that merging the reads from all samples gave a BIGGER variance-to-mean ratio?
3) And what about the next step, when I raised the mapping identity threshold to 100? The conserved regions should have recruited far fewer reads while the variable regions kept recruiting the same number, so in theory the variance-to-mean ratio should have dropped. Nevertheless it became bigger.

Wish you all the best, Kirill.

MetaBAT metagenome coverage • 331 views

6 weeks ago by k-tarasov

score 0 · Answer 1 · 2024-06-10

The first question is resolved. I realized that all along I had been thinking of the standard deviation, not the variance, when reading the variance-to-mean ratio. It was a matter of translation: I'm not a native English speaker. So a variance-to-mean ratio of ~2 is in fact not that bad. The standard deviation is the square root of the variance, and the "standard deviation-to-mean" ratio is roughly 0.2. The second and third questions still puzzle me.
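
To make the distinction concrete: the standard-deviation-to-mean ratio (coefficient of variation) is sqrt(variance) / mean, which equals sqrt(VMR / mean), so unlike the variance-to-mean ratio it shrinks as depth grows and differs from contig to contig. A small sketch on the merged rows from the question, with `cv` as an illustrative helper name:

```python
import math

# (mean, variance) per contig from the merged mapping in the question.
MERGED = {
    "NODE_1": (93.7173, 217.989),
    "NODE_2": (18.2896, 34.3618),
    "NODE_3": (20.173, 38.5595),
    "NODE_4": (22.0798, 43.0854),
}


def cv(mean, variance):
    """Coefficient of variation: standard deviation divided by the mean."""
    return math.sqrt(variance) / mean


for name, (mean, variance) in MERGED.items():
    vmr = variance / mean
    # Sanity check of the identity sqrt(var) / mean == sqrt(VMR / mean).
    assert math.isclose(cv(mean, variance), math.sqrt(vmr / mean))
    print(f"{name}: VMR = {vmr:.3f}, SD/mean = {cv(mean, variance):.3f}")
```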