Should you normalize metagenomic data?

I am given to understand, based on BBNorm's documentation, that you should not normalize when you have smooth coverage and the expected amount of data. Does this mean coverage over the resulting assembly, or over something else? However, given that normalization is often used to make a dataset computationally tractable, I am sure my understanding must be incorrect?

Furthermore, I have been informed that metagenomic data tends to have smoother coverage, so should you ever normalize metagenomic data?

metagenomic coverage assembly

Normalization is a word used in very different contexts and with radically different interpretations.

In this particular case, "normalization" by BBNorm is a type of downsampling of the data to make it more tractable during the assembly of a single genome. It detects regions with high coverage (where lots of reads fully overlap) and removes reads from those regions.

It is a correction for biases that occur when the sequencing library prep ends up with "runaway" regions that dominate the data. For good data, it should not be needed in the first place.

Run it only when you do have a problem of the type that the tool may be able to solve. But don't just run it by default.
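
If you do decide it is warranted, a run looks something like the minimal sketch below (wrapped in Python only for illustration). It assumes bbnorm.sh from BBTools is on your PATH; reads.fq and normalized.fq are placeholder file names, and target=/min= are the depth parameters described in the BBNorm documentation:

# Minimal sketch of a typical BBNorm run, wrapped in Python.
# Assumes bbnorm.sh (from BBTools) is on the PATH; the file names are placeholders.
import subprocess

subprocess.run(
    [
        "bbnorm.sh",
        "in=reads.fq",        # input reads (placeholder name)
        "out=normalized.fq",  # normalized output (placeholder name)
        "target=100",         # downsample over-represented regions to ~100x depth
        "min=5",              # discard reads with apparent depth below 5x
    ],
    check=True,
)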

Thanks for the response. So BBNorm is aligning reads to each other then, since this is all pre-assembly?

Run it only when you do have a problem of the type that the tool may be able to solve. But don't just run it by default.

An example being a large amount of data with unequal "coverage"?

I am still not fully understanding coverage here, as in my mind coverage is the sequencing depth of raw reads over an assembly, but at this point we have not assembled anything!

The innovation in these tools is exactly that counterintuitive result: the methods are able to figure out which reads will lead to unnaturally high coverage before the reads are ever assembled.

The simplest way to explain the method is that overlapping reads will share short subsequences (kmers). The tool keeps track of and counts these short kmers (say, 10-15 bp long).

Reads whose kmers have unexpectedly high counts will overlap with unexpectedly many other reads, which in turn correlates with the "future" coverage of those reads.
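
A toy illustration of that counting idea is below. This is just the concept with made-up reads, not how BBNorm is actually implemented; three of the reads come from the same region and one does not:

# Toy illustration of predicting "future" coverage from kmer counts alone
# (conceptual only, not BBNorm's implementation; reads and k are made up).
from collections import Counter

reads = [
    "ACGGTTCAGATTCCGA",   # three reads sampled from the same region...
    "GGTTCAGATTCCGATG",
    "TTCAGATTCCGATGGA",
    "GATCCTAGCATCGGAA",   # ...and one unrelated, low-coverage read
]
k = 10

def kmers(seq):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Pass 1: count every kmer across all reads.
counts = Counter(km for r in reads for km in kmers(r))

# Pass 2: estimate each read's depth as the median count of its kmers.
for r in reads:
    cs = sorted(counts[km] for km in kmers(r))
    print(r, "estimated depth ~", cs[len(cs) // 2])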

As for how to detect whether you even need to run the tool: well, it is not that simple. One indication may be that you have far more data than you need, for example 10000x coverage; or that when you align your reads to a related organism you observe major variations in coverage; or that the assembly does not seem to work, etc. There are many sources of problems and there is no single way to detect them.
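
For the "align to a related organism" check, one rough way to quantify the unevenness is the sketch below. It assumes you have already produced a per-base depth table with samtools depth (the file name is a placeholder) and simply reports the mean depth and its coefficient of variation:

# Rough check of coverage evenness over a related reference, using the
# three-column output of `samtools depth` (reference, position, depth).
# "related_depth.txt" is a placeholder file name.
import statistics

depths = []
with open("related_depth.txt") as fh:
    for line in fh:
        _ref, _pos, depth = line.rstrip("\n").split("\t")
        depths.append(int(depth))

mean = statistics.mean(depths)
cv = statistics.stdev(depths) / mean if mean else float("inf")
print(f"mean depth {mean:.1f}, coefficient of variation {cv:.2f}")
# A very large coefficient of variation (wildly uneven depth) is the kind of
# symptom where normalization may be worth trying.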

Thank you for the explanations. Given that it is metagenomic data, it will be quite hard to figure out the answer to most of the "how do I know" questions you pose, I think?

I think I am understanding how it works now. It is a kmer density distribution, sort of similar to Jellyfish?

Reads whose kmers have unexpectedly high counts will overlap with unexpectedly many other reads, which in turn correlates with the "future" coverage of those reads.

Would that not only be the case for kmers at the edges of reads?

The easiest way is to work backward. Suppose you will have coverage after assembly that looks like this (each dash is a kmer, reads are made from multiple kmers, and reads are space-separated):

------ ------ ------ ------ ------ ------ -----
   ---- ---- ---- ----- --- ----- ----- ---- ----  ---
  --- --- --- -- ---- ---
--- --- --- ----- ---
 --- --- ---- ---- --- ---
--- -------- ---- ---- 
 --- ---- --- ---- ---- --- 

How could we tell that we can drop half of the reads over the highly covered area before producing the assembly that gives us the picture above?

We see that the reads in the high-coverage areas all share kmers that are highly and roughly equally abundant, and many reads are formed from the same groups of abundant kmers. Thus we can drop some fraction of those reads without losing information in that highly redundant region.

Of course, the rationale critically depends on a number of factors, such as the region being reasonably unique. For that reason, these normalizations are rarely foolproof and can and will introduce various artifacts.

We only normalize when the benefits (usually demonstrated by being able to produce a high quality assembly) outweigh the costs.

The "proof" that normalization was indeed a good choice comes from the assembly working out better.

Many thanks for the effort to explain it like that; it makes sense now! Next comes the mystical world of assembly stats and what they actually tell us, so we can know whether normalization has been a benefit!
