de novo sequence assembly with extremely high coverage
5
2
Entering edit mode
6.7 years ago
ldorsey6 ▴ 20

I am trying to do a de novo sequence assembly of a 7500 bp virus with Illumina data. The coverage is very high (in the millions), so every piece of software I've tried to use is unable to complete this task. Are there any programs out there that are able to handle this? Any advice would be appreciated.

de-novo-sequence-assembly Illumina high-coverage • 5.8k views
2
Entering edit mode

Hi,

I would recommend to select a random subset of your reads to perform a "classic" assembly from your data. In fact, having a "too" high coverage can lead to a wrong assembly. I have seen artefactual repetitions of wide genomic regions in those kind of assemblies.

5
Entering edit mode
6.7 years ago
thackl ★ 2.9k

Don't try to assembly all your data at once. Determine the approximate coverage, for example using a kmer-frequency plot, and down-sample by randomly drawing reads for about 100X.

0
Entering edit mode

Thanks. I'll need to get approval from a supervisor to cut down on the coverage -- are there any risks involved with random selection and reducing the coverage?

0
Entering edit mode

You can do this multiple times and see if the results are consistent by aligning final assemblies. If they are, there are no worries. Another thing to try would be digital normalization.

0
Entering edit mode

Out of curiosity, how would this be different than the approach used by Trinity's normalization?

0
Entering edit mode

I'm not entirely sure about Trinity's specifically, but in general normalization results on downsampling redundant stuff, while random drawing - well it's random. For example, if you normalize a set with let's say a 500X contig (A) and a 50X contig (B) to 50X, than you ideally will have 50X for both in the normalized read set. If you are interested in both 50X and 500X at the same time, use normalization. But be aware, since most errors are low coverage, their coverage will not be reduced, same goes for contaminations. Meaning if you have 10X erroneous reads on your 500X contig, this initially is a low ratio, but after normalization you may still have 10X erroneous reads, but only 50X good reads. Essentially, you have to be aware that normalization can increase the noise to signal ratio

If you only need 500X stuff, as ldorsey6 does, than random sampling is better. It will bring A to 5X and B to 50X and the 5X fraction should get filtered during assembly and not interfere with what you are actually interested in. This strategy can be improved by filtering the subsampled read set for coverage/error correction.

5
Entering edit mode
6.7 years ago

I recommend Tadpole, from the BBMap package - it does a good job on small genomes with super-high coverage, such as viruses. You can also specify a coverage range to assemble. For example:

tadpole.sh in=reads.fq out=contigs.fa k=60 mincr=100000


...will only assemble portions of the genome with coverage of at least 100,000x.

If you want to use a different assembler, you can also use BBNorm to reduce the coverage to 100x, like this:

bbnorm.sh in=reads.fq out=normalized.fq target=100


Normalization is generally better than subsampling if the coverage distribution is irregular.

0
Entering edit mode

I'm really a fan of bbnorm (just saved with a 3GB plant genome assembly), but don't you think that in this case normalization would not be ideal, because you enrich for all the noisy low coverage stuff while only downsampling the reads you are interested in?

2
Entering edit mode

BBNorm by default uses 2-pass mode. In the first pass, it attempts to selectively remove reads containing errors (meaning they have both high and low coverage regions, with a sharp discontinuity) while in the second pass it removes all reads with equal weight regardless of whether they seem to contain errors. This makes it fairly robust against concentrating reads with errors, which would happen using 1-pass normalization. It does not really help against low-coverage contaminants and so forth. But, you can also explicitly direct it to discard that junk:

bbnorm.sh in=reads.fq out=normalized.fq target=100 min=400
___________________________________________________^^^^^^^


In that case, it would discard anything with depth under 400x, and normalize everything else to 100x. There's an interplay between 2-pass mode and setting a high value for min, though, so it may be better to do that in 2 commands like this:

bbnorm.sh in=reads.fq out=highpass.fq target=9999999 min=400 passes=1
bbnorm.sh in=highpass.fq out=normalized.fq target=100


Normalization does not always work well on viruses, though, as they (at least, the ones I have looked at) seem to be extremely polymorphic.

0
Entering edit mode

Tadpole is working great for me. Thanks for the recommendation. I've been aligning the contigs to a reference to see if the de novo assembly is working, and I'm gettting 100% identity. However, tadpole is producing separate contigs that definitely overlap. Why is it producing 5 overlapping contigs instead of just 1?

0
Entering edit mode

How much do they overlap? If they have 100% identity to a reference, they will not overlap by more than a kmer length, because a given kmer will only occur at most once in the output.

The reason for there being 5 contigs is likely polymorphism in the data, or repeats in the genome, though it's hard to say. You may be able to reduce the number of contigs by adding the flag bm1=8, or by first error-correcting the data, or by using a longer kmer like 90 (or if you have 150bp reads, even k=120).

0
Entering edit mode

They overlaps are less then the length of the kmer, your explanation makes sense. That flag helped. Can you explain to me what that flag does?

2
Entering edit mode

"bm1" or "branchmult1" has a default value of 20. It is the minimum allowed ratio between the depths of two possible next kmers when extending a contig (a branch in the graph). So, for example, say you have a contig ACGTTA and you can extend it to the right by adding a C to make ACGTTAC or a G to make ACGTTAG. If there are 10 reads supporting the C and 50 reads supporting the G, then the ratio is 50/10=5, which is too low, and extension would stop. But if there are 10 reads supporting C and 500 supporting G, then the ratio is 50 which would pass the ratio so extension would continue, adding a G.

A higher value is more conservative and leads to a more fragmented assembly, but a lower rate of misassemblies and errors. A lower value will usually give a less fragmented assembly, but with a potentially higher rate of misassemblies and errors.

0
Entering edit mode
6.7 years ago
shwethacm ▴ 230

Hi ldorsey6,

you could downsample a certain coverage (~200-400x) of reads and assemble to get an initial assembly.

You can use the rest of the reads to then fill in the gaps or superscaffold your starting assembly. That way, you will utilize all your data and not overwhelm the assembler.

0
Entering edit mode
6.7 years ago
arnstrm ★ 1.8k

Digital Normalization. For example, KHMER can be used to decrease sampling variation, discard redundant data, and removing the majority of errors to retain only useful information.