Question

Criteria for filtering contigs after spades assembly

4

Entering edit mode

6.3 years ago

fhsantanna ▴ 610

I have assembled a bacterial genome using spades. The coverage among the contigs is very heterogeneous, while some have more than 30 fold of coverage, other have less than 5 fold. Also, there are a lot of contigs with a length less than 500 bp. I'm aware that I must filter contigs having low coverage and small length (they probably are spurious sequencing products), however what would be reasonable criteria for filtering them? I have found scripts that remove contigs having less than 500 pb and 2 fold coverage. I believe this setting is too permissive, since most of the contigs of my assembly significantly deviate from it.

EDIT: Here is the distribution of log10 contig length:

Here is the distribution of contig coverage:

As we can see, most of contigs have less than 1000 bp, Most of these contigs have less than 5 fold of coverage. My plan is to filter contigs having less than 1000 bp and 5 fold of coverage. After, I am going to check the completeness of the filtered genome sequence using checkm (contrasting to the unfiltered genome).

EDIT 2: Length vs Coverage

spades Assembly contigs filtering • 9.3k views

ADD COMMENT • link updated 5.1 years ago by bhclement • 0 • written 6.3 years ago by fhsantanna ▴ 610

0

Entering edit mode

I think it is a very good question, the answer may depends what you are looking for, firstable, a de novo assembly is absolutely necesary? or a guided assembly is possible? Which is the lenght expected to be? how many reads do you have? paired? un-paired? lenght? I. E. If you have sequenced 100x of the expected length may be you can ignore contigs which coverage is lower than 10x, but if you sequenced 30x, may be you think about it twice.

ADD REPLY • link 6.3 years ago by Buffo ★ 2.4k

0

Entering edit mode

Yes, it is necessary a de novo assembly, because we have sequenced a novel bacterial species. The expected length is between 5 - 6 Mpb. We have about 840,000 paired reads of ~250 bp (a raw coverage of about 80x, (2 x 250 x 840000)/5000000). After genome assembly I have 13 contigs having more than 20 fold of coverage and the rest of them has less than 1kb and less than 5 fold of coverage. In this case it seems pretty clear that I should remove the latter contigs. However I would like to have a rationale of filtering, because we have other genomes with 10 fold of coverage. In this case it is less intuitive to decide which is the proper threshold for contig filtering. Maybe I should plot the distribution of coverage and length in order to decide the best cutoffs (length and coverage).

ADD REPLY • link 6.3 years ago by fhsantanna ▴ 610

0

Entering edit mode

Hello, How do you calculate the coverage of each contig? Thanks.

ADD REPLY • link 5.1 years ago by bhclement • 0

1

Entering edit mode

6.3 years ago

shwethacm ▴ 240

Why don't you map your input reads back to the assembly and calculate the mean depth ? 2 std deviations from the mean could be your cutoff!

ADD COMMENT • link 6.3 years ago by shwethacm ▴ 240

0

Entering edit mode

Good idea. I'll give it a try.

ADD REPLY • link 6.3 years ago by fhsantanna ▴ 610

0

Entering edit mode

Hi Shwethacm, would please explain how to map read back to the assembly for calculating mean depth?

ADD REPLY • link 6.0 years ago by mohamed_refaat.1992 • 0

score 3 · Accepted Answer · 2018-01-19

3

Entering edit mode

6.3 years ago

shelkmike ★ 1.2k

If you're afraid of contamination or spurious sequencing products, I suggest to align by BLAST all of the contigs to some reference (or even to the NCBI NT or NCBI NR databases entirely), to decide which of the contigs are "Ok" (i.e. have matches to the species you are interested in). Then you can plot the distribution of coverages of good and bad contigs and define some threshold that separates them good enough

ADD COMMENT • link 6.3 years ago by shelkmike ★ 1.2k

0

Entering edit mode

Interesting idea. But I believe this is not the problem. I blasted the small contigs (<1kb), and all of them matched to the expected bacterial genus.

EDIT: In fact, I checked the contigs from other strain. I'll have to verify again if the small contigs are contamination.

ADD REPLY • link 6.3 years ago by fhsantanna ▴ 610

1

Entering edit mode

Also, I must note that your coverage is low. For bacterial genome assembly, average genome coverage of at least 50x is preferred. By the way, I assembled many bacterial genomes and have never seen such a heterogeneous coverage as in your diagram. Really looks like a mixture of genomes of several species with different coverages.

ADD REPLY • link 6.3 years ago by shelkmike ★ 1.2k

score 3 · Accepted Answer · 2018-01-23

3

Entering edit mode

6.3 years ago

h.mon 35k

BlobTools is a great tool to investigate contamination in (expectedly) single-organism assemblies.

ADD COMMENT • link 6.3 years ago by h.mon 35k