Question: Criteria for filtering contigs after spades assembly
1
gravatar for fhsantanna
10 months ago by
fhsantanna410
Brazil
fhsantanna410 wrote:

I have assembled a bacterial genome using spades. The coverage among the contigs is very heterogeneous, while some have more than 30 fold of coverage, other have less than 5 fold. Also, there are a lot of contigs with a length less than 500 bp. I'm aware that I must filter contigs having low coverage and small length (they probably are spurious sequencing products), however what would be reasonable criteria for filtering them? I have found scripts that remove contigs having less than 500 pb and 2 fold coverage. I believe this setting is too permissive, since most of the contigs of my assembly significantly deviate from it.

EDIT: Here is the distribution of log10 contig length:

Here is the distribution of contig coverage:

As we can see, most of contigs have less than 1000 bp, Most of these contigs have less than 5 fold of coverage. My plan is to filter contigs having less than 1000 bp and 5 fold of coverage. After, I am going to check the completeness of the filtered genome sequence using checkm (contrasting to the unfiltered genome).

EDIT 2: Length vs Coverage

filtering contigs spades assembly • 1.3k views
ADD COMMENTlink modified 9 months ago • written 10 months ago by fhsantanna410

I think it is a very good question, the answer may depends what you are looking for, firstable, a de novo assembly is absolutely necesary? or a guided assembly is possible? Which is the lenght expected to be? how many reads do you have? paired? un-paired? lenght? I. E. If you have sequenced 100x of the expected length may be you can ignore contigs which coverage is lower than 10x, but if you sequenced 30x, may be you think about it twice.

ADD REPLYlink written 10 months ago by Buffo1.2k

Yes, it is necessary a de novo assembly, because we have sequenced a novel bacterial species. The expected length is between 5 - 6 Mpb. We have about 840,000 paired reads of ~250 bp (a raw coverage of about 80x, (2 x 250 x 840000)/5000000). After genome assembly I have 13 contigs having more than 20 fold of coverage and the rest of them has less than 1kb and less than 5 fold of coverage. In this case it seems pretty clear that I should remove the latter contigs. However I would like to have a rationale of filtering, because we have other genomes with 10 fold of coverage. In this case it is less intuitive to decide which is the proper threshold for contig filtering. Maybe I should plot the distribution of coverage and length in order to decide the best cutoffs (length and coverage).

ADD REPLYlink modified 10 months ago • written 10 months ago by fhsantanna410
3
gravatar for h.mon
9 months ago by
h.mon21k
Brazil
h.mon21k wrote:

BlobTools is a great tool to investigate contamination in (expectedly) single-organism assemblies.

ADD COMMENTlink written 9 months ago by h.mon21k
2
gravatar for shelkmike
10 months ago by
shelkmike70
Russian Federation
shelkmike70 wrote:

If you're afraid of contamination or spurious sequencing products, I suggest to align by BLAST all of the contigs to some reference (or even to the NCBI NT or NCBI NR databases entirely), to decide which of the contigs are "Ok" (i.e. have matches to the species you are interested in). Then you can plot the distribution of coverages of good and bad contigs and define some threshold that separates them good enough

ADD COMMENTlink written 10 months ago by shelkmike70

Interesting idea. But I believe this is not the problem. I blasted the small contigs (<1kb), and all of them matched to the expected bacterial genus.

EDIT: In fact, I checked the contigs from other strain. I'll have to verify again if the small contigs are contamination.

ADD REPLYlink modified 9 months ago • written 9 months ago by fhsantanna410
1

Also, I must note that your coverage is low. For bacterial genome assembly, average genome coverage of at least 50x is preferred. By the way, I assembled many bacterial genomes and have never seen such a heterogeneous coverage as in your diagram. Really looks like a mixture of genomes of several species with different coverages.

ADD REPLYlink written 9 months ago by shelkmike70
1
gravatar for shwethacm
10 months ago by
shwethacm200
Seattle, WA
shwethacm200 wrote:

Why don't you map your input reads back to the assembly and calculate the mean depth ? 2 std deviations from the mean could be your cutoff!

ADD COMMENTlink written 10 months ago by shwethacm200

Good idea. I'll give it a try.

ADD REPLYlink written 9 months ago by fhsantanna410

Hi Shwethacm, would please explain how to map read back to the assembly for calculating mean depth?

ADD REPLYlink written 6 months ago by mohamed_refaat.19920
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 768 users visited in the last hour