Question

How to decontaminate a genome after assembly

0

3.9 years ago

Damon_Wan • 0

Hello everyone, I am a beginner who has just started learning bioinformatics, and I have a question.

I have a FASTA file of contigs that were assembled from reads with CLC Workbench, along with reference genome files for the same species. How can I filter the FASTA file to obtain a clean, decontaminated genome?

Thank you for answering!

genome Assembly • 2.6k views

ADD COMMENT • link updated 3.9 years ago by h.mon 35k • written 3.9 years ago by Damon_Wan • 0

1

If you suspected contamination in your sequence data, you should have tried to remove those sequences before assembling the data.

Do you have evidence that there is contamination in your assembly (you even used a reference genome of the same organism)?

ADD REPLY • link 3.9 years ago by GenoMax 141k

0

Thank you for answering!

On rechecking the CLC workflow, it turns out that no reference was used: it is a de novo assembly. The genome I want to assemble is from a protozoan. The data were produced on an Illumina HiSeq 2500, so I think there must be some contamination from bacteria or other organisms.

May I ask what I should do with the assembled contigs file?

Thank you again!

ADD REPLY • link 3.9 years ago by Damon_Wan • 0

score 2 · Answer 1 · 2020-06-01

I like BlobTools for checking assembly contamination: it combines BLAST results, read-mapping coverage and GC content to explore possible contaminants. BlobTools is a post-assembly evaluation tool. There is also a BlobTools2, which I have never used; by its description, it is nicer than the previous version.
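As a rough illustration of one of the signals mentioned above, here is a minimal Python sketch that computes the length and GC fraction of each contig in a FASTA file. It is not how BlobTools is implemented, and the function names (`parse_fasta`, `gc_fraction`, `contig_stats`) are made up for this example. Contigs whose GC content sits far from the bulk of the assembly are candidates for a closer look, not proven contaminants.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def contig_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each contig ID to (length, GC fraction)."""
    return {name: (len(seq), gc_fraction(seq)) for name, seq in parse_fasta(lines)}
```

An open file handle can be passed directly as `lines`, since iterating over a text file yields its lines. Per-contig read coverage, the other axis BlobTools plots, has to come from mapping the reads back to the assembly and is not covered here.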

You can also use sketches to analyse contamination, either on your raw data (pre-assembly), or on assemblies, see:

Mash Screen: what's in my sequencing run?

What’s in my metagenome?

Tool: BBSketch - A Tool for Rapid Sequence Comparison

Both pre- and post-assembly, you can also use k-mer screening tools like Kraken, CLARK or Centrifuge to screen and filter out contaminants.
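To give a feel for what sketch-based and k-mer screening tools measure, here is a toy Python example of exact k-mer containment: the fraction of a candidate contaminant's distinct k-mers that also occur in a sample (reads or contigs). This is only a sketch of the idea. The real tools listed above use MinHash sketches or indexed databases to scale, and the function names and the default `k = 21` are assumptions for illustration.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: str, sample_seqs: Iterable[str], k: int = 21) -> float:
    """Fraction of the query's distinct k-mers that are present in the sample."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    sample_kmers: Set[str] = set()
    for s in sample_seqs:
        sample_kmers |= kmers(s, k)
    return len(query_kmers & sample_kmers) / len(query_kmers)
```

A containment close to 1 for, say, a bacterial reference suggests that organism is present in the sample. Note that this toy version ignores reverse complements and holds every k-mer in memory, so it is only practical on small inputs.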

score 1 · Answer 2 · 2020-05-31

1

3.9 years ago

Mensur Dlakic ★ 27k

If the contaminants are reasonably unrelated to the genome of interest, you can separate them after the assembly using t-SNE or UMAP embedding based on tetranucleotide frequencies. This will work in most cases, especially if you can afford to throw away contigs smaller than 5 kb.

To illustrate this, I mixed 3 random contigs from human chromosome 1 with complete genomes of S. cerevisiae and E. coli. The first plot is the 2D UMAP embedding, followed by t-SNE.

[Image: 2D UMAP embedding, followed by t-SNE embedding, of the mixed contigs]
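For readers who want to try this, the feature-extraction step can be sketched in a few lines of Python. This is not the custom script behind the plots above, only a minimal illustration of turning a contig into a 256-dimensional tetranucleotide frequency vector. The names `TETRAMERS` and `tetranucleotide_frequencies` are made up for the example.

```python
from collections import Counter
from itertools import product
from typing import List

BASES = "ACGT"
# All 4^4 = 256 tetranucleotides in a fixed order, so vectors are comparable.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def tetranucleotide_frequencies(seq: str) -> List[float]:
    """Return the relative frequency of each tetranucleotide in a sequence.

    Overlapping windows are counted on the forward strand only; windows
    containing N or other ambiguity codes are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]
```

Stacking one such vector per contig gives a matrix that could then be passed to an embedding library (for example the `umap-learn` package or scikit-learn's `TSNE`) to obtain 2D coordinates for plotting. Short contigs give noisy frequency estimates, which is why discarding contigs below a few kb helps.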

ADD COMMENT • link 3.9 years ago by Mensur Dlakic ★ 27k

0

Thank you for answering!

I will look for some papers and protocols and give this a try. By the way, do you have any recommended papers or other articles?

ADD REPLY • link 3.9 years ago by Damon_Wan • 0

1

My plots are from custom scripts, but there are plenty of similar solutions. I recommend VizBin if ease of use and installation are your priority. Also:

Links to relevant publications are there as well.

ADD REPLY • link 3.9 years ago by Mensur Dlakic ★ 27k

0

Thank you very much!

ADD REPLY • link 3.9 years ago by Damon_Wan • 0

score 0 · Answer 3 · 2020-05-31

0

3.9 years ago

vinicius ▴ 10

You can download the genome sequences of putative contaminants, merge them into a single multi-FASTA file and then align your raw assembly against this multi-FASTA, retaining only the unmapped contigs.
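The filtering step can be sketched as follows, assuming the alignment was done with BLAST and written in tabular format (`-outfmt 6`, where column 1 is the query ID, column 3 the percent identity and column 4 the alignment length). The function names and the thresholds (`min_identity=90.0`, `min_length=500`) are illustrative assumptions, not recommended values.

```python
from typing import Dict, Iterable, Set


def flagged_contigs(
    blast_lines: Iterable[str],
    min_identity: float = 90.0,
    min_length: int = 500,
) -> Set[str]:
    """Return IDs of contigs with at least one hit passing both thresholds."""
    flagged: Set[str] = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, pident, length = fields[0], float(fields[2]), int(fields[3])
        if pident >= min_identity and length >= min_length:
            flagged.add(qseqid)
    return flagged


def keep_unflagged(contigs: Dict[str, str], flagged: Set[str]) -> Dict[str, str]:
    """Return only the contigs whose IDs were not flagged as contaminants."""
    return {name: seq for name, seq in contigs.items() if name not in flagged}
```

Be careful with the thresholds: a single local hit (for example to a conserved gene) does not prove that a whole contig is a contaminant, so it is worth inspecting the flagged contigs before discarding them.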

ADD COMMENT • link 3.9 years ago by vinicius ▴ 10