Question: How to decontaminate a genome after assembly
Damon_Wan wrote (4 weeks ago):

Hello everyone, I am a beginner who has just started learning bioinformatics, and I have a question.

I have a FASTA file of contigs that were assembled from reads with CLC Workbench, along with reference genome files for the same species. How can I filter the FASTA file to obtain a clean, decontaminated genome?

THANK YOU FOR ANSWERING!!!


If you suspected contamination in your sequence data, you should have tried to remove those sequences before assembling the data.

Do you have evidence that there is contamination in your assembly (you even used a reference genome of the same organism)?

Reply written 4 weeks ago by genomax (modified 4 weeks ago)

Thank you for answering!

When I rechecked the CLC workflow, it turned out that no reference was used: it is a de novo assembly. The genome I want to assemble is from a protozoan. The data were produced on an Illumina HiSeq 2500, and I think there must be some contamination from bacteria or other organisms.

May I ask what I should do with the assembled contigs file?

Thank you again!

Reply written 4 weeks ago by Damon_Wan
h.mon wrote (4 weeks ago):

I like BlobTools for checking assembly contamination: it combines BLAST results, read-mapping coverage and GC content to explore possible contaminants. There is also BlobTools2, which I have never used, but by its description it improves on the previous version. BlobTools is a post-assembly evaluation tool.
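To make one of those signals concrete, here is a minimal sketch that computes length and GC content per contig from a FASTA file. It is not BlobTools code and does not reproduce its output; the function names (`parse_fasta`, `gc_fraction`, `contig_stats`) are hypothetical, and coverage and BLAST hits would still have to come from read mapping and a database search.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def contig_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each contig ID to (length, GC fraction)."""
    return {name: (len(seq), gc_fraction(seq)) for name, seq in parse_fasta(lines)}


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle instead.
    demo = [">contig_1 example", "ATGCGCGC", ">contig_2", "ATATATAT"]
    for name, (length, gc) in contig_stats(demo).items():
        print(name, length, round(gc, 2))
```

Contigs whose GC content sits far from the bulk of the assembly are candidates for a closer look, but GC alone is not proof of contamination.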

You can also use sketches to analyse contamination, either on your raw data (pre-assembly), or on assemblies, see:

Mash Screen: what's in my sequencing run?

What’s in my metagenome?

Tool: BBSketch - A Tool for Rapid Sequence Comparison

Both pre- and post-assembly, you can also use k-mer screening tools such as Kraken, CLARK or Centrifuge to screen for and filter out contaminants.
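As a toy illustration of the general idea behind k-mer screening, the sketch below measures what fraction of a contig's k-mers also occur in a contaminant reference. This is not how Kraken, CLARK or Centrifuge are implemented (they use indexed databases and taxonomic classification); the function names and the choice of k are assumptions made for the example.

```python
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """Set of canonical k-mers (the smaller of a k-mer and its reverse complement).

    k-mers containing ambiguous bases such as N are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def contaminant_fraction(contig: str, contaminant_kmers: Set[str], k: int) -> float:
    """Fraction of the contig's distinct k-mers that are found in the contaminant set."""
    kmers = canonical_kmers(contig, k)
    if not kmers:
        return 0.0
    return len(kmers & contaminant_kmers) / len(kmers)


if __name__ == "__main__":
    k = 5  # far too small for real genomes; chosen only to keep the demo short
    reference = "ACGTTGCAAGGCTTAACCGG"
    ref_kmers = canonical_kmers(reference, k)
    print(contaminant_fraction(reference, ref_kmers, k))
    print(contaminant_fraction("TTTTTTTTTT", ref_kmers, k))
```

On real data, the dedicated tools are the better choice: they handle large reference databases and report taxonomic assignments rather than a single fraction.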

Mensur Dlakic wrote (4 weeks ago):

If the contaminants are reasonably unrelated to the genome of interest, you can separate them after the assembly using t-SNE or UMAP embedding based on tetranucleotide frequencies. This will work in most cases, especially if you can afford to throw away contigs smaller than 5 Kb.

To illustrate this, I mixed 3 random contigs from human chromosome 1 with the complete genomes of S. cerevisiae and E. coli. The first plot is a 2D UMAP embedding, and the second is a t-SNE embedding.

[Figure: 2D UMAP embedding] [Figure: t-SNE embedding]
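The feature step of this approach can be sketched as follows. This is not the custom script used for the plots above; it is a minimal example, with hypothetical function names, that turns each sufficiently long contig into a 256-dimensional tetranucleotide frequency vector. It counts one strand only (some implementations merge reverse-complement tetramers), and the 5 kb cutoff is taken from the suggestion above.

```python
from itertools import product
from typing import Dict, List

# All 256 tetranucleotides in a fixed order, and their positions in the vector.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {tetramer: i for i, tetramer in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(seq: str) -> List[float]:
    """Relative frequency of each tetranucleotide in seq (overlapping windows).

    Windows containing ambiguous bases (e.g. N) are ignored.
    """
    counts = [0] * len(TETRAMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(TETRAMERS)


def profile_contigs(contigs: Dict[str, str], min_length: int = 5000) -> Dict[str, List[float]]:
    """Frequency vectors for contigs of at least min_length bases."""
    return {
        name: tetranucleotide_frequencies(seq)
        for name, seq in contigs.items()
        if len(seq) >= min_length
    }


if __name__ == "__main__":
    contigs = {"long_contig": "ACGT" * 2000, "short_contig": "ACGTACGT"}
    profiles = profile_contigs(contigs)
    print(sorted(profiles))           # only contigs passing the length filter
    print(len(profiles["long_contig"]))
```

The resulting vectors, stacked into a matrix with one row per contig, could then be passed to an embedding implementation such as `sklearn.manifold.TSNE` or `umap.UMAP` from umap-learn and plotted in 2D; that step is left out here to keep the example free of third-party dependencies.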


Thank you for answering!

I will look for some papers and protocols and try this approach. By the way, do you have any papers or other articles to recommend?

Reply written 4 weeks ago by Damon_Wan

My plots are from custom scripts, but there are plenty of similar solutions. I recommend VizBin if ease of use and installation are your priority. Also:

Links to relevant publications are there as well.

Reply written 4 weeks ago by Mensur Dlakic

Thank you very much!

Reply written 4 weeks ago by Damon_Wan
vinicius wrote (4 weeks ago):

You can download the genome sequences of putative contaminants, merge them into a single multi-FASTA file, and then align your raw assembly against this multi-FASTA, retaining only the unmapped contigs.
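The final filtering step might look like the sketch below. It assumes the alignment has already been run with the contigs as the query and the results saved in a tab-separated table whose first column is the contig ID (as in BLAST tabular output, `-outfmt 6`). The function names are hypothetical, and treating any hit as contamination is a deliberately crude criterion; in practice you would likely apply identity and alignment-length thresholds first.

```python
from typing import Iterable, Iterator, Set


def ids_from_tabular(lines: Iterable[str]) -> Set[str]:
    """Collect query IDs (first column) from tab-separated alignment hits."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def filter_fasta(lines: Iterable[str], contaminated_ids: Set[str]) -> Iterator[str]:
    """Yield the FASTA lines of every record whose ID is not in contaminated_ids."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            contig_id = tokens[0] if tokens else ""
            keep = contig_id not in contaminated_ids
        if keep:
            yield line


if __name__ == "__main__":
    hits = ["contig_2\tNC_000000.0\t99.1\t1200"]  # made-up hit line for the demo
    assembly = [">contig_1\n", "ACGTACGT\n", ">contig_2\n", "GGGGCCCC\n"]
    for line in filter_fasta(assembly, ids_from_tabular(hits)):
        print(line, end="")
```

For real files, pass open file handles for the hit table and the assembly, and write the yielded lines to a new FASTA file.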
