How to decontaminate a genome after assembly
3.9 years ago
Damon_Wan • 0

Hello everyone, I am a rookie who has just started learning bioinformatics, and I have a question.

I have a FASTA file of contigs that were assembled from reads with CLC Workbench, plus reference genome files for the same species. How can I filter the FASTA file to get a clean, decontaminated genome?

THANK YOU FOR ANSWERING!!!

genome Assembly • 2.6k views

If you suspected contamination in your sequence data, you should have tried to remove those sequences before assembling the data.

Do you have evidence that there is contamination in your assembly (you even used a reference genome of the same organism)?


Thank you for answering!

When I rechecked the CLC workflow, it turned out that no reference was used. It is a de novo assembly without references. The genome I want to assemble is from a protozoan. The data were produced on an Illumina HiSeq 2500, so I think there must be some contamination from bacteria or other organisms.

What should I do with the assembled contigs file?

Thank you again!

3.9 years ago
h.mon 35k

I like BlobTools for checking assembly contamination: it combines BLAST results, read mapping coverage and GC content to explore possible contaminants. There is also BlobTools2, which I've never used, but by its description it is nicer than the previous version. BlobTools is a post-assembly evaluation tool.
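To make the GC-content part of this concrete, here is a minimal Python sketch that computes length and GC fraction per contig from a FASTA file. It is not BlobTools and does not reproduce its output; the function names (`parse_fasta`, `gc_fraction`, `contig_stats`) are made up for illustration. Contigs whose GC fraction sits far from the bulk of the assembly are candidates for a closer look, ideally together with coverage and BLAST hits.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the contig ID.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq: str) -> float:
    """GC fraction over unambiguous bases (A, C, G, T); 0.0 if there are none."""
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def contig_stats(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map contig ID -> (length, GC fraction)."""
    return {name: (len(seq), gc_fraction(seq)) for name, seq in parse_fasta(lines)}


if __name__ == "__main__":
    # Tiny in-memory example; for a real assembly pass an open file handle instead.
    demo = [">contig_1 demo", "ATGCGC", "GGCC", ">contig_2", "ATATATNN"]
    for name, (length, gc) in contig_stats(demo).items():
        print(f"{name}\t{length}\t{gc:.2f}")
```

For a real file, something like `contig_stats(open("contigs.fasta"))` should work, where `contigs.fasta` is a placeholder name.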

You can also use sketches to analyse contamination, either on your raw data (pre-assembly), or on assemblies, see:

Mash Screen: what's in my sequencing run?

What’s in my metagenome?

Tool: BBSketch - A Tool for Rapid Sequence Comparison

Both pre- and post-assembly, you can also use k-mer screening tools like Kraken, CLARK or Centrifuge to screen and filter out contaminants.
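As a toy illustration of the general idea behind k-mer screening (this is not how Kraken, CLARK, Centrifuge or Mash are implemented, just the underlying intuition), the sketch below measures what fraction of a contig's k-mers also occur in a suspected contaminant sequence. The function names and the choice of `k` are assumptions for the example; real tools use much larger k, indexed databases and taxonomy.

```python
from typing import Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """Set of canonical k-mers (the smaller of a k-mer and its reverse complement).

    K-mers containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def containment(contig: str, reference_kmers: Set[str], k: int) -> float:
    """Fraction of the contig's distinct k-mers that are present in the reference set."""
    kmers = canonical_kmers(contig, k)
    if not kmers:
        return 0.0
    return len(kmers & reference_kmers) / len(kmers)


if __name__ == "__main__":
    k = 5  # far too small for real data; chosen so the toy example works
    contaminant = "ACGTTGCAAGGCTTAACCGGTA"
    ref = canonical_kmers(contaminant, k)
    print(containment("TTGCAAGGCTTAACC", ref, k))  # piece of the contaminant
    print(containment("TTTTTTTTTT", ref, k))       # unrelated sequence
```

A contig with containment close to 1 against a contaminant genome is a strong candidate for removal, while a low value says little on its own.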

3.9 years ago
Mensur Dlakic ★ 27k

If the contaminants are reasonably unrelated to the genome of interest, you can separate them after the assembly using t-SNE or UMAP embedding based on tetranucleotide frequencies. This will work in most cases, especially if you can afford to throw away contigs smaller than 5 kb.

To illustrate this, I mixed 3 random contigs from human chromosome 1 with complete genomes of S. cerevisiae and E. coli. First you have 2D UMAP embedding, followed by t-SNE.

[Image: 2D UMAP embedding] [Image: t-SNE embedding]
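For readers who want to try this, the input to the embedding is one tetranucleotide frequency vector per contig. The sketch below is not the script used for the plots above; it is a minimal standard-library version with made-up names (`tetranucleotide_frequencies`, `TETRAMER_INDEX`), and it counts each 4-mer as-is, without merging reverse complements as some tools do.

```python
from itertools import product
from typing import List

# All 256 possible tetranucleotides in a fixed order, and their column index.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
TETRAMER_INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetranucleotide_frequencies(seq: str) -> List[float]:
    """Return a 256-long vector of tetranucleotide frequencies for one contig.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored. An all-zero vector is returned if no valid window exists.
    """
    seq = seq.upper()
    counts = [0] * len(TETRAMERS)
    for i in range(len(seq) - 3):
        idx = TETRAMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [c / total for c in counts]


if __name__ == "__main__":
    vec = tetranucleotide_frequencies("ACGTACGT")
    print(len(vec), vec[TETRAMER_INDEX["ACGT"]])
    # Stacking one such vector per contig gives a (contigs x 256) matrix that
    # can be passed to an embedding library, e.g. umap-learn's UMAP or
    # scikit-learn's TSNE, via their fit_transform methods.
```

Short contigs give noisy frequency vectors, which is one reason for discarding contigs below a few kb before embedding.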


Thank you for answering!

I think I will look for some papers and protocols and give this approach a try. By the way, do you have any papers or other articles to recommend?


My plots are from custom scripts, but there are plenty of similar solutions. I recommend VizBin if ease of use and installation are your priority. Also:

Links to relevant publications are there as well.


Thank you very much!

3.9 years ago
vinicius ▴ 10

You can download the genome sequence of putative contaminants, merge them into a single multi-fasta file and then align your raw assembly against this multi-fasta, retaining only unmapped contigs.
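A possible way to do the "retain only unmapped contigs" step is sketched below. It assumes the alignment was run with the contigs as query and the contaminant multi-fasta as subject, and that the hits were saved in the standard 12-column BLAST tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names and the thresholds (90% identity, 50% of the contig covered) are illustrative assumptions, not recommended values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def covered_bases(intervals: List[Tuple[int, int]]) -> int:
    """Number of positions covered by a list of 1-based inclusive intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def flag_contaminated(
    hit_lines: Iterable[str],
    contig_lengths: Dict[str, int],
    min_identity: float = 90.0,   # assumed threshold, tune for your data
    min_fraction: float = 0.5,    # assumed threshold, tune for your data
) -> Set[str]:
    """Return IDs of contigs largely covered by high-identity contaminant hits."""
    intervals = defaultdict(list)
    for line in hit_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, pident = fields[0], float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        if pident >= min_identity:
            intervals[qseqid].append((min(qstart, qend), max(qstart, qend)))
    return {
        q for q, iv in intervals.items()
        if q in contig_lengths and covered_bases(iv) / contig_lengths[q] >= min_fraction
    }


if __name__ == "__main__":
    lengths = {"c1": 100, "c2": 100}
    hits = ["c1\tcontam\t99.0\t90\t0\t0\t1\t90\t100\t189\t1e-20\t110"]
    flagged = flag_contaminated(hits, lengths)
    keep = [name for name in lengths if name not in flagged]
    print(keep)
```

Requiring a minimum covered fraction avoids discarding a long contig of the target genome because of one short conserved hit, such as an rRNA gene fragment.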
