Question

Contaminating Sequences And Genome Assembly

12.5 years ago

Fabian Bull ★ 1.3k

I just realised that my newest assembly contains contamination (E. coli), so my question is:

How to deal with contamination in de novo plant assemblies?

Do you know of any database I can BLAST against to find contamination? Do you know of any good pipeline for removing contamination?

assembly • 4.6k views

updated 12.5 years ago by Swbarnes2 ★ 1.6k • written 12.5 years ago by Fabian Bull ★ 1.3k

score 1 · Answer 1 · 2011-10-20

You could try to identify the contaminating entries and remove them. There are sequence databases of plasmids and cloning vectors (BACs, YACs, etc.) that you can screen against. You can also screen against the E. coli reference genome (sequence data for dozens of E. coli strains are listed here).
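As a rough sketch of the screening step: assuming you have BLASTed your contigs against a contaminant database and saved the default 12-column tabular output (`-outfmt 6`), something like the following could flag contigs that are mostly covered by contaminant hits. The function names and the 95% identity / 50% coverage cutoffs are illustrative assumptions, not recommended values, and the contig lengths have to be supplied by you.

```python
from collections import defaultdict


def covered_fraction(intervals, length):
    """Fraction of a contig covered by the union of 1-based inclusive intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        # Count only the part of this hit that extends past what is already covered.
        start = max(start, current_end + 1)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / length


def flag_contaminated(blast_lines, contig_lengths, min_identity=95.0, min_fraction=0.5):
    """Return the set of contig IDs that look like contamination.

    blast_lines: lines of BLAST tabular output (default -outfmt 6 columns assumed).
    contig_lengths: dict mapping contig ID -> contig length in bases.
    The thresholds are arbitrary placeholders; tune them for your data.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        contig, identity = fields[0], float(fields[2])
        # Columns 7 and 8 are the query (contig) start and end of the hit.
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        if identity >= min_identity:
            hits[contig].append((qstart, qend))
    return {
        contig
        for contig, intervals in hits.items()
        if covered_fraction(intervals, contig_lengths[contig]) >= min_fraction
    }
```

Contigs that are only partly covered by hits are not flagged by this sketch; those are better inspected by hand, since they may be chimeric or contain real plant sequence.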

You must remove or mask the contaminating E. coli because it will wreak havoc on your assembly. You may well need to go back to the reads and remove bacterial, cloning-vector, and other contaminants at that level, then reassemble to get the best (longest, fewest in number) genome contigs.
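If you go the masking route, the idea is just to overwrite the contaminating stretches with N. A minimal sketch, assuming you already have 1-based inclusive coordinates of the contaminant hits on a contig (the function name `mask_intervals` is made up for illustration):

```python
def mask_intervals(sequence, intervals):
    """Replace the bases inside each 1-based inclusive (start, end) interval with N."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range coordinates are ignored.
        for i in range(max(start, 1) - 1, min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


print(mask_intervals("ACGTACGT", [(3, 5)]))  # prints ACNNNCGT
```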

score 1 · Answer 2 · 2011-10-20

If you know in advance what your contamination is, you could:

1) Align the reads to E. coli and your plant genome.
2) Filter the E. coli reads out of the BAM.
3) Convert what's left of the BAM back to FASTQ (some programs, like Velvet, will take the .bam as is). Picard can do BAM-to-FASTQ conversions; there are of course other ways.
4) Assemble that.
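To illustrate steps 2 and 3 above, here is a minimal sketch that works on SAM text (for example the output of `samtools view -h`) rather than on the binary BAM. It assumes you know the names of the E. coli reference sequences in the combined reference; `fastq_without_contaminant` and `contaminant_refs` are hypothetical names. A read is dropped entirely if either mate has a primary alignment to a contaminant sequence.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_without_contaminant(sam_lines, contaminant_refs):
    """Return FASTQ records for reads with no primary alignment to a contaminant.

    sam_lines: SAM lines (header lines starting with '@' are skipped).
    contaminant_refs: set of reference names (RNAME) belonging to the contaminant.
    """
    records = []
    contaminated = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname, seq, qual = f[0], int(f[1]), f[2], f[9], f[10]
        if flag & 0x900:  # skip secondary (0x100) and supplementary (0x800) records
            continue
        if rname in contaminant_refs:
            contaminated.add(name)
            continue
        if flag & 0x10:  # stored reverse-complemented; restore original orientation
            seq, qual = reverse_complement(seq), qual[::-1]
        mate = "/1" if flag & 0x40 else "/2" if flag & 0x80 else ""
        records.append((name, f"@{name}{mate}\n{seq}\n+\n{qual}\n"))
    # Second pass: drop any read whose name was seen on a contaminant reference.
    return [record for name, record in records if name not in contaminated]
```

This holds all kept records in memory, which is fine for a sketch but not for a full sequencing run; for real data a dedicated tool is the safer choice.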

Or, the more thorough approach:

1) Align to the plant genome.
2) Pull out the unmapped reads (or unmapped BAM), and assemble that.
3) Identify what organism the big contigs come from.
4) Add those genomes to your plant genome, and go back to step 1.
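Step 2 of this approach comes down to checking the "unmapped" bit (0x4) in the SAM FLAG field. A minimal sketch on SAM text, with the hypothetical function name `unmapped_fastq`; for paired data you may prefer to require that both mates are unmapped, which this sketch does not do:

```python
UNMAPPED = 0x4
SECONDARY_OR_SUPPLEMENTARY = 0x900


def unmapped_fastq(sam_lines):
    """Yield a FASTQ record for every unmapped primary read in SAM text."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if (flag & SECONDARY_OR_SUPPLEMENTARY) or not (flag & UNMAPPED):
            continue
        # Unmapped reads are normally stored in their original orientation,
        # so sequence and qualities are written out unchanged.
        yield f"@{f[0]}\n{f[9]}\n+\n{f[10]}\n"
```

The resulting reads can then be assembled on their own, and the larger contigs BLASTed to work out which organisms they come from.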