Contaminating Sequences And Genome Assembly
12.5 years ago
Fabian Bull ★ 1.3k

I just realised that my newest assembly contains contamination (E. coli) so my question is:

How to deal with contamination in de novo plant assemblies?

Do you know of any database I can BLAST against to find contamination? Do you know of any good pipelines for removing it?

assembly • 4.6k views
12.5 years ago

You could try to identify the contaminating entries and remove them. There are sequence databases of plasmids and cloning vectors (BACs, YACs, etc.) that you can screen against. You can also screen against the E. coli reference genome (sequence data for dozens of E. coli strains are listed here).

You must remove or mask the contaminating E. coli because it will wreak havoc on your assembly. It may very well be that you need to go back to the reads, and remove bacterial, cloning vector and other contaminants at that level. Then, reassemble in order to get the best (longest, fewest in number) genome contigs.
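
As a rough illustration of the screening step, here is a minimal sketch that reads BLAST tabular output (`-outfmt 6`, contigs as queries against an E. coli or vector database) and flags contigs that are mostly covered by high-identity hits. The function name, the `contig_lengths` dictionary, and both thresholds are assumptions made for this example, not recommended values.

```python
from collections import defaultdict

def flag_contaminated_contigs(blast_lines, contig_lengths,
                              min_identity=95.0, min_covered=0.5):
    """Return {contig: covered_fraction} for contigs mostly covered by hits.

    blast_lines: iterable of BLAST -outfmt 6 lines (contigs are the queries).
    contig_lengths: {contig_name: length_in_bp}, supplied by the caller.
    Thresholds are illustrative assumptions.
    """
    intervals = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
        #                   qstart qend sstart send evalue bitscore
        qseqid, pident = fields[0], float(fields[2])
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        if pident >= min_identity:
            intervals[qseqid].append((qstart, qend))

    flagged = {}
    for contig, spans in intervals.items():
        # Merge overlapping or adjacent hit intervals so bases are counted once.
        spans.sort()
        covered = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        fraction = covered / contig_lengths[contig]
        if fraction >= min_covered:
            flagged[contig] = fraction
    return flagged
```

Contigs with only a short hit (for example a conserved rRNA or organelle region) fall below the coverage threshold and are left alone, which is why the sketch looks at covered fraction rather than at the mere presence of a hit.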

12.5 years ago
Swbarnes2 ★ 1.6k

If you know in advance what your contamination is, you could:

1) Align to E. coli and your plant genome.
2) Filter out the E. coli reads from the BAM.
3) Convert what's left of the BAM back to FASTQ (some programs, like Velvet, will take the .bam as is). Picard can do BAM-to-FASTQ conversions; there are of course other ways.
4) Assemble that.
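
Steps 2 and 3 can be sketched in plain Python on SAM text (for example, the output of `samtools view -h`). This is only an illustration under stated assumptions: the function name and the `contaminant_refs` set of E. coli reference sequence names are made up for the example, and a read name is dropped entirely if any of its alignments hits a contaminant sequence.

```python
# Translation table for reverse-complementing reads aligned to the reverse strand.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def sam_to_clean_fastq(sam_lines, contaminant_refs):
    """Return FASTQ records for reads with no alignment to a contaminant.

    sam_lines: iterable of SAM lines (header lines are skipped).
    contaminant_refs: set of reference names (RNAME) that belong to E. coli.
    """
    records = [line.rstrip("\n").split("\t")
               for line in sam_lines
               if line.strip() and not line.startswith("@")]

    # First pass: any read name (so both mates of a pair) touching a contaminant.
    dirty = {r[0] for r in records if r[2] in contaminant_refs}

    fastq = []
    for r in records:
        qname, flag, seq, qual = r[0], int(r[1]), r[9], r[10]
        # Skip contaminated reads, plus secondary (0x100) and
        # supplementary (0x800) alignments so each read is written once.
        if qname in dirty or flag & 0x900:
            continue
        # SAM stores reverse-strand reads reverse-complemented (flag 0x10);
        # restore the original orientation for the assembler.
        if flag & 0x10:
            seq = seq.translate(COMPLEMENT)[::-1]
            qual = qual[::-1]
        suffix = "/1" if flag & 0x40 else "/2" if flag & 0x80 else ""
        fastq.append(f"@{qname}{suffix}\n{seq}\n+\n{qual}\n")
    return fastq
```

Unmapped reads are kept on purpose, since in a de novo project they may be genuine plant sequence missing from the reference. In practice a dedicated tool (samtools, Picard) will be faster and will handle pairing more carefully than this sketch.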

Or, the more thorough approach:

1) Align to the plant genome.
2) Pull out the unmapped reads (or the unmapped BAM), and assemble those.
3) Identify what organism the big contigs come from.
4) Add those genomes to your plant genome and go back to step 1.
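
The control flow of this loop can be sketched as follows. The three callables (`align`, `assemble`, `identify`) are hypothetical placeholders for whatever aligner, assembler, and BLAST-style lookup you actually use, and `max_rounds` is an assumed safety limit; nothing here is tied to a specific tool.

```python
def iterative_decontamination(reads, references, align, assemble, identify,
                              max_rounds=5):
    """Grow the reference set until unmapped reads yield no new organisms.

    align(reads, references) -> reads that did not map to any reference
    assemble(unmapped_reads) -> contigs built from those reads
    identify(contigs)        -> genomes the big contigs appear to come from
    All three are caller-supplied placeholders (assumptions of this sketch).
    """
    references = list(references)
    for _ in range(max_rounds):
        unmapped = align(reads, references)      # steps 1 and 2: map, keep unmapped
        contigs = assemble(unmapped)             # step 2: assemble the leftovers
        new_refs = [g for g in identify(contigs) if g not in references]  # step 3
        if not new_refs:
            break                                # nothing new found: stop
        references.extend(new_refs)              # step 4: add genomes, repeat
    return references
```

The returned list is the plant genome plus every contaminant genome found along the way, which can then be used as the combined reference for the simpler filter-and-reassemble route.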
