Contamination in assembly
2
0
Entering edit mode
4.2 years ago
elena.syp • 0

Hi, I am doing a de novo assembly and I have contamination with Saccharomyces. I cleaned the raw reads (paired reads 150nt) using BBsplit, minimum identity 0.95, but now that I have repeated the assembly I have checked and I still have several contigs which match perfectly or almost perfect with Saccharomyces. What should I do? Repeat the cleaning in the raw reads with lower minimum identity? Cleaning the merged reads? Cleaning the assembly? Thanks!

Assembly bbmap bbsplit • 2.3k views
1
Entering edit mode

Just curious. How did the fish data get contaminated with yeast in the first place?

0
Entering edit mode

I just received the RNA, the people who took the tissue samples extracted the RNA with the same equipment and lab that other people working with bacteria, so I suppose it got contaminated there, but some samples are more contaminated than others, I didn't realize until I finished the assembly, did DESeq and Saccharomyces genes pop up as differentially expressed genes :)

0
Entering edit mode

Ah - if this is RNA-seq, that changes things. When you have introns and you are mapping RNA-seq reads to a genome, leave maxindel at default and probably minid at default, as well, or the reads spanning introns won't map.

0
Entering edit mode

Saccharomyces =/= bacteria (it is a yeast/eukaryote). So it is possible that you don't have any contamination in your data but are seeing conserved genes. I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples.

Here is a test: If you were to do the blast search against a different genome - I assume you used BLAST at NCBI - Enter this in the "Organism" field as test (Drosophila melanogaster (taxid:7227)) and see if you get hits to fruit fly. Saccharomyces may have been the first in the db so it showed up first.

0
Entering edit mode

"I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples".

I wouldn't :P . We've had carry over of DNA on the MiSeq before and I was having a discussion just this morning about the frequency of contaminated assemblies. Yeasts are natural commensals so could have come from the people/air/surfaces/equipment/lab - you name it...

0
Entering edit mode

I recommend bleach wash for carryover contamination if you're not doing it already...

0
Entering edit mode

Yeah we've since resolved it. It wasn't presenting an issue for our assemblies until we started working on aDNA samples and we're seeing suspect reads in Kraken. We now know that it can cause issues so are on the lookout for it.

0
Entering edit mode

I'm not proposing this as an answer but its worth ruling out at least

"Current ecological understanding of fungal-like pathogens of fish: what lies beneath?"

https://www.ncbi.nlm.nih.gov/pubmed/24600442/

0
Entering edit mode

Is the real genome you are working with very distinct from S. cerevisiae? I am going to guess that it is not because otherwise bbsplit should have worked as intended.

0
Entering edit mode

Yes, is a fish trancriptome

0
Entering edit mode

Can you provide the bbsplit command you used? What program did you use for the assembly? Also provide some additional information about several contigs. Is that 1%, 10% what fraction of the total? I will add "bbmap" tag to your post so @Brian will be sure to see this.

0
Entering edit mode

I'm not sure whether this would make a radical difference or not, but can you try the approach from the other side? Map all your reads to Saccharomyces, keeping the unmapped reads which should be fish-specific, and then see what you're left with?

0
Entering edit mode

That is sorta kinda what @elena did when using BBsplit with just the yeast genome.

0
Entering edit mode

Ah ok, I've done it 'manually' in the past with samtools and bwa when I was trying to assemble plasmids from illumina data of lab strains. Only managed to get it to assemble into 1 perfect contig after discarding fragments of the host gDNA in the miniprep by alignment. Never actually used the BBmap suite. Might have to investigate it further.

2
Entering edit mode

BBSplit is excellent at binning reads to multiple genome references. BBTools are easy to use and only thing you need is Java v.1.7 and above. There is no installation needed. Download, uncompress and use. Many bbtools are multi-threaded, efficiently written and handle data sets of almost any size.

0
Entering edit mode

That approach works fine when the two organisms have zero homology, but that is rarely the case in practice - particularly when you consider low-complexity sequence (e.g. ATATATATATAT...) which might be present in anything. It is typically better to not remove contamination at all prior to assembly, and just try to do so afterward on the contigs, than to remove it overly-aggressively by discarding all reads that map to the reference of the unmasked suspected contaminant - at least, when working with relatively short (100-150bp) reads.

Of course, the approach becomes safer the more distantly related the organisms are, and the smaller they are; so if you're working with bacteria from different phyla, it's much less likely to cause gaps in your assembly compared to two eukaryotes.

0
Entering edit mode

What is the difference in GC content between cerevisiae and your fish organism? If it is clearly divided, you can also divide the scaffolds after assembly depending on the GC content.

0
Entering edit mode

Yes, this is the command I used for BBslit:

bbmap/bbsplit.sh minid=0.95 in1=ZFG-16-01_01_17137_CGATGT_L001_R1_001.fastq.gz in2=ZFG-16-01_01_17137_CGATGT_L001_R2_001.fastq.gz ref=Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa basename=out_filtered_ZFG-16-01_01_L001_%.fq outu1=filtered_ZFG-16-01_01_17137_CGATGT_L001_R1_001.fastq.gz outu2=filtered_ZFG-16-01_01_17137_CGATGT_L001_R2_001.fastq.gz


I did the same with 22 samples. Then, I merged the filtered reads using FLASH and I assembled the merged (57-62%) and the non-mergable reads using the CLC Assembly cell. This lead me to up to 764,000 contigs, which I blasted against Sacharomyces cerivisae genome and I got between 300 and 600 hits (lowest e-value 0 ) per saccharomyces chromosome (16).

1
Entering edit mode

Hmm. Even at 600 that is 0.08% of total that show a hit to S. cerevisiae. They could be genes that are evolutionarily conserved (or things that bbsplit may have let through since you did not explicitly set how ambiguous mappings were to be handled). What types of hits are you seeing among those 600?

bbmerge.sh is probably a better option for merging instead of FLASH. Any reason why you chose FLASH over that?

2
Entering edit mode
4.2 years ago

If you have low-quality reads, they can still assemble even though they do not map to the reference at 0.95 identity.

But to be clear, in this case you're not exactly using BBSplit as designed; you're supposed to use it with multiple references. If you only have one reference, use BBMap. Also, for decontamination, if you want a high threshold like minid=0.95, it's useful to use something like "qtrim=rl trimq=15 untrim". This will quality-trim reads prior to mapping, then undo the trimming after mapping. As a result, for low quality reads that are not 95% identity to your contaminant due to errors, those will still get mapped. But I think minid=0.95 is pretty high for organisms in different kingdoms; I usually use 0.9. In summary, I suggest this command:

bbmap.sh  ref=Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa maxindel=100 minid=0.90 in1=ZFG-16-01_01_17137_CGATGT_L001_R1_001.fastq.gz in2=ZFG-16-01_01_17137_CGATGT_L001_R2_001.fastq.gz outu1=filtered_ZFG-16-01_01_17137_CGATGT_L001_R1_001.fastq.gz outu2=filtered_ZFG-16-01_01_17137_CGATGT_L001_R2_001.fastq.gz  qtrim=rl trimq=15 untrim


I also recommend that you first mask the reference with BBMask to avoid hits to low-complexity areas.

0
Entering edit mode

Followed by bbmerge.sh --> CLC assembly or direct CLC assembly. Which would be better?

0
Entering edit mode

Ah - good question; I've never used CLC. With Spades, Ray, and Tadpole, assembling after running BBMerge yield a superior assembly, in my testing. With Soap Denovo and its successor Megahit, inferior. These results were repeatable across many libraries and organisms. Thus, I can only recommend that for CLC, the test is done both ways, since whether merging is beneficial varies by assembler. Note that in all cases you should assemble with both the merged and unmerged reads.

0
Entering edit mode

According to FastQC my reads are very good quality, but in any case, I'm going to try BBMap with stringent parameters (0.90) and let's see. I have checked other samples from the same fish species that are not supposed to be contaminated and I have more less the same Saccharomyces hits, so...they might be very conserved genes....Isn't weird that if those reads are corresponding to very conserved genes the only hit that BLAST gives is Saccharomyces cerevisiae? I mean, if its very conserved it should give me lots of organisms as hits, right? BLAST just gives me Sacharomyces cerevisiae, the name of the strain and the chromosome, it does not give the the gene name...im confused. But thanks! I will try BBMap

0
Entering edit mode

Be sure to take this post into account. If you are mapping RNA-seq data to a genome, you need more relaxed mapping criteria to allow for the introns. I'd recommend default minid and default maxindel.

0
Entering edit mode
4.2 years ago
vmicrobio ▴ 270

map your reads against your fish genome (bowtie2,...) then then keep aligned reads using samtools -F4 option. You can also re-map the reads obtains against Saccharomyces to see if there is homology

0
Entering edit mode

In this case there is likely no fish reference genome, hence the de novo transcriptome assembly.

If there is one then bbsplit with both fish and yeast genomes would be the way to go.