Question

Removing contamination from WGS nanopore reads prior to assembly

2.3 years ago

harte • 0

Hello, I sent isolated fungal DNA from a pure culture to a well-known sequencing company for nanopore (PromethION) and Illumina PE150 sequencing, with the goal of generating a hybrid de novo genome assembly using SPAdes. I assembled each dataset separately (Flye for nanopore, SPAdes for Illumina) to compare them before proceeding with the hybrid assembly. I checked the assemblies for contamination by building a BLAST database from each one and querying it with a fungal ITS sequence (the reference ITS barcode of my expected taxon) and a bacterial 16S sequence (the E. coli reference). The Illumina assembly returned only the fungal ITS sequence of my expected taxon and no matches to the E. coli 16S query.

When I used the same procedure on the Flye-assembled nanopore data, the E. coli 16S query returned a match. When I copied the matching subject sequence and then used BLAST against the NCBI database, it matched >99% against a portion of the Oryza sativa chloroplast reference genome assembly. I then copied large portions of that genome and again used BLAST to search my Flye assembly. Within the assembly there was a single large contig with fairly high coverage that appears to be the entire Oryza sativa chloroplast. The fungal ITS barcode from my expected taxon matched as well, so I know the data contain a mix of my target DNA and contaminating Oryza sativa DNA. I was also able to find a few more Oryza sativa contigs that were not chloroplast, so it appears that my assembly contains Oryza nuclear genomic DNA as well.

How can I remove the Oryza DNA from a concatenated FASTQ file of all the raw nanopore reads before proceeding with a hybrid assembly using SPAdes?
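
The contig screening described above could be scripted roughly as follows. This is only a sketch: it assumes the BLAST search was run with tabular output (for example `blastn -query queries.fa -db flye_assembly -outfmt 6`, where the file and database names are placeholders), and the identity and length cutoffs are illustrative values, not recommendations. The query and contig names in the demo are made up.

```python
import io
from collections import defaultdict

# Standard column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def contigs_hit(handle, min_pident=95.0, min_length=500):
    """Map each contig (BLAST subject) to the set of queries that hit it.

    min_pident and min_length are assumed, illustrative cutoffs.
    """
    hits = defaultdict(set)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, line.split("\t")))
        if float(rec["pident"]) >= min_pident and int(rec["length"]) >= min_length:
            hits[rec["sseqid"]].add(rec["qseqid"])
    return dict(hits)


if __name__ == "__main__":
    # Hypothetical hits: query names and contig names are placeholders.
    demo = io.StringIO(
        "ecoli_16S\tcontig_12\t99.2\t1480\t10\t2\t1\t1480\t5000\t6479\t0.0\t2650\n"
        "fungal_ITS\tcontig_3\t100.0\t600\t0\t0\t1\t600\t100\t699\t0.0\t1100\n"
    )
    for contig, queries in contigs_hit(demo).items():
        print(contig, sorted(queries))
```

Contigs that are hit only by the contaminant queries are the candidates to inspect further.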

assembly contamination nanopore de-novo • 1.5k views


score 1 · Answer 1 · 2023-06-18

2.3 years ago

GenoMax 154k

You can use minimap2 to identify sequences that are Oryza from your original reads (preferentially) or from the assemblies.

That said, if you are sure that there is contamination (and you are 100% certain that the sample you sent in for sequencing was not the source of it), then it would be reasonable to ask the sequencing provider for a resolution. It would help to be polite, provide evidence, and ask what they would be willing to do to address the problem.
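
As a rough illustration of the minimap2 approach on the raw reads, one option is to map the reads against an Oryza sativa reference and then drop every read with a confident alignment. The sketch below assumes the mapping was run along the lines of `minimap2 -x map-ont oryza_reference.fa reads.fastq > hits.paf` (file names are placeholders, and the reference would need to include the chloroplast and mitochondrial sequences as well as the nuclear genome). It then parses the PAF output and filters a 4-line-per-record FASTQ. The `min_mapq` and `min_query_cov` thresholds are assumptions chosen for illustration; they are meant to avoid discarding fungal reads that only match rice over short conserved regions.

```python
def contaminant_read_ids(paf_lines, min_mapq=20, min_query_cov=0.5):
    """Return IDs of reads with a confident alignment to the contaminant.

    PAF columns used: 1 = query name, 2 = query length, 3 = query start,
    4 = query end, 12 = mapping quality. Thresholds are illustrative.
    """
    ids = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a valid PAF record
        qname = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        mapq = int(fields[11])
        if qlen > 0 and mapq >= min_mapq and (qend - qstart) / qlen >= min_query_cov:
            ids.add(qname)
    return ids


def filter_fastq(fastq_lines, drop_ids):
    """Yield 4-line FASTQ records whose read ID is not in drop_ids.

    Assumes well-formed FASTQ with exactly four lines per record.
    """
    it = iter(fastq_lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        read_id = header[1:].split()[0]  # text after '@' up to first whitespace
        if read_id not in drop_ids:
            yield header, seq, plus, qual


if __name__ == "__main__":
    # Placeholder file names; adjust to the actual data.
    with open("hits.paf") as paf:
        drop = contaminant_read_ids(paf)
    with open("reads.fastq") as fq, open("reads.clean.fastq", "w") as out:
        for record in filter_fastq(fq, drop):
            out.writelines(record)
```

It is worth checking how many reads are removed and re-running the ITS and 16S BLAST check on the new assembly to confirm the contaminant is gone.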


Thanks for answering. We did reach out to the provider, and they have not been easy to work with on this issue. Given that the Illumina data from the same sample was free of contaminating DNA, it's pretty clear the library was contaminated during the nanopore run. In any case, this was a particularly precious sample and there is no DNA left to work with, so we are stuck with the data we have. I went ahead and followed your advice and used minimap2 to remove the contaminating reads, although the assembly is still proving challenging, most likely because our target organism has a difficult genome.

ADD REPLY • link 2.3 years ago by harte • 0