Assessing contaminant sequences in long genomic contigs
2.5 years ago

We have a first draft genome assembled from a 1.2 Gb organism using PacBio, comprising over 4,000 long contigs.

Now we want to assess whether this draft genome is contaminated with sequences of fungal or any other origin.

Most metagenomic analyzers work on short reads, or on metagenomic contigs that are much shorter than those produced when assembling a genome.

Are you aware of any tools that would allow such an analysis?

genomic PacBio sequences contaminants
2.5 years ago
dthorbur ★ 2.5k

I've been looking into similar tools recently, mostly for ONT reads, but they should be similar if not easier to use with PacBio given its lower error rate.

Kraken2 seems to work well. Coupling the high error rate of ONT reads with a k-mer approach sounds like a recipe for problems to me, but the literature suggests it is feasible: Benchmarking the MinION: Evaluating long reads for microbial profiling.

MetaMaps is another tool that seems to achieve similar if not slightly higher accuracy at assigning taxonomic identifications to long reads: Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
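MetaMaps uses a two-step map/classify workflow. A minimal sketch of that workflow, assuming the miniSeq+H database layout from the MetaMaps documentation and placeholder input file names, might look like:

```shell
# Skip gracefully if MetaMaps is not on the PATH.
command -v metamaps >/dev/null 2>&1 || { echo "MetaMaps not installed; skipping"; exit 0; }

# Step 1: map the long reads against the reference database.
metamaps mapDirectly --all \
         -r databases/miniSeq+H/DB.fa \
         -q long_reads.fastq \
         -o classification_results

# Step 2: turn the mappings into taxonomic assignments.
metamaps classify --mappings classification_results \
         --DB databases/miniSeq+H
```

The database path and FASTQ name above are placeholders; substitute whichever MetaMaps database you have built or downloaded.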

FastQ Screen is a quick QC step that checks the taxonomic assignment of reads early in pipelines: FastQ Screen: A tool for multi-genome mapping and quality control
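As a sketch of how FastQ Screen slots into a pipeline (the config file name, output directory, and read file are placeholders; the config lists the genomes, e.g. host, fungal, and vector, to screen against):

```shell
# Skip gracefully if FastQ Screen is not installed.
command -v fastq_screen >/dev/null 2>&1 || { echo "FastQ Screen not installed; skipping"; exit 0; }

# Screen a read file against every genome listed in the config,
# writing per-genome hit percentages to the output directory.
fastq_screen --conf fastq_screen.conf \
             --aligner bowtie2 \
             --outdir screen_results \
             sample_reads.fq.gz
```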


There is a subtle consideration in my question: can you use these tools with very long contigs assembled from PacBio or ONT reads? I could use the reads, but the amount of data is huge, so I am considering the possibility of using the assembled FASTA files instead, whose sequences are vastly longer than the reads themselves.


Ah, I glossed over that aspect of the question. In that case, I would suggest Kraken2 might be the best fit given its k-mer-based approach, which in theory should increase the confidence of calls as sequence length increases (even with high error rates, you should get a few exact matches between the sequence and the correct mapping locations). If you've already polished the assembly with Illumina reads, the error rate should be reduced anyway. However, I am unsure what you'll be able to do with the report, or what the next steps would be if you did find contamination.
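Running Kraken2 on contigs looks the same as running it on reads, since the input is just FASTA/FASTQ. A minimal sketch, with placeholder database and file names:

```shell
# Skip gracefully if Kraken2 is not installed.
command -v kraken2 >/dev/null 2>&1 || { echo "kraken2 not installed; skipping"; exit 0; }

# A database must exist first, e.g. the standard one (large download):
#   kraken2-build --standard --db kraken2_db

# Classify assembled contigs; the report summarizes taxa per clade,
# the output file gives one classification line per contig.
kraken2 --db kraken2_db \
        --threads 16 \
        --use-names \
        --report contigs.k2report \
        --output contigs.k2out \
        contigs.fasta
```

Contigs assigned to fungal or bacterial taxa in the report would be the candidates to inspect further.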

Whilst I like the concept of MetaMaps more, it seems to take the frequency distribution of reads into account to inform posteriors about sample composition and the location of hits, so I think a draft genome would violate some of its assumptions. But I've only skimmed the paper, so I may have misinterpreted this.


You have certainly focused my interest on MetaMaps. I am reading the paper now.


Do you have a related genome available that you could use first to pare down the list of potentially "contaminated" contigs? Aligning with minimap2 would likely be one option. More detailed analysis could then be done with a local aligner like BLAST to identify HSPs.
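A minimal sketch of that first pass, assuming placeholder file names for the related genome and your contigs:

```shell
# Skip gracefully if minimap2 is not installed.
command -v minimap2 >/dev/null 2>&1 || { echo "minimap2 not installed; skipping"; exit 0; }

# Assembly-to-assembly alignment; the asm20 preset tolerates
# up to roughly 20% sequence divergence between assemblies.
minimap2 -x asm20 -t 16 related_genome.fa contigs.fasta > contigs_vs_related.paf

# Column 1 of the PAF is the query (contig) name; contigs absent
# from this list never aligned and are candidates for BLAST inspection.
cut -f1 contigs_vs_related.paf | sort -u > aligned_contig_names.txt
```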

But at some point you will want to examine the original reads by aligning them back to the assembly you have.
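Mapping the reads back could be sketched like this (file names are placeholders; swap `map-pb` for `map-hifi` if the data are HiFi):

```shell
# Skip gracefully if the aligner or samtools is missing.
command -v minimap2 >/dev/null 2>&1 || { echo "minimap2 not installed; skipping"; exit 0; }
command -v samtools >/dev/null 2>&1 || { echo "samtools not installed; skipping"; exit 0; }

# Align PacBio reads back to the assembly and sort the result.
minimap2 -ax map-pb -t 16 contigs.fasta pacbio_reads.fq.gz \
  | samtools sort -@ 4 -o reads_vs_assembly.bam
samtools index reads_vs_assembly.bam

# Unusual per-contig depth can corroborate a contamination call,
# since contaminant contigs often show atypical coverage.
samtools coverage reads_vs_assembly.bam | head
```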


I am aware of minimap2 and its recent evolution (a very active project indeed). But I would rather use a true, comprehensive metagenomic analysis, since I have no idea in advance what the contaminant could be.

After looking for information, I am convinced that most metagenomic tools are designed mainly for short reads. Only very recently has a new set of tools appeared for long reads such as those coming from PacBio and ONT.

But I am certainly interested in knowing whether things have gone beyond these two alternatives. I think it would be interesting to assess whether a whole assembled genome is contaminated or not. I know that this assessment should ideally be done before assembly, and that tools such as BBMap or other filtering tools can be used to clean and separate the reads. In addition, I have the feeling that analyzing an entire genome will require fewer computing resources, since you get rid of repetitive and redundant reads (i.e. you end up with 1.5 Gb of data after assembling 40, 50 or more Gb of reads).
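For the pre-assembly cleaning step mentioned above, BBDuk from the BBMap suite is one common choice. A minimal sketch, assuming you had a candidate contaminant reference (all file names are placeholders):

```shell
# Skip gracefully if the BBMap suite is not installed.
command -v bbduk.sh >/dev/null 2>&1 || { echo "BBMap suite not installed; skipping"; exit 0; }

# Separate reads sharing 31-mers with a known contaminant reference;
# matched reads go to outm for inspection, clean reads to out.
bbduk.sh in=reads.fq.gz \
         out=clean_reads.fq.gz \
         outm=contaminant_reads.fq.gz \
         ref=fungal_contaminant.fa \
         k=31
```

Of course, this presumes you already know what the contaminant might be, which is exactly the limitation being discussed.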

But after mapping some Illumina reads to a presumably mature genome, I found convincing evidence that this published genome was contaminated not only with adapter sequences poorly filtered before assembly, but also with what appears to be a biotrophic fungal population.

Hence my interest in knowing whether you can assess the metagenomic population of an already assembled genome: in many cases you either don't have access to the original reads, or you want to save on computing resources.


"Hence my interest in knowing whether you can assess the metagenomic population of an already assembled genome: in many cases you either don't have access to the original reads, or you want to save on computing resources."

That is going to be tricky at best, as you probably realize :-) It sounds like you actually want to check a published genome, not the one you assembled.


I was referring to somebody else's genome. I have access to my own reads, of course, lol. To be clear: I was working with the Olea (olive) genome, which I did not assemble myself and for which I have no access to the original reads. I found evidence that it was contaminated.


I was referring to the case of another genome as well. Metagenomic assemblies are an "entity" one generates (which may not necessarily reflect the biology). We may never know the actual truth, since we don't know what was there to begin with.
