How would you discover the source of contaminant DNA?
8.0 years ago

I am trying to assemble the genome of a tree, and I have strong evidence that the DNA could be heavily contaminated with foreign DNA. It is not that the leaves used were externally contaminated with other organisms. I am talking about a millennial tree that could be supporting another form of life internally.

How can I discover the source of that DNA? Running BlastN on billions of sequences is not an option.

genome Illumina • 2.6k views

My first thought was to use FastQ Screen, but I guess that's not an option with billions of sequences. Could you cluster similar sequences into groups, then run FastQ Screen or blastn on a representative sequence from each group? You could prioritise this by running the search on groups in order of decreasing size. If you have a lot of contamination, wouldn't the first few groups you try come from another species?
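To make the clustering idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: reads are grouped by a crude signature (the lexicographically smallest k-mer on the forward strand), which is much weaker than what a dedicated clustering tool such as CD-HIT or VSEARCH does. The function names and the `k` and `top` parameters are hypothetical.

```python
from collections import defaultdict


def signature(seq, k=8):
    """Return the lexicographically smallest k-mer of a read (a crude cluster key)."""
    seq = seq.upper()
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def cluster_reads(reads, k=8):
    """Group reads that share a signature; return the groups, largest first."""
    groups = defaultdict(list)
    for read in reads:
        groups[signature(read, k)].append(read)
    return sorted(groups.values(), key=len, reverse=True)


def representatives(reads, k=8, top=10):
    """Return (group size, longest read) for the `top` largest groups."""
    return [(len(group), max(group, key=len))
            for group in cluster_reads(reads, k)[:top]]


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAAAA", "ACGTACGTAAAAT", "GGGGCCCCGGGG"]
    for size, rep in representatives(toy_reads):
        print(size, rep)
```

The representatives could then be searched with blastn or FastQ Screen, starting from the largest groups. On real data you would probably run this on a random subsample of the reads rather than on all of them.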


I see that FastQ Screen is somewhat similar to bbsplit.sh from the BBMap tool set.

I have already used some genomes with bbsplit, and now I know that only 0.0002% of the reads are from Verticillium (as an example).

But this approach means that I have to download the genome(s) and rely on serendipity and luck to find the contaminating genome.

I am looking for an approach similar to a classic metagenomic study, in which you provide the sequences and the program and/or service finds the source of contamination for you.

8.0 years ago

Blobology (https://github.com/blaxterlab/blobology) is a great tool for discovering contaminants.


I'll give it a try..


From what I understand based on the description, blobology needs the reads to be assembled. This biases it greatly towards the genome size of the contaminant. If the contaminant has a small genome, it is more likely to be assembled. If the contaminant has a large genome, it will not get assembled, so those reads will be filtered out in the analysis.

8.0 years ago
igor 13k

I use Kraken for this type of task. It's similar to BLAST in many ways, but much faster. You can use the NCBI BLAST databases as references. You'll have to re-index them, but no weird data manipulation is needed.

8.0 years ago
Asaf 10k

I was struggling with the same question myself and eventually built a simple classifier based on the Pfam domains that I find in the assembled scaffolds. I used the species assignment of Pfam domains (which can be downloaded from their website). After assembly, I predicted ORFs and searched them for domains using InterProScan. I then computed the log probability that each sequence is bacterial.
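A scoring step of this kind could look roughly like the sketch below. This is not the actual classifier described above, only a minimal naive-Bayes-style illustration: it sums per-domain log-odds (bacterial versus everything else) instead of a single log probability, and it assumes you already have two tables of how often each Pfam accession is seen in bacterial and in non-bacterial proteins. The function name, the count tables and the pseudocount are all hypothetical.

```python
import math


def log_odds_bacterial(domains, bact_counts, other_counts, pseudocount=1.0):
    """Sum over domains of log P(domain | bacterial) - log P(domain | other).

    domains      : Pfam accessions found on one scaffold
    bact_counts  : {accession: occurrences in bacterial proteins}
    other_counts : {accession: occurrences in non-bacterial proteins}
    A positive score suggests a bacterial origin, a negative one the opposite.
    """
    vocab = set(bact_counts) | set(other_counts) | set(domains)
    # Additive smoothing so that unseen domains do not give log(0).
    bact_total = sum(bact_counts.values()) + pseudocount * len(vocab)
    other_total = sum(other_counts.values()) + pseudocount * len(vocab)
    score = 0.0
    for domain in domains:
        p_bact = (bact_counts.get(domain, 0) + pseudocount) / bact_total
        p_other = (other_counts.get(domain, 0) + pseudocount) / other_total
        score += math.log(p_bact) - math.log(p_other)
    return score


if __name__ == "__main__":
    bact = {"PF00001": 90, "PF00002": 10}   # made-up counts for illustration
    other = {"PF00001": 10, "PF00002": 90}  # made-up counts for illustration
    print(log_odds_bacterial(["PF00001", "PF00001"], bact, other))
```

Scaffolds with no detected domains get a score of 0, so they stay undecided rather than being pushed to either side.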


Sounds pretty interesting..


This biases the analysis towards the genome size of the contaminant. If the contaminant has a small genome, it is more likely to be assembled. If the contaminant has a large genome, it will not generate any scaffolds, so those reads will be filtered out in the analysis.


The problem is when you have a contaminant which is assembled.


The bigger problem is when nothing assembles and you don't know why.

8.0 years ago

The two approaches are quite different. Blobology needs your reads to be assembled into a genome or pseudo-genome. Then it maps the reads back to that assembly and analyzes the SAM/BAM files. It includes a bash script that many novices will find interesting to look at, because it contains a whole pipeline to do that.
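As far as I understand, the per-contig statistics that this kind of analysis plots are GC content and read coverage, with contigs coloured by taxonomic hit. A minimal Python sketch of that summary table follows. It is not Blobology's code: the function names are hypothetical, and `mapped_bases` is assumed to be a dictionary of total aligned bases per contig that you have already extracted from the BAM file.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def contig_table(contigs, mapped_bases):
    """Return rows of (name, length, GC fraction, mean coverage) per contig.

    contigs      : {contig name: sequence}
    mapped_bases : {contig name: total read bases aligned to that contig}
    """
    rows = []
    for name, seq in contigs.items():
        coverage = mapped_bases.get(name, 0) / len(seq) if seq else 0.0
        rows.append((name, len(seq), gc_fraction(seq), coverage))
    return rows


if __name__ == "__main__":
    toy_contigs = {"contig_1": "ATGCATGC", "contig_2": "GGGGCCCC"}
    toy_mapped = {"contig_1": 80, "contig_2": 800}
    for row in contig_table(toy_contigs, toy_mapped):
        print(row)
```

Contigs from a contaminant often form a separate cloud in GC versus coverage space, which is why these two numbers are worth computing even before any taxonomic search.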

Kraken doesn't need to do that. It simply compares k-mers from your reads to the k-mers in a taxonomically classified k-mer database built from public databases, which makes it a BLAST-like program that is much faster than BLAST.

I will definitely give both of them a try. Thank you both for your answers.
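The k-mer lookup idea can be illustrated with a toy Python sketch. This is not Kraken's algorithm: Kraken assigns each k-mer to the lowest common ancestor of the genomes containing it and scores paths in the taxonomy tree, whereas the sketch below simply drops k-mers shared between references and takes a majority vote. The function names, the reference sequences and the value of `k` are made up for illustration.

```python
from collections import Counter

AMBIGUOUS = "ambiguous"


def build_index(references, k):
    """Map every k-mer to the single reference label that contains it.

    references : {label: sequence}
    k-mers found in more than one reference are marked as ambiguous.
    """
    index = {}
    for label, seq in references.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index and index[kmer] != label:
                index[kmer] = AMBIGUOUS
            else:
                index[kmer] = label
    return index


def classify(read, index, k):
    """Return the label with the most matching k-mers, or 'unclassified'."""
    read = read.upper()
    votes = Counter()
    for i in range(len(read) - k + 1):
        label = index.get(read[i:i + k])
        if label is not None and label != AMBIGUOUS:
            votes[label] += 1
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    refs = {"tree": "ACGTACGTACGTTTGACC", "fungus": "GGGCCCGGGAAATTTCCC"}
    idx = build_index(refs, k=5)
    for read in ["ACGTACGTAC", "CCCGGGAAAT", "NNNNNNNNNN"]:
        print(read, classify(read, idx, k=5))
```

Because each read only needs dictionary lookups instead of alignments, this style of classification scales to very large read sets, which is the property that matters for the original question.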
