Question

How would you discover the source of contaminant DNA?

3

8.0 years ago

Antonio R. Franco ★ 5.1k

I am trying to assemble the genome of a tree, and I have strong evidence that the DNA could be heavily contaminated with foreign DNA. It is not that the leaves used were externally contaminated with other organisms. I am talking about a millennial tree that could be supporting another form of life internally.

How can I discover the source of that DNA? Running BlastN on billions of sequences is not an option.

genome Illumina • 2.6k views

ADD COMMENT • link updated 8.0 years ago by Asaf 10k • written 8.0 years ago by Antonio R. Franco ★ 5.1k

0

My first thought was to use FastQ Screen, but I guess that's not an option with billions of sequences. Could you cluster similar sequences into groups, then run FastQ Screen or blastn on a representative sequence from each group? You could prioritise the search by working through the groups in order of decreasing size. If you have a lot of contamination, the first few groups you try would presumably come from the other species?
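A minimal sketch of that idea, assuming a toy greedy clustering on shared k-mers (Jaccard similarity against each cluster's first read). The function names, the k-mer size and the similarity threshold are all illustrative assumptions, and this quadratic loop would not scale to billions of reads; it only shows the "cluster, pick a representative, search largest groups first" ordering.

```python
def kmers(seq, k=5):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(reads, k=5, min_jaccard=0.5):
    """Assign each read to the first cluster whose representative it resembles.

    Hypothetical toy method: similarity is the Jaccard index between the
    k-mer set of the read and that of the cluster's first member.
    """
    clusters = []  # list of (representative k-mer set, member reads)
    for read in reads:
        read_kmers = kmers(read, k)
        for rep_kmers, members in clusters:
            union = len(read_kmers | rep_kmers)
            if union and len(read_kmers & rep_kmers) / union >= min_jaccard:
                members.append(read)
                break
        else:
            # No similar cluster found: this read starts a new one.
            clusters.append((read_kmers, [read]))
    # Largest clusters first, so they are searched first.
    return sorted((members for _, members in clusters), key=len, reverse=True)


def representatives(reads, k=5, min_jaccard=0.5):
    """Return (representative read, cluster size) pairs, largest cluster first."""
    return [(members[0], len(members))
            for members in greedy_cluster(reads, k, min_jaccard)]
```

The representatives at the top of the returned list would be the first candidates to submit to FastQ Screen or blastn.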

ADD REPLY • link 8.0 years ago by James Ashmore ★ 3.4k

0

I see that FastQ Screen is somewhat similar to bbsplit.sh from the BBMap tool set.

I have already tried some genomes with bbsplit, and now I know that only 0.0002% of the reads are from Verticillium (as an example).

But this approach means that I have to download the genome(s) and rely on luck to pick the genome of the actual contaminant.

I am looking for an approach similar to a classic metagenomic study, in which you supply the sequences and the program and/or service finds the source of contamination for you.

ADD REPLY • link 8.0 years ago by Antonio R. Franco ★ 5.1k

score 2 · Answer 1 · 2016-04-23

2

8.0 years ago

Damian Kao 16k

Blobology (https://github.com/blaxterlab/blobology) is a great tool for discovering contaminants.

ADD COMMENT • link 8.0 years ago by Damian Kao 16k

0

I'll give it a try.

ADD REPLY • link 8.0 years ago by Antonio R. Franco ★ 5.1k

0

From what I understand based on the description, blobology needs the reads to be assembled. This biases it greatly towards the genome size of the contaminant. If the contaminant has a small genome, it is more likely to be assembled. If the contaminant has a large genome, it will not get assembled, so those reads will be filtered out in the analysis.

ADD REPLY • link 8.0 years ago by igor 13k

score 1 · Answer 2 · 2016-04-23

1

8.0 years ago

igor 13k

I use Kraken for this type of task. It's similar to BLAST in many ways, but much faster. You can use the NCBI BLAST databases as references. You'll have to re-index them, but no weird data manipulation is needed.

ADD COMMENT • link 8.0 years ago by igor 13k

score 1 · Answer 3 · 2016-04-24

1

8.0 years ago

Asaf 10k

I was struggling with the same question myself and eventually built a simple classifier based on the Pfam domains found in the assembled scaffolds. I used the species assignment of Pfam domains (which can be downloaded from their website). After assembly, I predicted ORFs and searched them for domains using interproscan, then computed the log probability that each sequence is bacterial.
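The scoring step could look roughly like the sketch below. It is not the classifier described above, only one plausible reading of it: a naive-Bayes-style log-odds score that treats domains as independent. The function name, the pseudocount and the count tables are assumptions made for illustration; real counts would come from the Pfam species assignments.

```python
import math


def bacterial_log_odds(domains, bacterial_counts, other_counts, pseudocount=1.0):
    """Sum per-domain log-odds of a bacterial versus non-bacterial origin.

    domains          -- Pfam domain IDs found on one scaffold
    bacterial_counts -- how often each domain is seen in bacterial species
    other_counts     -- how often each domain is seen in all other species
    A positive score favours a bacterial origin, a negative one does not.
    Assumes domains are independent, which is a simplification.
    """
    score = 0.0
    for domain in domains:
        # The pseudocount avoids log(0) for domains absent from one table.
        bacterial = bacterial_counts.get(domain, 0) + pseudocount
        other = other_counts.get(domain, 0) + pseudocount
        score += math.log(bacterial / other)
    return score
```

A scaffold carrying mostly bacteria-specific domains gets a strongly positive score, while one with no recognised domains scores 0 and stays undecided.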

ADD COMMENT • link 8.0 years ago by Asaf 10k

0

Sounds pretty interesting.

ADD REPLY • link 8.0 years ago by Antonio R. Franco ★ 5.1k

0

This biases the analysis towards the genome size of the contaminant. If the contaminant has a small genome, it is more likely to be assembled. If the contaminant has a large genome, it will not generate any scaffolds, so those reads will be filtered out in the analysis.

ADD REPLY • link 8.0 years ago by igor 13k

0

The problem is when you have a contaminant which is assembled.

ADD REPLY • link 8.0 years ago by Asaf 10k

0

The bigger problem is when nothing assembles and you don't know why.

ADD REPLY • link 8.0 years ago by igor 13k

score 0 · Answer 4 · 2016-04-24

The two approaches are quite different. Blobology needs your reads to be assembled into a genome or pseudo-genome. It then maps the reads back to that assembly and analyzes the SAM/BAM files. It includes a bash script that many novices will find worth reading, because it contains a whole pipeline for doing that.

Kraken doesn't need to do that. It simply compares k-mers from your reads to the k-mers in a taxonomically classified k-mer database that has been built from public databases, working like a BLAST-like program that is much faster than BLAST.
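To make the k-mer comparison concrete, here is a toy sketch. It is not Kraken's actual algorithm: it uses a plain dictionary index and a majority vote over k-mers unique to one reference. The function names, the reference labels and k=4 are assumptions chosen for illustration.

```python
from collections import Counter


def build_kmer_index(references, k=4):
    """Map every k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(label)
    return index


def classify_read(read, index, k=4):
    """Label a read by majority vote over its k-mers.

    Only k-mers found in exactly one reference vote; shared or unknown
    k-mers are ignored. Returns "unclassified" when nothing votes.
    """
    votes = Counter()
    for i in range(len(read) - k + 1):
        labels = index.get(read[i:i + k])
        if labels and len(labels) == 1:
            votes.update(labels)
    if not votes:
        return "unclassified"
    return votes.most_common(1)[0][0]
```

Because each read only needs dictionary lookups rather than alignments, this kind of approach stays fast even when the number of reads is very large.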

I will definitely give both of them a try. Thank you both for your answers.