Question

Identifying my WGS isolates

7.6 years ago

jeancampbell212121 • 0

Hi

Assembly rookie here. I have been BLASTing some of my contigs from a de novo assembly I did, and the results are coming back with very poor query coverage. Since this is an environmental isolate, I guess it's possible that there aren't many whole genome records. I tested contigs of varying sizes (largest 338 075 nt with total read count of 35 131 and coverage of 26; down to smaller contigs in the 1500 nt range with total read count of 1647). What I want to know is what is the chance that my trouble with identification is due to a) contamination in the DNA; or b) that my assembly is so shitty BLAST can't make a match?

Thanks in advance

next-gen blast Assembly • 1.4k views

ADD COMMENT • link updated 7.6 years ago by Tonor ▴ 480 • written 7.6 years ago by jeancampbell212121 • 0

Have you taken a few of the original reads and blasted them to make sure they show reasonable/expected hits?
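A rough sketch of one way to pull a small random subset of reads out of a FASTQ file and turn them into FASTA for pasting into BLAST. The function names (`read_fastq`, `sample_reads`, `to_fasta`) are made up for this example, and it assumes plain, uncompressed 4-line FASTQ records; it is not meant as the only or best way to do it.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read id, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (read_id, sequence) from 4-line FASTQ records (assumed layout)."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between or after records
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line, ignored
        next(it, "")  # quality line, ignored
        yield header[1:], seq  # drop the leading '@'


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir-sample k records so the whole file never sits in memory."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records)


# Tiny invented example so the sketch runs without any input file.
demo = "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n"
print(to_fasta(sample_reads(read_fastq(demo.splitlines()), k=2)))
```

With real data you would pass an open file handle instead of `demo.splitlines()`, pick something like 10 to 20 reads, and check whether their top hits agree with what the contigs show.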

ADD REPLY • link 7.6 years ago by GenoMax 141k

When you have nice, long contigs that don't BLAST to anything, it's likely that it's a novel organism (at least, distant compared to anything in your BLAST database). Metagenomes usually have low coverage, and assemblers don't tend to output random junk as long contigs. Contamination is more likely to give you good BLAST hits than the metagenomic target, since the same contamination tends to be seen in lots of labs. Thus, contamination is not a very likely explanation for this outcome.

Oops, I misread and thought this was a metagenome rather than an isolate. If the species you expect is in your BLAST database and your contigs still don't match it, then what you sequenced is not the species you expect.

ADD REPLY • link 7.6 years ago by Brian Bushnell 20k

It depends on how you are defining contamination. If you think you are working with species A, but your assembly gives you species B, that could be considered contamination even if it's a great assembly. It's not clear if that's the case here.

ADD REPLY • link 7.6 years ago by igor 13k

score 0 · Answer 1 · 2016-09-26

You can try running all your raw reads through a metagenomic classifier. There are a few of those out there, such as:

Kraken: https://ccb.jhu.edu/software/kraken/
GOTTCHA: http://lanl-bioinformatics.github.io/GOTTCHA/
CLARK: http://clark.cs.ucr.edu/
MetaPhlAn: http://huttenhower.sph.harvard.edu/metaphlan

These will quickly classify a large number of reads and show you whether you have contamination. For an isolate, ideally most of the reads will be assigned to the same or closely related taxa.
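To make "most of them from a similar source" concrete, here is a minimal sketch that summarizes a Kraken-style report. It assumes the six-column, tab-separated report layout (percent of reads in clade, reads in clade, reads assigned directly, rank code, taxid, indented name); other classifiers write different formats, so check your tool's documentation. The report text and read counts below are invented purely for illustration, and `species_rows` and `dominant_share` are hypothetical helper names.

```python
from typing import List, Tuple


def species_rows(report_text: str) -> List[Tuple[str, int]]:
    """Return (species name, reads in clade) for rank code 'S', largest first."""
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip anything that is not a full report row
        _pct, clade_reads, _direct, rank, _taxid, name = fields[:6]
        if rank == "S":
            rows.append((name.strip(), int(clade_reads)))
    return sorted(rows, key=lambda r: r[1], reverse=True)


def dominant_share(rows: List[Tuple[str, int]]) -> float:
    """Fraction of species-level reads that belong to the top species."""
    total = sum(n for _, n in rows)
    return rows[0][1] / total if total else 0.0


# Invented example report, NOT real data.
example = "\n".join([
    " 12.00\t1200\t1200\tU\t0\tunclassified",
    " 88.00\t8800\t300\t-\t1\troot",
    " 80.00\t8000\t8000\tS\t303\t      Pseudomonas putida",
    "  5.00\t500\t500\tS\t562\t      Escherichia coli",
])

rows = species_rows(example)
for name, n in rows:
    print(f"{name}\t{n}")
print(f"top species share of species-level reads: {dominant_share(rows):.2f}")
```

A clean isolate would be expected to show one dominant species (or genus). A sizeable second species points toward contamination, while a large unclassified fraction fits better with an organism that is simply not in the database.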

score 0 · Answer 2 · 2016-09-27

I would also try DIAMOND: https://github.com/bbuchfink/diamond

In blastx mode it translates your DNA contigs/reads and searches them against a protein database, rather than searching a nucleotide database as blastn does. As Brian said, when you have large, good-quality contigs that don't hit anything, it tends to imply a novel organism. Protein space is the better choice in that case: protein sequences are more conserved between related species than nucleotide sequences, so even though blastn may give you no significant hits, blastx could well give you some hits to related organisms.
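A small sketch of why the translated search helps. It uses the standard genetic code and two invented 9-nt sequences that differ at third codon positions, and shows that they translate to the same peptide. This only illustrates the idea behind a blastx-style search (translate the query in all six frames, then compare proteins); it is not how DIAMOND is implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> list:
    """Three forward and three reverse-strand translations."""
    rc = reverse_complement(dna)
    return [translate(s[offset:]) for s in (dna, rc) for offset in range(3)]


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented toy sequences: same peptide, different third codon positions.
seq_a = "ATGGCTAAA"
seq_b = "ATGGCCAAG"
print("nucleotide identity:", round(identity(seq_a, seq_b), 2))
print("protein identity:   ", identity(translate(seq_a), translate(seq_b)))
print("six frames of seq_a:", six_frames(seq_a))
```

Over a whole genome from a distant relative, this kind of silent divergence adds up, which is why a nucleotide search can come back empty while a translated search still finds related proteins.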