Question: Identifying my WGS isolates
21 months ago, jeancampbell212121 wrote:

Hi

Assembly rookie here. I have been BLASTing some contigs from a de novo assembly I did, and the results are coming back with very poor query coverage. Since this is an environmental isolate, I guess it's possible that there aren't many whole-genome records for it. I tested contigs of varying sizes, from the largest (338,075 nt, total read count of 35,131, coverage of 26) down to smaller contigs in the 1,500 nt range (total read count of 1,647). What I want to know is: how likely is it that my trouble with identification is due to a) contamination in the DNA, or b) my assembly being so shitty that BLAST can't make a match?

Thanks in advance

blast next-gen assembly

Have you taken a few of the original reads and blasted them to make sure they show reasonable/expected hits?
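One way to pull "a few of the original reads" is to draw a small random sample from the FASTQ file and write it out as FASTA, which can then be pasted into BLAST. The sketch below is only an illustration: it assumes plain 4-line FASTQ records, and the function names and the file name in the usage comment are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs, assuming 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        next(it, "")  # quality line (not needed for BLAST)
        yield header.strip()[1:].split()[0], seq


def sample_reads(records: Iterable[Tuple[str, str]], k: int,
                 seed: int = 0) -> List[Tuple[str, str]]:
    """Reservoir-sample k records without loading the whole file."""
    rng = random.Random(seed)
    sample: List[Tuple[str, str]] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (read_id, sequence) pairs as FASTA text."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records)


# Usage (hypothetical file name):
# with open("reads_R1.fastq") as fh:
#     print(to_fasta(sample_reads(parse_fastq(fh), k=20)))
```

If the sampled reads hit the expected organism (or close relatives) while the contigs do not, that points at the assembly; if the reads also hit nothing, the organism itself is probably poorly represented in the database.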

Reply written 21 months ago by genomax

When you have nice, long contigs that don't BLAST to anything, it's likely that it's a novel organism (at least, distant compared to anything in your BLAST database). Metagenomes usually have low coverage, and assemblers don't tend to output random junk as long contigs. Contamination is more likely to give you good BLAST hits than the metagenomic target, since the same contamination tends to be seen in lots of labs. Thus, contamination is not a very likely explanation for this outcome.

Oops, I misread and thought this was a metagenome rather than an isolate. If the species you expect is in your BLAST database and your long contigs still don't hit it, then your isolate is not the species you expect.
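To put a number on "contigs that don't BLAST to anything", query coverage per contig can be computed from tabular BLAST output by merging overlapping hit intervals. This is a minimal sketch, assuming blastn was run with `-outfmt "6 qseqid sseqid pident length qstart qend qlen"`; that column order is an assumption of the example, not something stated in the thread, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def query_coverage(tabular_lines: Iterable[str]) -> Dict[str, float]:
    """Return {qseqid: fraction of query bases covered by at least one hit}.

    Assumed columns: qseqid sseqid pident length qstart qend qlen
    """
    intervals: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    qlens: Dict[str, int] = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid = fields[0]
        qstart, qend, qlen = int(fields[4]), int(fields[5]), int(fields[6])
        intervals[qseqid].append((min(qstart, qend), max(qstart, qend)))
        qlens[qseqid] = qlen

    coverage: Dict[str, float] = {}
    for qseqid, spans in intervals.items():
        spans.sort()
        covered = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:      # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                         # gap: close the current interval
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        coverage[qseqid] = covered / qlens[qseqid]
    return coverage
```

Contigs with no hits at all do not appear in the tabular output, so they are absent from the returned dictionary and should be counted as zero coverage.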

Reply written 21 months ago by Brian Bushnell

It depends on how you are defining contamination. If you think you are working with species A, but your assembly gives you species B, that could be considered contamination even if it's a great assembly. It's not clear if that's the case here.

Reply written 21 months ago by igor
21 months ago, igor (United States) wrote:

You can try running all your raw reads through a metagenomic classifier. There are a few of those out there, such as:

These will quickly classify a lot of reads and show you if you have contamination. Ideally, most of the reads will be assigned to the same or a closely related source.

21 months ago, Tonor (UK) wrote:

I would also try DIAMOND: https://github.com/bbuchfink/diamond

In blastx mode it translates your DNA contigs/reads and searches them against a protein database, rather than searching nucleotide against nucleotide as blastn does. As Brian said, when you have large, good-quality contigs that don't hit anything, it tends to imply a novel organism. Protein space is better to use in this case, as protein sequences are more conserved between related species than nucleotide sequences. So even though blastn may give you no significant hits, blastx could well give you some hits to related organisms.
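A translated search like this could be scripted roughly as follows. The sketch only builds the `diamond makedb` and `diamond blastx` command lines and leaves running them to the caller. The file names, the e-value cutoff, and the thread count are placeholder assumptions, and the protein FASTA has to be supplied by you (for example, a download of a protein reference set).

```python
import shlex
import subprocess
from typing import List


def diamond_makedb_cmd(protein_fasta: str, db_prefix: str) -> List[str]:
    """Command to build a DIAMOND database from a protein FASTA file."""
    return ["diamond", "makedb", "--in", protein_fasta, "--db", db_prefix]


def diamond_blastx_cmd(db_prefix: str, query_fasta: str, out_tsv: str,
                       evalue: float = 1e-5, threads: int = 4) -> List[str]:
    """Command to search DNA contigs/reads against the protein database.

    --outfmt 6 requests BLAST-style tabular output.
    """
    return [
        "diamond", "blastx",
        "--db", db_prefix,
        "--query", query_fasta,
        "--out", out_tsv,
        "--outfmt", "6",
        "--evalue", str(evalue),
        "--threads", str(threads),
    ]


def as_shell(cmd: List[str]) -> str:
    """Render a command list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


def run(cmd: List[str]) -> None:
    """Run a command, raising if DIAMOND exits with an error."""
    subprocess.run(cmd, check=True)


# Usage (placeholder file names; requires DIAMOND on PATH):
# run(diamond_makedb_cmd("proteins.faa", "ref_proteins"))
# run(diamond_blastx_cmd("ref_proteins", "contigs.fasta", "contigs_vs_ref.tsv"))
```

Building the commands as lists rather than one shell string avoids quoting problems if a path contains spaces.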
