Question: Identifying my WGS isolates
14 months ago, jeancampbell212121 wrote:

Hi

Assembly rookie here. I have been BLASTing some of my contigs from a de novo assembly I did, and the results are coming back with very poor query coverage. Since this is an environmental isolate, I guess it's possible that there aren't many whole genome records. I tested contigs of varying sizes (largest 338 075 nt with total read count of 35 131 and coverage of 26; down to smaller contigs in the 1500 nt range with total read count of 1647). What I want to know is what is the chance that my trouble with identification is due to a) contamination in the DNA; or b) that my assembly is so shitty BLAST can't make a match?

Thanks in advance

blast next-gen assembly • 386 views
modified 13 months ago by Tonor • written 14 months ago by jeancampbell212121

Have you taken a few of the original reads and blasted them to make sure they show reasonable/expected hits?
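A minimal sketch of that check, assuming plain (uncompressed) four-line FASTQ records: it draws a small random sample of reads and formats them as FASTA so they can be pasted into BLAST. The function names and the file name in the usage comment are placeholders, not anything from the original post.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) pairs from a 4-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return  # end of file (or a truncated last record)
        header = record[0].rstrip("\n")
        seq = record[1].rstrip("\n")
        # Drop the leading '@' and anything after the first whitespace.
        yield header[1:].split()[0], seq


def sample_reads(handle, n, seed=0):
    """Reservoir-sample n reads so the whole file never sits in memory."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(read_fastq(handle)):
        if i < n:
            sample.append(rec)
        else:
            j = rng.randint(0, i)
            if j < n:
                sample[j] = rec
    return sample


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Usage (hypothetical file name):
#   with open("reads.fastq") as fq:
#       print(to_fasta(sample_reads(fq, 20)), end="")
```

If most of the sampled reads hit the expected organism but the contigs do not, the assembly is the more likely suspect; if the reads themselves hit nothing or something unexpected, the problem is upstream of the assembler.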

modified 14 months ago • written 14 months ago by genomax

When you have nice, long contigs that don't BLAST to anything, it's likely that it's a novel organism (at least, distant compared to anything in your BLAST database). Metagenomes usually have low coverage, and assemblers don't tend to output random junk as long contigs. Contamination is more likely to give you good BLAST hits than the metagenomic target, since the same contamination tends to be seen in lots of labs. Thus, contamination is not a very likely explanation for this outcome.

Oops, I misread and thought this was a metagenome rather than an isolate. If the species you expect is in your BLAST database and your contigs still don't hit it, then the isolate is not the species you expect.

modified 13 months ago • written 13 months ago by Brian Bushnell

It depends on how you are defining contamination. If you think you are working with species A, but your assembly gives you species B, that could be considered contamination even if it's a great assembly. It's not clear if that's the case here.

modified 13 months ago • written 13 months ago by igor
14 months ago, igor (United States) wrote:

You can try running all your raw reads through a metagenomic classifier. There are a few of those out there, such as:

These will quickly classify a lot of reads and show you if you have contamination. Ideally, most of them will be from a similar source.
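To judge whether "most of them" really come from one source, the per-read calls can be tallied. This is only a sketch: it assumes the classifier output has been reduced to a tab-separated file with the read ID in the first column and a taxon label in the second, which is a made-up format for illustration and not the native output of any particular tool.

```python
from collections import Counter


def tally_taxa(lines, unclassified="unclassified"):
    """Count reads per taxon from 'read_id<TAB>taxon' lines.

    Returns (taxon, read_count, fraction_of_reads) tuples, most common first.
    The two-column layout is an assumption made for this example.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        taxon = fields[1].strip() if len(fields) > 1 else ""
        counts[taxon or unclassified] += 1
    total = sum(counts.values())
    return [(taxon, n, n / total) for taxon, n in counts.most_common()]


# Usage (hypothetical file name):
#   with open("read_calls.tsv") as fh:
#       for taxon, n, frac in tally_taxa(fh)[:10]:
#           print(f"{taxon}\t{n}\t{frac:.1%}")
```

One dominant taxon with a thin tail suggests a clean isolate; two or more sizeable groups from unrelated organisms points to contamination or a mixed culture.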

modified 14 months ago • written 14 months ago by igor
13 months ago, Tonor (UK) wrote:

I would also try DIAMOND: https://github.com/bbuchfink/diamond

In blastx mode it translates your DNA contigs/reads and searches them against a protein database, rather than doing a nucleotide search as blastn does. As Brian said, when you have large, good-quality contigs that don't hit anything, it tends to imply a novel organism. Protein space is the better place to search in this case, because protein sequences are more conserved between related species than nucleotide sequences. So even though blastn may give you no significant hits, blastx could well give you some to related organisms.
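A rough sketch of how this might look in practice. The first function only builds the two DIAMOND command lines (`makedb`, then `blastx` with tabular output); the file names, database prefix, and e-value cutoff are placeholder assumptions. The second function reads the resulting 12-column tabular hits and reports what fraction of each contig is covered by at least one hit, which is the "query coverage" figure the original question was about. Contig lengths have to be supplied separately.

```python
def diamond_commands(protein_fasta, contigs_fasta, db_prefix="ref", out_tsv="hits.tsv"):
    """Build (not run) the makedb and blastx command lines. All names are placeholders."""
    makedb = ["diamond", "makedb", "--in", protein_fasta, "--db", db_prefix]
    blastx = ["diamond", "blastx", "--db", db_prefix, "--query", contigs_fasta,
              "--out", out_tsv, "--outfmt", "6", "--evalue", "1e-5"]
    return makedb, blastx


def query_coverage(tsv_lines, contig_lengths):
    """Fraction of each contig covered by at least one hit.

    Expects BLAST tabular format 6, where columns 7 and 8 are qstart and qend
    (1-based, inclusive, in nucleotide coordinates for blastx).
    """
    spans = {}
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        qstart, qend = int(fields[6]), int(fields[7])
        # Reverse-strand hits have qstart > qend, so normalise the order.
        spans.setdefault(fields[0], []).append((min(qstart, qend), max(qstart, qend)))

    coverage = {}
    for contig, length in contig_lengths.items():
        covered, last_end = 0, 0
        for start, end in sorted(spans.get(contig, [])):
            start = max(start, last_end + 1)  # count overlapping bases only once
            if end >= start:
                covered += end - start + 1
                last_end = end
        coverage[contig] = covered / length
    return coverage


# To actually run the search (requires DIAMOND on your PATH):
#   import subprocess
#   for cmd in diamond_commands("proteins.faa", "contigs.fasta"):
#       subprocess.run(cmd, check=True)
```

Long contigs with a decent fraction covered by protein hits to one group of related organisms would support the "novel but related" explanation rather than a bad assembly.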

written 13 months ago by Tonor