Question: Simplest output format for a BLAST contamination check
morovatunc (Turkey) wrote 14 months ago:

Dear all hi,

I have finalised a de novo assembly of an uncharacterised organism. To check for contamination (bacterial, etc.), we would like to BLAST it against the NT database.

When I run BLAST with default settings, I get 33 GB of output that I do not need. My aim is a single summary percentage, such as

Bacteria -> 33% similarity.

I also tried megablast with default settings, but that did not help either.

Is this kind of summarised result possible?

Thank you very much for the help,

Best regards,

Tunc.

blast megablast • 572 views
modified 14 months ago by h.mon • written 14 months ago by morovatunc

Try Kraken; it's faster and can generate the kind of output you are asking for.

written 14 months ago by pld

In what form/format is this assembly? Contigs? How many of them? Why are you getting 33 GB of data? How are you running BLAST?

written 14 months ago by genomax

The assembly is ~900 MB in size (FASTA); the total length is 0.9 Gbp.

I tried;

megablast -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out
blastn  -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out
written 14 months ago by morovatunc

Make your BLAST search more stringent. Look for long alignments, since most of the short ones are likely to be spurious chance matches.

That said, you are on a slippery slope here. If contamination was suspected in your original data, it should have been taken care of before doing the assembly. Now you cannot be sure whether some of the sequence in your assembly should not be there in the first place.
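
As a rough illustration of "look for long alignments", here is a minimal Python sketch that filters default tabular BLAST output. It assumes the standard 12-column `-outfmt 6` layout (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore); the function name and the thresholds (`min_length=500`, `max_evalue=1e-10`) are arbitrary placeholders, not recommended values.

```python
def filter_hits(lines, min_length=500, max_evalue=1e-10):
    """Keep only long, significant hits from standard 12-column tabular output.

    The thresholds are illustrative assumptions; tune them for your data.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank, comment, or malformed lines
        length = int(fields[3])     # column 4: alignment length
        evalue = float(fields[10])  # column 11: e-value
        if length >= min_length and evalue <= max_evalue:
            kept.append(fields)
    return kept


# Example with two made-up hits: one long and significant, one short and weak.
example = [
    "c1\ts1\t99.0\t1200\t5\t0\t1\t1200\t10\t1209\t0.0\t2100",
    "c1\ts2\t90.0\t40\t2\t0\t1\t40\t5\t44\t1e-05\t50",
]
long_hits = filter_hits(example)
```

In practice you would pass an open file handle instead of a list, so the large output file is streamed line by line rather than loaded into memory.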

modified 14 months ago • written 14 months ago by genomax

Yes, you are totally right about taking precautions beforehand. For now, I am only doing this to catch the most obvious contaminants.

I don't want to mess with the default word size and mismatch settings. Is there a simpler way to do this tuning?

written 14 months ago by morovatunc

Check the BlobTools docs for a suggestion of BLAST parameters. In particular, set a stringent e-value cutoff, use a custom output format, and restrict the number of target sequences and HSPs returned (there is a somewhat hidden side effect of -max_target_seqs, though):

-outfmt '6 qseqid staxids bitscore std' \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-25
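
To get from that tabular output to a one-line summary such as "Bacteria -> 33%", something like the following Python sketch could work. It assumes the custom format above (column 1 = qseqid, column 2 = staxids, column 3 = bitscore), keeps the best-scoring hit per contig, and reports the percentage of contigs per group. The function names and the `taxid_to_group` mapping (taxid to kingdom) are hypothetical: BLAST does not give you that mapping in this format, so you would have to build it yourself, for example from the NCBI taxonomy dump.

```python
from collections import Counter


def best_hit_taxids(lines):
    """Return {qseqid: staxids} for the highest-bitscore hit of each query."""
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, staxids, bitscore = fields[0], fields[1], float(fields[2])
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (staxids, bitscore)
    return {q: taxids for q, (taxids, _) in best.items()}


def percent_by_group(best, taxid_to_group, total_contigs):
    """Percentage of contigs per group; contigs without any hit are 'no hit'.

    taxid_to_group is a user-supplied (hypothetical) mapping such as
    {"562": "Bacteria"}. staxids may hold several ';'-separated taxids;
    this sketch only looks at the first one.
    """
    counts = Counter()
    for staxids in best.values():
        first_taxid = staxids.split(";")[0]
        counts[taxid_to_group.get(first_taxid, "unclassified")] += 1
    counts["no hit"] = total_contigs - len(best)
    return {group: 100.0 * n / total_contigs for group, n in counts.items()}


# Made-up example: 4 contigs in the assembly, 2 of them with hits.
example = ["c1\t562\t300.0", "c1\t9606\t100.0", "c2\t9606\t250.0"]
summary = percent_by_group(
    best_hit_taxids(example),
    {"562": "Bacteria", "9606": "Eukaryota"},
    total_contigs=4,
)
```

Note that this counts contigs, not bases; weighting each contig by its length may describe contamination in a 0.9 Gbp assembly better.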
modified 14 months ago • written 14 months ago by h.mon
h.mon (Brazil) wrote 14 months ago:

When I try blast with default settings, I simply get 33Gb

What kind of output? XML? Have you tried the tabular format? How can you get an output that is probably bigger than your assembly by at least an order of magnitude?

For simple taxonomic identification of contigs, have a look at Kraken (very memory-intensive), Centrifuge or CLARK. However, for a taxonomic contamination check, BlobTools is really nice and more helpful than BLAST alone.

written 14 months ago by h.mon

For the runs, please see my reply above.

I have taken a look at the BlobTools paper. It seems to mine the BLAST output to make its graphs.

blastn -outfmt "6 qseqid staxids bitscore std" -query kefal.contigs.fasta -db /mnt/compgen/inhouse/share/blastdb/nt -out kefal_canu_megablast.out (this is from the BlobTools methods)
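
For what it's worth, the flags suggested in the comments could be combined with this command roughly as follows. This is only a sketch in Python that assembles the BLAST+ argument list and prints it; it does not run anything. The function name, `-task megablast`, and the thread count are my assumptions, and the -max_target_seqs caveat mentioned in the comments still applies.

```python
import shlex


def build_blastn_command(query, db, out, evalue="1e-25", threads=1):
    """Assemble a BLAST+ blastn argument list with a compact tabular output."""
    return [
        "blastn",
        "-task", "megablast",        # assumption: fast search for near-identical hits
        "-query", query,
        "-db", db,
        "-outfmt", "6 qseqid staxids bitscore std",
        "-max_target_seqs", "1",     # see the caveat about this option in the thread
        "-max_hsps", "1",
        "-evalue", evalue,
        "-num_threads", str(threads),
        "-out", out,                 # BLAST+ uses -out, not -o
    ]


cmd = build_blastn_command(
    "kefal.contigs.fasta",
    "/mnt/compgen/inhouse/share/blastdb/nt",
    "kefal_canu_megablast.out",
    threads=8,
)
print(shlex.join(cmd))  # copy-pasteable shell command (Python 3.8+)
```

Passing the list to `subprocess.run(cmd, check=True)` would run the search, provided blastn is on the PATH and the database has its taxonomy files installed so that staxids is populated.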

I was just trying to find some kind of output that solves this problem without setting up a couple of extra tools. Thank you for the answer; I will definitely give your suggestions a try.

written 14 months ago by morovatunc