Question: What Is A More Meaningful E-Value For 90 bp Paired-End Illumina Data?
Ken wrote:

Hi all, I used to use an e-value <= 1e-5 to get meaningful BLAST alignment results. I just saw this statement in the BLAST FAQ:

"The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence."

That makes me wonder whether e-value <= 1e-5 is good or too strict for Illumina 90 bp paired-end data with an insert size of ~150 bp (yes, the reads overlap a little; I just put 10 Ns between readA and readB for blasting). The word "short" can be tricky here. Any suggestion or rule-of-thumb experience is welcome. Thanks in advance.

alignment blast • 2.5k views
written 8.8 years ago by Ken

Why are you using BLAST? Dat's craycray.

written 8.8 years ago by Aaron Statham
Istvan Albert (University Park, USA) wrote:

The length is already factored into the E-value, so there is no need to account for it again. Put another way, knowing how E-values are computed should not by itself be a reason to change your thresholds.

The filter that you apply simply sets the false discovery rate; the right value is the one that supports the results that you obtain.
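
To illustrate how length enters the calculation, here is a minimal sketch of the standard Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The function name and all numbers are assumptions made up for illustration, and real BLAST uses "effective" lengths rather than the raw ones used here.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), with m the query length, n the total
    database length and S' the bit score. Real BLAST substitutes
    "effective" lengths for m and n; that correction is left out here.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    db_len = 200_000_000  # hypothetical database size in residues
    for query_len in (30, 60, 300):  # hypothetical query lengths
        e = evalue_from_bit_score(50.0, query_len, db_len)
        print(f"query_len={query_len:>3}  E={e:.2e}")
```

For a fixed bit score the E-value scales linearly with both the query length and the database size, which is why the threshold itself does not need a further length correction.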

modified 8.8 years ago • written 8.8 years ago by Istvan Albert

Vitis (New York) wrote:

I think you should try a specialized short-read mapper and look at the mapping/alignment qualities; short-read mappers usually use a different algorithm than BLAST.

written 8.8 years ago by Vitis

ALchEmiXt (The Netherlands) wrote:

To add to Istvan's comment: the e-value by definition depends on the database used (it incorporates database-specific parameters). So you can NEVER compare ABSOLUTE e-values from searches run against different databases. That may not be the case here, but keep in mind that the e-value is not the holy grail.

I second Vitis: a short-read aligner would be more appropriate and much FASTER!

written 8.8 years ago by ALchEmiXt

Hi ALchEmiXt, in my case, I am aligning the reads against swissprot. What would you suggest then? Thanks.

written 8.8 years ago by Ken

It depends on exactly what you want to do (we need more context): whether it's RNA-seq, resequencing or de novo... I think reducing the data by assembly and BLASTing the contigs makes sense. Short sequences can be just motifs and may not be specific enough for accurate annotation by direct BLAST. Have a look at TopHat, SOAP, ABySS and the like; for RNA-seq there are usually special versions as well. This becomes more difficult for a metagenomics experiment, though.

written 8.8 years ago by ALchEmiXt

Sounds like you don't have a reference genome. If it's from one organism, I'd do a de novo assembly first, then run blastx against swissprot.

written 8.8 years ago by Vitis

Thanks ALchEmiXt and Vitis. Yes, I am working on metagenomic data; there is no reference genome and the reads most likely won't assemble. At this step, I want to annotate the reads that were identified by my 'magical' method. Currently I am getting quite a lot of no-hits when blasting against swissprot, so I was wondering whether that is due to the strict e-value of 1e-5 I am using. I agree with Istvan's opinion: by raising the e-value threshold I would probably get more reads annotated, but also more false positives. I may just leave it as it is. Thanks all for your helpful opinions!
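
One way to see this trade-off is to run BLAST once with a permissive cutoff and count how many reads keep at least one hit at several stricter thresholds. The sketch below assumes the default 12-column tabular output (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value; the function name and the example rows are invented for illustration and are not real results.

```python
import csv


def reads_with_hits(tabular_lines, max_evalue):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Expects BLAST tabular rows (default -outfmt 6 column order):
    column 1 is the query ID, column 11 is the e-value.
    """
    reads = set()
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= max_evalue:
            reads.add(row[0])
    return reads


# Invented example rows, not real BLAST output.
example_rows = [
    "read1\tsubjA\t96.7\t30\t1\t0\t1\t90\t12\t41\t2e-08\t55.1",
    "read1\tsubjB\t80.0\t30\t6\t0\t1\t90\t5\t34\t4e-02\t33.9",
    "read2\tsubjC\t86.7\t30\t4\t0\t91\t180\t100\t129\t1e-03\t39.3",
    "read3\tsubjD\t70.0\t30\t9\t0\t1\t90\t7\t36\t0.5\t30.0",
]

if __name__ == "__main__":
    for cutoff in (1e-5, 1e-2, 1.0):
        n = len(reads_with_hits(example_rows, cutoff))
        print(f"e-value <= {cutoff:g}: {n} reads with a hit")
```

The counts only show how many reads gain an annotation as the cutoff is relaxed; whether the extra hits are trustworthy still has to be judged separately.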

written 8.8 years ago by Ken

@Ken, ah, metagenomics... we did quite a few of those, and indeed, depending on the sample, the expect and sequencing batch size, they might not assemble. But why are you BLASTing against swissprot? For your metagenomics approach (with the details you specified) I would think NT would be more appropriate... many sequences are just not protein coding... or did I miss something?

written 8.8 years ago by ALchEmiXt

Hi ALchEmiXt, yeah, we use swissprot intentionally because we are interested in actual genes, not predicted genes in NT. Thanks.

written 8.8 years ago by Ken

yup but a gene is usually more than only its protein encoding segment.... :-P therefore NT/nr...

written 8.8 years ago by ALchEmiXt

Nicolas Rosewick (Belgium, Brussels) wrote:

I think E-value = p-value * N, where N is the number of entries in your database.
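
As a related sketch, the BLAST statistics documentation connects the two quantities at the level of a whole search through P = 1 - exp(-E), treating the number of chance hits as Poisson distributed, so for small values the P-value and E-value are nearly equal. The function name below is an assumption for illustration; this does not confirm or replace the N-scaling formula above.

```python
import math


def pvalue_from_evalue(evalue):
    """Probability of at least one chance hit, given the expected number E.

    Implements P = 1 - exp(-E); -expm1(-E) is the same quantity but
    stays accurate when E is very small.
    """
    return -math.expm1(-evalue)


if __name__ == "__main__":
    for e in (1e-5, 0.01, 1.0, 10.0):
        print(f"E={e:g}  P={pvalue_from_evalue(e):.6g}")
```

At a cutoff such as 1e-5 the distinction is negligible; it only matters for E-values near or above 1.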

written 8.8 years ago by Nicolas Rosewick