Question

What Is A More Meaningful E-Value For 90 bp Paired-End Illumina Data?

13.2 years ago

Ken ▴ 170

Hi all, I have been using an e-value cutoff of <= 1e-5 to keep only meaningful BLAST alignments. I just saw this statement in the BLAST FAQ:

"The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence."

That makes me wonder whether e-value <= 1e-5 is appropriate or too strict for Illumina 90 bp paired-end data with an insert size of ~150 bp (yes, the reads overlap a little; I just put 10 Ns between readA and readB for BLASTing). The word "short" can be tricky here. Any suggestion or rule-of-thumb experience is welcome. Thanks in advance.

blast alignment • 3.9k views

ADD COMMENT • link updated 13.2 years ago by Nicolas Rosewick 11k • written 13.2 years ago by Ken ▴ 170

Why are you using BLAST? Dat's craycray.

ADD REPLY • link 13.2 years ago by Aaron Statham ★ 1.1k

score 4 · Answer 1 · 2011-09-09

The query length is already factored into the E-value calculation, so there is no need to account for it again. Put another way, knowing how E-values are derived should not by itself be a reason to change your threshold.

The filtering that you apply simply sets the false discovery rate - the right value should be the one that supports the results that you obtain.
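To make the length dependence concrete, here is a minimal sketch based on the textbook Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The database size and query lengths below are made-up illustration values, and real BLAST additionally uses edge-corrected "effective" lengths, so actual numbers will differ somewhat.

```python
import math

def expected_hits(bit_score, query_len, db_len):
    """E-value from the simplified relation E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

def min_bit_score(max_evalue, query_len, db_len):
    """Smallest bit score that still satisfies E <= max_evalue."""
    return math.log2(query_len * db_len / max_evalue)

DB_LEN = 2e8  # hypothetical total database length in residues

# Hypothetical query lengths in residues; a 90 bp read translates to ~30 aa.
for query_len in (30, 63, 300):
    needed = min_bit_score(1e-5, query_len, DB_LEN)
    print(f"query length {query_len:>4}: need >= {needed:.1f} bits for E <= 1e-5")
```

The point is that a fixed cutoff such as 1e-5 translates into a minimum bit score, and a short alignment can only pass if it is good enough to reach that score. The cutoff itself already accounts for query length.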

score 2 · Answer 2 · 2011-09-09

13.2 years ago

Vitis ★ 2.5k

I think you should try a specialized short read mapper to check the mapping/alignment quality. Short read mappers usually use a different algorithm than BLAST.

ADD COMMENT • link 13.2 years ago by Vitis ★ 2.5k

score 2 · Answer 3 · 2011-09-09

13.2 years ago

ALchEmiXt ★ 1.9k

To add to Istvan's comment: the e-value by definition depends on the database used (it incorporates database-specific parameters). So you can NEVER compare ABSOLUTE e-values of sequences searched against different databases. That may not be the case here, but keep in mind that the e-value is not the holy grail.

I second Vitis: a short read aligner would be more appropriate and much FASTER!
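As a rough illustration of the database dependence: under the simplified model where the E-value is proportional to the search space, the same alignment (same query, same bit score) gets a different E-value in a larger database. The function name and both database sizes below are hypothetical, and the sketch ignores BLAST's effective-length corrections.

```python
def rescale_evalue(evalue, db_len_from, db_len_to):
    """Rescale an E-value to another database size, assuming E is proportional to db length."""
    return evalue * db_len_to / db_len_from

SMALL_DB = 2e8   # hypothetical total length of a small curated database
LARGE_DB = 4e10  # hypothetical total length of a much larger database

e_small = 1e-5
e_large = rescale_evalue(e_small, SMALL_DB, LARGE_DB)
print(f"E = {e_small:.0e} in the small DB corresponds to E = {e_large:.0e} in the large DB")
```

So a hit that passes 1e-5 against one database may fail the same cutoff against another, even though the alignment is identical.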

ADD COMMENT • link 13.2 years ago by ALchEmiXt ★ 1.9k

Hi ALchEmiXt, in my case, I am aligning the reads against swissprot. What would you suggest then? Thanks.

ADD REPLY • link 13.2 years ago by Ken ▴ 170

It depends on exactly what you want to do (we need more context): is it RNA-seq, resequencing, or de novo? Reducing the data by assembly and then BLASTing the contigs makes sense to me. Short sequences can be just motifs and may not be specific enough for accurate annotation by direct BLAST. Have a look at TopHat, SOAP, ABySS, and such; if it's RNA-seq there are usually special versions for that as well. This becomes more difficult when it's a metagenomics experiment, though.

ADD REPLY • link 13.2 years ago by ALchEmiXt ★ 1.9k

Sounds like you don't have a reference genome. If it's from one organism, I'd do a de novo assembly first, then run blastx against swissprot.

ADD REPLY • link 13.2 years ago by Vitis ★ 2.5k

Thanks ALchEmiXt and Vitis. Yes, I am working on metagenomic data; there is no reference genome and the reads most likely won't assemble. At this step, I want to annotate the reads that were identified by my 'magical' method. Currently I am getting quite a lot of no-hits when BLASTing against swissprot, so I was wondering whether that is due to the strict e-value of 1e-5 I am using. I agree with Istvan's opinion. By raising the e-value cutoff I would probably get more reads annotated, but also more false positives. I may just leave it as it is. Thanks all for your helpful opinions!
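One way to see this trade-off on real data is to count how many reads get at least one hit at several cutoffs. Below is a small sketch that assumes BLAST+ default tabular output (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value. The rows in `EXAMPLE` are invented purely for illustration, and the function name is hypothetical.

```python
def reads_with_hit(tabular_lines, max_evalue):
    """Return the set of query IDs with at least one hit at E <= max_evalue."""
    annotated = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            annotated.add(query_id)
    return annotated

# Invented rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
EXAMPLE = [
    "read1\tsubjA\t96.7\t30\t1\t0\t1\t90\t10\t39\t2e-12\t65.1",
    "read2\tsubjB\t60.0\t30\t12\t0\t1\t90\t5\t34\t3e-04\t38.5",
    "read3\tsubjC\t45.0\t20\t11\t0\t1\t60\t7\t26\t0.8\t27.3",
    "read2\tsubjD\t55.0\t29\t13\t0\t4\t90\t5\t33\t0.02\t33.1",
]

for cutoff in (1e-5, 1e-3, 1.0):
    n = len(reads_with_hit(EXAMPLE, cutoff))
    print(f"E <= {cutoff:g}: {n} read(s) annotated")
```

To use it on a real search, run BLAST once with a loose cutoff, pass the open output file as `tabular_lines`, and compare the counts. The extra reads gained at looser cutoffs are the ones most likely to be false positives.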

ADD REPLY • link 13.2 years ago by Ken ▴ 170

@Ken, metagenomics... we have done quite a few of those, and indeed, depending on the sample, the expect and sequencing batch size, they might not assemble. But why are you BLASTing against swissprot? For your metagenomics approach (with the details you specified) I would think NT would be more appropriate: many sequences are just not protein coding. Or did I miss something?

ADD REPLY • link 13.1 years ago by ALchEmiXt ★ 1.9k

Hi ALchEmiXt, yes, we use swissprot intentionally because we are interested in actual genes, not predicted genes in NT. Thanks.

ADD REPLY • link 13.1 years ago by Ken ▴ 170

yup but a gene is usually more than only its protein encoding segment.... :-P therefore NT/nr...

ADD REPLY • link 13.1 years ago by ALchEmiXt ★ 1.9k

score 0 · Answer 4 · 2011-09-12

13.1 years ago

Nicolas Rosewick 11k

I think e-val = p-val * N, where N is the number of entries in your DB.
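For comparison, a sketch of the textbook Karlin-Altschul form that BLAST's statistics are based on. The symbols are the standard ones and not taken from this thread: m is the query length, n the total length of the database, S the raw alignment score, and K and λ are constants of the scoring system.

```latex
E = K\,m\,n\,e^{-\lambda S}, \qquad P = 1 - e^{-E} \approx E \quad (E \ll 1)
```

In this form the database enters through its total length n rather than its number of entries, but the intuition is the same: the E-value grows in proportion to the size of the search space, and for small values E and P are nearly equal.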

ADD COMMENT • link 13.1 years ago by Nicolas Rosewick 11k