Question

Choosing the cutoff for e-value when using very small Sequences

0

Entering edit mode

4.1 years ago

ancient_learner ▴ 680

Hi I hope you all are doing good. I am dealing with IGHD genes (Mouse genome) which are small in size. We have got our HTGTS-Seq done and have got the reads I want to particularly check for D segment in the IGH locus. So my reference file here is D gene segments file to which some of my reads shall align. If I am keeping e-value cutoff as 0.01 it is not returning all possible alignments for gene segments as some of the D gene segments are 10-15bps in length. But if I change it to 0.1 I am able to get some. So the question is how should I determine the cutoff?

Blastn evalue • 1.8k views

ADD COMMENT • link 4.1 years ago by ancient_learner ▴ 680

3

Entering edit mode

I personally would say, don't use e-value as a "cutoff" kind of parameter. Use the default, which is 10.0 on top of my head. And use values like identity and coverage to filter the output. One value to calculate the e-value is the size of the database, so if you happen to find more reference sequences and add that to your database the e-value can change again. I don't know your expectations or goal but with such small sequences it is sometimes worth something the change the word_size.

ADD REPLY • link 4.1 years ago by gb ★ 2.2k

0

Entering edit mode

I think II should explain it better. I have a 400 bp sequence (which is a query in my case) like this

>seq1
ATGGCGTGCGGTGCGTGCGTGCGGTGC**AGCGTGGCGT**ATGTGCGTGGCGT.........

I have a reference file in which each entry would be close to 20 bp or even less

>Dgene1
AGTGGGTGTGAGTGAAA
>Dgene2
AGCGTGGCGTG

So my question is when I run blastn what parameters should I consider to get the good hits.

ADD REPLY • link 4.0 years ago by ancient_learner ▴ 680

0

Entering edit mode

You could use the longer sequence as the reference. Would you have reasons not to do that?

ADD REPLY • link 4.0 years ago by gb ★ 2.2k

0

Entering edit mode

As I told it's IGH locus and am looking at IGHD gene segments which are small

ADD REPLY • link 4.0 years ago by ancient_learner ▴ 680

0

Entering edit mode

Not sure why that changes anything. If you use seq1 as the reference and blast Dgene1 against it you will get a match or not. If the match is perfect (100% coverage and identity) you know that seq1 contains a IGHD segment.

Besides, blast uses the query coverage to calculate scores. If you blast a much longer sequence against a shorter one this will be extremely low. Let's say that you have a perfect match Dgene2 and your sequence is 150pb. 11/150*100=7.3% the coverage will be 7.3% which is really low. But it is a 100% percent match.

In case of Dgene1 it will be 11.3% coverage, so you will have two perfectly matching segments, one with a coverage of 7.3 and the other 11.3... If you have two perfectly matching sequences you should have two times 100%. Otherwise it is impossible to set thesholds.

ADD REPLY • link 4.0 years ago by gb ★ 2.2k