Choosing the cutoff for e-value when using very small Sequences
0
0
Entering edit mode
4.1 years ago

Hi I hope you all are doing good. I am dealing with IGHD genes (Mouse genome) which are small in size. We have got our HTGTS-Seq done and have got the reads I want to particularly check for D segment in the IGH locus. So my reference file here is D gene segments file to which some of my reads shall align. If I am keeping e-value cutoff as 0.01 it is not returning all possible alignments for gene segments as some of the D gene segments are 10-15bps in length. But if I change it to 0.1 I am able to get some. So the question is how should I determine the cutoff?

Blastn evalue • 1.8k views
ADD COMMENT
3
Entering edit mode

I personally would say, don't use e-value as a "cutoff" kind of parameter. Use the default, which is 10.0 on top of my head. And use values like identity and coverage to filter the output. One value to calculate the e-value is the size of the database, so if you happen to find more reference sequences and add that to your database the e-value can change again. I don't know your expectations or goal but with such small sequences it is sometimes worth something the change the word_size.

ADD REPLY
0
Entering edit mode

I think II should explain it better. I have a 400 bp sequence (which is a query in my case) like this

>seq1
ATGGCGTGCGGTGCGTGCGTGCGGTGC**AGCGTGGCGT**ATGTGCGTGGCGT.........

I have a reference file in which each entry would be close to 20 bp or even less

>Dgene1
AGTGGGTGTGAGTGAAA
>Dgene2
AGCGTGGCGTG

So my question is when I run blastn what parameters should I consider to get the good hits.

ADD REPLY
0
Entering edit mode

You could use the longer sequence as the reference. Would you have reasons not to do that?

ADD REPLY
0
Entering edit mode

As I told it's IGH locus and am looking at IGHD gene segments which are small

ADD REPLY
0
Entering edit mode

Not sure why that changes anything. If you use seq1 as the reference and blast Dgene1 against it you will get a match or not. If the match is perfect (100% coverage and identity) you know that seq1 contains a IGHD segment.

Besides, blast uses the query coverage to calculate scores. If you blast a much longer sequence against a shorter one this will be extremely low. Let's say that you have a perfect match Dgene2 and your sequence is 150pb. 11/150*100=7.3% the coverage will be 7.3% which is really low. But it is a 100% percent match.

In case of Dgene1 it will be 11.3% coverage, so you will have two perfectly matching segments, one with a coverage of 7.3 and the other 11.3... If you have two perfectly matching sequences you should have two times 100%. Otherwise it is impossible to set thesholds.

ADD REPLY

Login before adding your answer.

Traffic: 1947 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6