Question

BLAST homolog search

0

4.2 years ago

varun.chopra ▴ 10

Hey guys

I've been working with BLAST in an attempt to identify homologs for genes in mosquitos (specifically Anopheles stephensi).

I found the coding sequences I needed on FlyBase (with D. melanogaster as the species) and BLASTed them on VectorBase against the A. stephensi databases available there.

My question is: how do I identify a good candidate gene to investigate as a possible homolog? I know a low e-value is good, but are there any guidelines for % identity or score? I've got hits with a fairly low identity of 60% but a very low e-value. Should I look further into a hit like that or move on to something better?

BLAST homolog alignment • 1.3k views

updated 4.2 years ago by lieven.sterck 15k • written 4.2 years ago by varun.chopra ▴ 10

2

The e-value is not a measure of similarity. It is the number of hits with a score at least as good that you would expect to find by chance in a database of that size. That value will change if you use another reference database. (Someone can probably explain it better than me.)

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect
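To make the database dependence concrete, here is a minimal sketch using the textbook Karlin-Altschul approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The function name and the numbers are made up for illustration, and BLAST itself uses effective (edge-corrected) lengths, so real reported values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), with S' the bit score.

    Simplification: raw lengths are used here, whereas BLAST applies
    effective lengths, so this only illustrates the trend.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same alignment (same bit score) against two hypothetical database sizes.
small_db = expected_hits(bit_score=60.0, query_len=500, db_len=10_000_000)
large_db = expected_hits(bit_score=60.0, query_len=500, db_len=1_000_000_000)

print(f"small database: E = {small_db:.2e}")
print(f"large database: E = {large_db:.2e}")
```

The alignment itself is identical in both cases; only the search space changes, which is why an e-value should not be compared across different databases.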

I think for this kind of search people mostly use BLASTX or TBLASTX. Not an exact answer, but maybe it helps you on your way.

4.2 years ago by gb ★ 2.2k

0

I didn't word my initial question properly, that's my bad!

I know the e-value says nothing about similarity; it's more about the probability of the sequence matching another sequence at random. I did browse the FAQ and BLAST information pages on NCBI, though I haven't had much luck figuring this out. Like I said to lieven.sterck, I'm trying to figure out what range of each metric (e-value, % identity, score, etc.) I should look for when trying to find a homolog.

BLASTX and TBLASTX were my other options. I'll try them out, thank you!

4.2 years ago by varun.chopra ▴ 10

1

Someone once told me that two proteins can already be called homologs at 30% identity. I am not sure whether that is true.

This article also throws around some numbers (also around 30%): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/

Here someone says 40%: https://www.researchgate.net/post/When_do_you_consider_two_proteins_to_be_homologous Maybe you can find a paper where they do something similar and validate it, and then use the same approach.

Another thing you could do is extract the genes/proteins from your reference database that are similar (have the same function) and do an all-vs-all alignment. Then you could take the lowest alignment score as your homolog threshold.
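As a rough sketch of that idea, the snippet below scores every pair in a small set of sequences and takes the minimum as the threshold. Everything here is an assumption made for illustration: the sequences are toy strings, and the scoring is a plain Needleman-Wunsch global alignment with +1/-1/-1 costs rather than a real substitution matrix or a dedicated all-vs-all tool.

```python
from itertools import combinations


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy sequences standing in for proteins already known to share a function.
known_family = {
    "a": "MKTAYIAK",
    "b": "MKTAYLAK",
    "c": "MKSAYLAK",
}

# All-vs-all: score every unordered pair once.
pair_scores = {
    (x, y): global_score(known_family[x], known_family[y])
    for x, y in combinations(sorted(known_family), 2)
}

# The weakest pair among known relatives becomes the cutoff.
threshold = min(pair_scores.values())
print(pair_scores)
print("threshold:", threshold)
```

Raw scores grow with sequence length, so for real data it would probably make more sense to compare a length-normalised value such as percent identity or bit score.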

4.2 years ago by gb ★ 2.2k

0

This definitely helps a ton thank you so much!

4.2 years ago by varun.chopra ▴ 10

1

I don't remember very well anymore, but I think when I did an all-vs-all alignment I used one of these tools: https://www.drive5.com/usearch/manual7/allpairs_global.html

https://www.drive5.com/usearch/manual7/allpairs_local.html

4.2 years ago by gb ★ 2.2k

score 1 · Answer 1 · 2020-03-10

1

4.2 years ago

lieven.sterck 15k

There will not be a single metric that will tell you this. What you need to do is take several metrics into account (and place them in context).

E.g., take % identity/similarity combined with hit length (a very similar hit on only a small region of your input query is less meaningful than one with a lower % similarity that covers nearly your complete input query).
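A small sketch of how identity and hit length can be looked at together, assuming the search was run with BLAST+ tabular output plus the query length (`-outfmt "6 std qlen"`, i.e. the 12 standard columns followed by `qlen`). The two rows and all identifiers in them are made up for illustration, and coverage is computed per HSP only, so hits split over several HSPs would need to be merged first.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    pident: float      # percent identity of the HSP
    query_cov: float   # percent of the query covered by this HSP
    evalue: float
    bitscore: float


def parse_hit(line: str) -> Hit:
    """Parse one row of '6 std qlen' tabular BLAST output."""
    f = line.rstrip("\n").split("\t")
    qstart, qend, qlen = int(f[6]), int(f[7]), int(f[12])
    cov = 100.0 * (abs(qend - qstart) + 1) / qlen
    return Hit(f[1], float(f[2]), cov, float(f[10]), float(f[11]))


# Hypothetical rows: a long moderate-identity hit and a short high-identity hit.
rows = [
    "query1\tsubjA\t62.0\t480\t170\t4\t11\t490\t5\t480\t1e-150\t520\t500",
    "query1\tsubjB\t95.0\t60\t3\t0\t201\t260\t10\t69\t1e-20\t95.0\t500",
]

hits = [parse_hit(r) for r in rows]
for h in sorted(hits, key=lambda h: h.query_cov, reverse=True):
    print(f"{h.subject}: {h.pident:.1f}% identity over {h.query_cov:.1f}% of the query")
```

In this toy case the 62% hit covering almost the whole query would be the more interesting candidate, even though the short hit has a much higher identity.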

On a side note: do NOT confuse homology and similarity! These are two different things (the former is binary: two sequences are homologous or not, while the latter can be expressed as a percentage: something is X% similar to something else). Similarity is often the basis for homology analysis, though.

4.2 years ago by lieven.sterck 15k

1

I understand that similarity is a metric used to identify possible gene homology, but I'm trying to figure out how much similarity I should look for. For example, I know a very low similarity would signify that the genes don't have much in common, which likely means they are not homologs.

I'm trying to figure out what mix of metrics and more importantly what range in said metrics (like low e-value + high identity = possible homolog candidate) I need.

4.2 years ago by varun.chopra ▴ 10

1

OK, I'm not sure there is a definite answer to that (it depends heavily on the proteins in question). Something that is often used, though, is the notion of 30% identity over at least 70% of the length to call homologs. Then again, that only gives you a starting point; you should still do other analyses to determine whether they are homologs or not.

Rost criterion (Rost, 1999): two proteins are homologous when they share > 30% identical residues over an alignable region of ≥ 150 aa. If the alignable region is < 150 aa, a cutoff curve based on homology-derived secondary structure prediction identity is used to determine whether the two sequences are homologous.
Li-Rost criterion (Li, 2001): differs from the above criterion in that the percent identity is recalculated from a similarity over the alignable region to a similarity along the entire amino acid sequence.
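
The two simplest rules of thumb above can be written down as a quick pre-filter. This is only a sketch: the function names are made up, the lengths are assumed to be in amino acids, and the length-dependent cutoff curve for regions shorter than 150 aa is deliberately not reproduced, so that case is returned as undecided.

```python
from typing import Optional


def passes_30_70(pident: float, aligned_len: int, query_len: int) -> bool:
    """Rule of thumb: at least 30% identity over at least 70% of the length."""
    return pident >= 30.0 and aligned_len / query_len >= 0.70


def rost_long_region(pident: float, aligned_len: int) -> Optional[bool]:
    """Rost-style check for alignable regions of >= 150 aa.

    Returns None for shorter regions, where a length-dependent cutoff
    curve would be needed (not implemented in this sketch).
    """
    if aligned_len < 150:
        return None
    return pident > 30.0


# Hypothetical hit: 60% identity over 450 aligned residues of a 500 aa query.
print(passes_30_70(60.0, 450, 500))
print(rost_long_region(60.0, 450))
# Hypothetical short hit: the simple rule cannot decide here.
print(rost_long_region(60.0, 90))
```

Passing such a filter only flags a candidate; it does not replace the follow-up analyses mentioned above.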