Question

What should be the minimum percent identity and coverage of BLAST hits for a sequence to be considered a gene?

0

9.7 years ago

inayat45shaikh ▴ 40

Hello group,

I predicted peptide sequences from de novo assembled contigs using an ab initio approach (GENSCAN) and ran a similarity search (BLASTP) on them to identify genes in the assembled sequences. The difficulty I am facing is choosing the minimum percent identity and coverage for BLAST hits. What should the minimum thresholds for percent identity and coverage be so that I can say with confidence that the gene is present? This is eukaryotic genome data.

blast gene alignment sequence • 13k views

updated 9.7 years ago by Renesh ★ 2.2k • written 9.7 years ago by inayat45shaikh ▴ 40

Ram · Answer 1 · 2015-11-03

5

9.7 years ago

Renesh ★ 2.2k

For BLASTP, look for alignments with an e-value < 0.001 (1e-3) to infer that the given gene is present.

More detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/
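
As a rough illustration of applying this cutoff, here is a minimal Python sketch that filters BLAST+ tabular output by e-value. It assumes the search was run with `-outfmt 6` and its default twelve columns; the function name `filter_by_evalue` and the two example rows are made up for illustration and are not from any real search.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_by_evalue(handle, max_evalue=1e-3):
    """Yield hits (as dicts) whose e-value is below max_evalue."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) < max_evalue:
            yield row


# Made-up example rows, for illustration only.
example = (
    "pep1\tsubjA\t45.2\t310\t150\t5\t1\t305\t4\t312\t2e-80\t250\n"
    "pep2\tsubjB\t31.0\t60\t40\t1\t10\t69\t100\t159\t0.5\t28.1\n"
)
kept = [hit["qseqid"] for hit in filter_by_evalue(io.StringIO(example))]
print(kept)
```

With a real results file, the same function can be given an open file handle instead of the `io.StringIO` wrapper.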

updated 5.6 years ago by Ram 45k • written 9.7 years ago by Renesh ★ 2.2k

0

What a nice paper! Not too long, not too short, with a simple summary of recommended parameter settings. It should be required reading in bioinformatics. I will add this to the training list of recommended papers to read.

Another interesting tidbit: it is from the author of the FASTA suite, hence the FASTA format ...

updated 5.6 years ago by Ram 45k • written 9.7 years ago by Istvan Albert 102k

0

That's quite a relaxed e-value threshold. I would say that 1e-6 is used more commonly, but it isn't going to guarantee anything, especially with multi-domain eukaryotic proteins. Even a much stricter e-value threshold, say 1e-60, isn't going to guarantee much, since such an e-value can be due to a single shared domain between the query and subject sequences. OP is on the right track with applying some kind of coverage threshold. I've listed the relevant output format specifiers below. I would personally feel relatively confident with something like qlen/slen = 1 ± 0.25 && qlen/length = 1 ± 0.25.

qlen - Query sequence length
slen - Subject sequence length
length - Alignment length
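
A minimal Python sketch of this length-ratio check, under some assumptions: the search is run with a custom tabular format such as `-outfmt "6 qseqid sseqid qlen slen length evalue bitscore"`, the alignment-length term in the rule is read as the `length` specifier, and the names `passes_length_ratios` and `filter_hits` are hypothetical helpers, not part of BLAST. The 0.25 tolerance is the rule of thumb suggested above, not an established standard.

```python
def passes_length_ratios(qlen, slen, length, tolerance=0.25):
    """True if qlen/slen and qlen/length both lie within 1 +/- tolerance."""
    return (abs(qlen / slen - 1) <= tolerance
            and abs(qlen / length - 1) <= tolerance)


def filter_hits(lines, tolerance=0.25):
    """Yield (qseqid, sseqid) for tab-separated hits passing both ratios.

    Assumed column order: qseqid, sseqid, qlen, slen, length, ...
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        qlen, slen, length = (int(x) for x in fields[2:5])
        if passes_length_ratios(qlen, slen, length, tolerance):
            yield qseqid, sseqid


# Made-up rows: a near full-length hit and a short single-domain hit.
rows = [
    "pep1\tsubjA\t300\t310\t305\t2e-80\t250",
    "pep2\tsubjB\t300\t900\t80\t1e-20\t95",
]
print(list(filter_hits(rows)))
```

The second row has a significant e-value but fails both ratios, which is the single-shared-domain situation described above.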

updated 5.6 years ago by Ram 45k • written 9.7 years ago by 5heikki 11k

0

Thanks for the above paper.

The paper states that e-values and bit scores (bits > 50) are a more sensitive and reliable basis for inferring homology. I filtered the BLAST hits on those parameters, but one question remains: what percent coverage (the fraction of the gene sequence's length covered by the alignment) should a hit have?

Some hits show high identity and a bit score > 50 but cover only 5-10% of the gene sequence. Can we consider these to be genes? Is there a defined threshold for alignment coverage?

written 9.7 years ago by inayat45shaikh ▴ 40