Question: What minimum percent identity and coverage should BLAST hits have to be considered a gene sequence?
0
3.5 years ago by
India
inayat45shaikh40 wrote:

Hello group,

I predicted peptide sequences from de novo assembled contigs using an ab initio approach (GENSCAN) and ran a similarity search (BLASTP) on them to identify genes in the assembled sequences. The difficulty I am facing is choosing the minimum percent identity and coverage of the BLAST hits. What should the minimum thresholds for percent identity and coverage be so that I can say with confidence that the gene is present? This is eukaryotic genome data.

blast sequence alignment gene • 6.1k views
modified 3.5 years ago by Renesh1.6k • written 3.5 years ago by inayat45shaikh40
4
3.5 years ago by
Renesh1.6k
United States
Renesh1.6k wrote:

For BLASTP, you should look for alignments with an e-value < 0.001 (1e-3) to infer that the given gene is present.

More detail: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3820096/ 

written 3.5 years ago by Renesh1.6k

What a nice paper! Not too long, not too short, with a simple summary of recommended parameter settings. It should be required reading in bioinformatics. I will add it to the training list of recommended papers to read.

Another interesting tidbit: it is from the author of the FASTA suite, hence the FASTA format ...

written 3.5 years ago by Istvan Albert ♦♦ 80k

That's quite a relaxed e-value threshold. I would say that 1e-6 is used more commonly, but it isn't going to guarantee anything, especially with multi-domain eukaryotic proteins. Even a much stricter e-value threshold, say 1e-60, isn't going to guarantee much, since such an e-value can be due to a single domain shared between the query and subject sequences. OP is on the right track with applying some kind of coverage threshold. I've listed the relevant BLAST output format specifiers below. I would personally feel relatively confident with something like qlen/slen = 1 ± 0.25 && qlen/length = 1 ± 0.25.

qlen means Query sequence length
slen means Subject sequence length
length means Alignment length
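
As a rough illustration of the length-ratio check described above (query length against subject length, and query length against alignment length), here is a minimal Python sketch. It assumes BLASTP was run with tabular output requested as `-outfmt "6 qseqid sseqid pident length evalue bitscore qlen slen"`; the function names `within` and `filter_hits`, the column order, the 1e-6 e-value cutoff, and the two example rows are all assumptions made for the example, not part of BLAST itself.

```python
import csv
import io

# Assumed column order, matching:
#   -outfmt "6 qseqid sseqid pident length evalue bitscore qlen slen"
COLUMNS = ["qseqid", "sseqid", "pident", "length",
           "evalue", "bitscore", "qlen", "slen"]


def within(ratio, center=1.0, tol=0.25):
    """True if ratio lies in [center - tol, center + tol]."""
    return abs(ratio - center) <= tol


def filter_hits(handle, max_evalue=1e-6, tol=0.25):
    """Keep hits that pass the e-value cutoff and both length-ratio checks."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    kept = []
    for row in reader:
        qlen = int(row["qlen"])
        slen = int(row["slen"])
        alen = int(row["length"])  # alignment length
        if float(row["evalue"]) > max_evalue:
            continue
        if within(qlen / slen, tol=tol) and within(qlen / alen, tol=tol):
            kept.append(row)
    return kept


# Two made-up rows: a near full-length hit and a short, domain-sized hit.
example = (
    "pep1\tsubjA\t62.5\t400\t1e-80\t310\t410\t420\n"
    "pep2\tsubjB\t90.0\t40\t1e-12\t75.0\t400\t380\n"
)

for hit in filter_hits(io.StringIO(example)):
    print(hit["qseqid"], hit["sseqid"])
```

In this made-up input the second row has a good e-value and high identity, but its alignment is only 40 residues long against a 400-residue query, so the qlen/length ratio rejects it. The tolerance of 0.25 is the value suggested above and can be loosened or tightened through the `tol` argument.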

modified 3.5 years ago • written 3.5 years ago by 5heikki8.4k

Thanks for the above paper.

The paper states that e-values and bit scores (bits > 50) are a more sensitive and reliable basis for inferring homology. I have filtered the BLAST hits using those parameters, but the confusion remains: what percent coverage (the percentage of the gene sequence's length covered by the alignment) should the hits have?

Some hits show high identity and a bit score > 50, but cover only 5-10% of the gene sequence. Can we consider these as genes? Is there any defined threshold for alignment coverage?
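
To make the coverage question concrete, the sketch below computes query and subject coverage from alignment coordinates (the `qstart`, `qend`, `sstart`, `send`, `qlen` and `slen` columns of BLAST tabular output) and labels a hit. The helper names `coverage_percent` and `classify_hit` are hypothetical, and the 50% coverage cutoff is only a placeholder for illustration: this thread does not establish a universal threshold.

```python
def coverage_percent(start, end, total_len):
    """Percent of a sequence spanned by one alignment (1-based, inclusive)."""
    lo, hi = min(start, end), max(start, end)
    return 100.0 * (hi - lo + 1) / total_len


def classify_hit(qstart, qend, qlen, sstart, send, slen, bitscore,
                 min_bits=50.0, min_cov=50.0):
    """Label a hit by bit score and by the lower of its two coverages.

    min_cov is a placeholder value, not a recommended threshold.
    """
    qcov = coverage_percent(qstart, qend, qlen)
    scov = coverage_percent(sstart, send, slen)
    if bitscore < min_bits:
        label = "not significant"
    elif min(qcov, scov) < min_cov:
        label = "partial (possible shared domain or fragment)"
    else:
        label = "near full-length"
    return qcov, scov, label


# Made-up hit: bit score above 50, but only 40 residues aligned.
qcov, scov, label = classify_hit(1, 40, 400, 101, 140, 500, bitscore=75.0)
print(f"{qcov:.1f}% query, {scov:.1f}% subject: {label}")
```

A hit like the one in this example passes the bit score filter yet covers about a tenth of either sequence, so it is evidence for a shared region rather than for the whole gene.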

written 3.5 years ago by inayat45shaikh40