ORF Finder vs BLASTX
11.2 years ago
User 3822 ▴ 60

This is kind of a stupid question. Suppose I have a contig sequence and I want to know what kind of protein it might encode (if it encodes one at all). I run it through NCBI ORF Finder and then choose to do a BLAST and COGnitor search for the longest reading frame. I get a few hits, plus conserved domains for a protein. Next I run the same contig through NCBI blastx and get hits for a couple of proteins. But the protein I found through the longest reading frame has a lower score in blastx.

How do I choose which protein it might be?

orf blast • 9.5k views

Be careful with the phrase "lower score". A lower bit score indicates a "worse" hit, whereas a lower e-value indicates a "better" hit.
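To make the opposite directions concrete, here is a minimal sketch of the usual relationship between the two numbers, E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The lengths below are hypothetical, and BLAST itself uses "effective" lengths, so its reported e-values will not match exactly.

```python
def evalue_from_bitscore(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

m = 300          # hypothetical query length (residues)
n = 50_000_000   # hypothetical database length (residues)

for s in (40, 60, 100):
    # A higher bit score always gives a lower e-value for the same search space.
    print(s, evalue_from_bitscore(s, m, n))
```

Because the e-value also depends on the search space, bit scores are the safer number to compare between searches run against different databases.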


Yes, I meant lower bit score.


Is the contig genomic or a transcript? After splicing?


I've found an interesting discussion here. I wonder what frame-shift penalty value(s) are typically used for BLASTX.

11.2 years ago
Neilfws 49k

The key thing here is that both methods are giving you useful information: just not the same information.

Let's start with the blastx result. BLASTX is a crude, quick way to see if a nucleotide sequence has protein-coding potential. It simply translates in all 6 frames and compares the resulting sequences to a protein database. If your contig contains intron sequence or sequence errors then you will not see the "true", mature protein sequence, but you will get some idea that there is one in there somewhere.
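As a rough illustration of the "translate in all 6 frames" step, here is a minimal Python sketch. It only covers the translation, not the database search, and it assumes the standard genetic code; the function names are made up for this example and are not part of BLAST.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Stop codons appear as `*` inside the translations. BLASTX aligns across them where the similarity supports it, which is why it still finds hits in contigs with introns or errors.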

ORF finders try to look for sequences that resemble true open reading frames with a start, stop, in-frame sequence in-between and perhaps other features. Traditionally (especially in prokaryotic genomics), short ORFs are discarded and the longest is chosen as the "best". However, it's important to remember that you are dealing with predictions, not experimental data. In addition, the quality of the contig sequence will have a large bearing on the quality of the predicted ORFs.
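For comparison, a naive "longest ORF" search can be sketched as below. This is a toy version under simple assumptions (ATG start, standard stop codons, both strands, no partial ORFs running off the contig end), not how NCBI ORF Finder is implemented; `min_codons` is a hypothetical cutoff standing in for "short ORFs are discarded".

```python
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end, nt_seq) for each ATG..stop ORF.

    Coordinates are 0-based, end-exclusive, on the strand that was scanned.
    The length check counts the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame + 1, start, i + 3, seq[start:i + 3]
                    start = None                   # look for the next ORF in this frame

def longest_orf(dna, min_codons=30):
    """Return the longest ORF tuple, or None if no ORF passes the length cutoff."""
    return max(find_orfs(dna, min_codons), key=lambda orf: len(orf[4]), default=None)

print(longest_orf("CCATGAAATTTGGGTAACC", min_codons=3))
```

A single inserted or deleted base in the contig moves the rest of the true ORF into another frame, so the "longest" ORF reported by a scan like this can easily be a fragment or a spurious frame.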

In summary, BLASTX is a "quick and dirty" test and an ORF finder should provide a "better" prediction of true ORFs. You would expect the BLAST scores to differ, since you are looking at slightly different sequences. But always remember that you are looking at computational predictions. Experimental validation (e.g. transcript sequencing) is the only way to determine a "true" ORF.


Thanks. I knew there was a difference between the two; it just wasn't clear enough to me.

11.2 years ago

Maybe you can also try gene prediction with MAKER. It performs several genome annotation steps, one of which is "producing ab-initio gene predictions". It relies on many other software packages (SNAP, Augustus, GeneMark, ...), so it's a bit laborious to install.

11.2 years ago
Darked89 4.2k

It all depends on the quality of your query (sequencing errors can introduce stop codons) and the level of similarity to various proteins in blastx. Whether you sequenced a genomic fragment or cDNA, you can still have a retained intron, a large ncRNA, or a chimeric clone, to name the most obvious cases.

IMO in most cases, assuming strong blastx hits (>60% similarity over ca. 40 aa), you will be better off starting with blastx. With vague, short hits to DNA of dubious quality, possibly also containing repeats, making the call "is it a gene?" is problematic. In that situation, a long, non-repetitive ORF is a good hint even without strong blastx hits.
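A rule of thumb like this can be applied to tabular BLASTX output with a few lines of Python. The sketch below assumes the search was run with `-outfmt "6 qseqid sseqid pident ppos length evalue bitscore"`, where `ppos` is the percentage of positive-scoring positions (used here as "similarity") and `length` is the alignment length. Treating 40 aa as a minimum length is my reading of the rule, and the example hits are made up.

```python
import csv
import io

# Column order is an assumption: it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "ppos", "length", "evalue", "bitscore"]

def strong_hits(tabular_text, min_ppos=60.0, min_length=40):
    """Keep hits with > min_ppos % positives over >= min_length aligned residues."""
    reader = csv.DictReader(io.StringIO(tabular_text), fieldnames=COLUMNS, delimiter="\t")
    hits = [
        row for row in reader
        if float(row["ppos"]) > min_ppos and int(row["length"]) >= min_length
    ]
    # Best hit first, ranked by bit score.
    return sorted(hits, key=lambda row: float(row["bitscore"]), reverse=True)

# Hypothetical hits, for illustration only.
example = (
    "contig1\tprotA\t55.0\t72.5\t120\t1e-30\t115\n"
    "contig1\tprotB\t80.0\t90.0\t25\t2e-05\t48.1\n"
    "contig1\tprotC\t35.0\t50.0\t200\t1e-10\t60.2\n"
)

for hit in strong_hits(example):
    print(hit["sseqid"], hit["ppos"], hit["length"], hit["bitscore"])
```

Here protB is dropped for being too short and protC for low similarity, leaving protA as the only "strong" hit.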

Tip: do not restrict yourself to blastx. Tblastn with ESTs from closely related species, or a genomic alignment, may resolve some tricky exons or less conserved parts of the protein.

11.1 years ago
Ketil 4.1k

Well, the obvious explanation is that the longest ORF is not the right one. One possible cause is a sequencing error that introduces a frame shift. BLASTX isn't too great with frame shifts either, so maybe you can check with another aligner?

When predicting ORFs, I use a dynamic programming algorithm that pieces together "compatible" BLASTX hits and includes evidence like AUG, stop codons and the poly-A tail. I think this is better than just using the longest ORF; I can dig up the graph comparing the two if you're interested.
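To give a flavour of what "piecing together compatible hits" can look like, here is a generic weighted-interval sketch. It is not the algorithm described above: it assumes all hits are on the same strand, treats hits as compatible simply when they do not overlap on the query, and ignores start codon, stop codon and poly-A evidence. The hit coordinates and scores are invented.

```python
from bisect import bisect_right

def best_chain(hits):
    """Pick the highest-scoring set of non-overlapping hits.

    hits: list of (qstart, qend, score) with inclusive integer query coordinates.
    Returns (total_score, chosen_hits), with chosen hits ordered along the query.
    """
    hits = sorted(hits, key=lambda h: h[1])        # order by end coordinate
    ends = [h[1] for h in hits]
    best = [(0.0, [])]                             # best[i]: optimum over the first i hits
    for i, (qstart, qend, score) in enumerate(hits):
        # j = number of earlier hits that end before this hit starts
        j = bisect_right(ends, qstart - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [hits[i]])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Hypothetical hits: (query start, query end, bit score)
hits = [(1, 120, 80.0), (100, 300, 150.0), (310, 450, 60.0), (130, 290, 90.0)]
total, chain = best_chain(hits)
print(total, chain)
```

In this toy case the single best hit (score 150) is left out, because three smaller hits that fit together score higher in total. That is the sort of outcome where a chained prediction and the "best hit" or "longest ORF" disagree.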
