Question: Why does a Markov model depend on the size of the dataset?
2.5 years ago by
Farbod3.2k
Toronto
Farbod3.2k wrote:

Dear Friends, Hi

I have used several programs (mentioned here) to look for potential ORFs and coding ability in some of my transcripts that had no hits after performing BLAST.

Interestingly (or perhaps through bad luck), there was no overlap between the results of those programs.

I have heard that most of these ORF finders are based on a Markov model that is trained on the full data set. If you run one on only a small set of sequences, it is not going to be trained properly and the false-positive rate of ORF prediction will be high.

1- Is this really the reason for the lack of overlap between the results?

2- Why does a Markov model depend on the input size/dataset?

3- Doesn't it analyse each sequence separately?

~ Thank you in advance

hmm sequence gene software error • 1.0k views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k
2.5 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

Markov Models learn from the training data set and apply that "knowledge" to your dataset. Like any statistical model, the power of the test goes up with the sample size. That being said, there is always a lower end to the sample size - something you cannot go under, because that would render the test meaningless.

The more practice the model has distinguishing actual results from coincidental outcomes, the better it should perform in non-training scenarios. The model you're using should come with a recommended amount of training data for efficient analysis, that is, enough to give the lowest false discovery rate at an acceptable sensitivity.
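
To make the sample-size point concrete, here is a minimal sketch of how a fixed-order Markov model of the kind used for coding-potential scoring might be trained. It is not the code of any particular ORF finder: the function names `train_markov` and `coverage`, the order of 5, and the add-one pseudocount are all assumptions chosen for illustration, and the random sequences only stand in for real training transcripts. A 5th-order model over A/C/G/T has 4^6 = 4096 transition parameters, and a handful of transcripts simply cannot provide observations for most of them.

```python
from collections import Counter
from itertools import product
import random

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Assumes uppercase sequences containing only A, C, G and T.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - order):
            counts[(s[i:i + order], s[i + order])] += 1
    probs = {}
    for ctx in map("".join, product("ACGT", repeat=order)):
        total = sum(counts[(ctx, b)] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            # Unseen transitions fall back to the pseudocount alone.
            probs[(ctx, b)] = (counts[(ctx, b)] + pseudocount) / total
    return probs, counts

def coverage(counts, order=5):
    """Fraction of the 4**(order+1) parameters seen at least once in training."""
    return len(counts) / 4 ** (order + 1)

def random_seq(n, rng):
    """Hypothetical stand-in for a real training transcript."""
    return "".join(rng.choice("ACGT") for _ in range(n))

if __name__ == "__main__":
    rng = random.Random(0)
    small = [random_seq(300, rng) for _ in range(5)]
    large = [random_seq(300, rng) for _ in range(2000)]
    for name, data in (("small", small), ("large", large)):
        _, counts = train_markov(data)
        print(name, "training set: parameter coverage =", round(coverage(counts), 3))
```

Each sequence is still scored separately, but it is scored against parameters learned from the whole training set, so with a small set most of those parameters are just the pseudocount default rather than anything learned from data.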

I know this sounds vague - I hope someone can explain this in a better, more grounded fashion.

ADD COMMENTlink written 2.5 years ago by RamRS21k

Hi and thanks.

Imagine that we have only one transcript (a single string of nucleotide sequence) and we want to check whether it has the potential to code for any protein (even theoretically).

Do we need to add a bunch of other transcripts to it to receive a more accurate answer?

That seems really bizarre!

ADD REPLYlink written 2.5 years ago by Farbod3.2k
1

Hi Farbod: As long as you have a DNA sequence you can translate it into a protein (in all 6 frames if you want). I doubt there is any theoretical method that is going to give you a "confidence prediction" (setting aside similarity searches/modeling since we have gone over those already in other threads) that the peptide(s) you see are actually going to be present in your fish.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

Exactly. When we were assembling transcriptomes, we would translate each putative transcript in all reading frames, pick the longest protein-coding ORF and BLAST it against related organisms. (We actually pooled the transcripts and reciprocal-BLASTed them against a database for a related organism so we could be more confident.)

ADD REPLYlink written 2.5 years ago by RamRS21k

OK,

I am working on BLAST-LESS transcripts.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k

What do you mean by BLAST-LESS transcripts?

ADD REPLYlink written 2.5 years ago by RamRS21k

I mean I performed the BLAST, those transcripts showed no hit. (hit-less)

ADD REPLYlink written 2.5 years ago by Farbod3.2k

Try a BLASTX against a relevant protein database.

ADD REPLYlink written 2.5 years ago by RamRS21k

I have done it against NCBI nr

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k

It works better if your database has more relevant sequences and not every single sequence in the known universe :)

ADD REPLYlink written 2.5 years ago by RamRS21k

I have done that before.

Not much luck.

ADD REPLYlink written 2.5 years ago by Farbod3.2k

Do you have any thoughts on why you don't see results?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by RamRS21k

Yes, I have assumed that (1) some of them are assembly/sequencing errors, and (2) some of them may represent novel genes.

I intend to capture the second group using PCR.

To decide whether they are worth a PCR, I am starting by checking whether they are coding.

Please correct me if I am missing something.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k

Hi genomax2,

Please recommend good software to "translate it into a protein".

I guess one of the best is TransDecoder.

I also want to know which part of the Markov model formula produces this restriction.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k
3

The Viterbi algorithm is at the heart of hidden Markov models. Many bioinformatics applications use profile hidden Markov models, where the "profile" is built from a multiple sequence alignment so that we can get residue frequencies at each position of the sequence. The fewer example sequences you have for training, the less representative that profile is of the variation potentially present in the data you want to query.

Tools like HMMER smooth those frequencies with pseudocounts (HMMER itself uses Dirichlet priors rather than plain Laplace smoothing). As a simplified illustration, if you add one to the denominator of the frequency calculation, a single training sequence gives a frequency of 0.5 at each position for the residue that matches your training sequence. Though it can be done this way and sometimes produces surprisingly accurate results, you're almost better off using BLAST-like methods unless your question involves querying for highly degenerate sequences.
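
A minimal sketch of the per-position frequency calculation may help here. It uses textbook add-one (Laplace) smoothing over the four nucleotides, which is an assumption made for illustration and not what HMMER actually implements; the function name `column_frequencies` is hypothetical, and the input is assumed to be gap-free, equal-length aligned DNA. The exact numbers depend on the smoothing scheme: under this one, a single training sequence gives 0.4 for the observed base and 0.2 for every other base at each position.

```python
from collections import Counter

ALPHABET = "ACGT"

def column_frequencies(aligned_seqs, pseudocount=1.0):
    """Per-column base frequencies with add-`pseudocount` smoothing.

    Assumes gap-free, uppercase sequences of equal length.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        profile.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return profile

if __name__ == "__main__":
    one = column_frequencies(["ACGT"])
    many = column_frequencies(["ACGT", "ACGA", "ATGT", "ACGT", "ACGT", "ACGT"])
    print("1 sequence,  column 0:", one[0])
    print("6 sequences, column 0:", many[0])
```

With one sequence the profile is dominated by the pseudocounts; as more training sequences are added, the observed counts take over and the profile starts to reflect real variation at each position.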

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Steven Lakin1.4k

Dear Steven Lakin, Hi and thank you for your complete and informative answer.

I have found this page about hidden Markov models. Would you please tell me:

1- Do programs such as TransDecoder and EMBOSS getorf use a Markov model or a hidden Markov model?

2- On the wiki page I mentioned above, which part of the formula depends on "frequencies at each position of your sequence"? Or should I be asking the same question about the Viterbi algorithm?

Take Care

ADD REPLYlink written 2.5 years ago by Farbod3.2k

EMBOSS' getorf can do this for you.

ADD REPLYlink written 2.5 years ago by RamRS21k

Dear Ram, Hi

Do you know of any paper comparing ORF-finder software?

ADD REPLYlink written 2.5 years ago by Farbod3.2k

No I don't, sorry.

I do have a question for you - why do we need probabilistic models for ORF prediction? Is the genetic code different for your species? Creating sets of 3 starting at seq[0],seq[1],seq[2],revcomp(seq)[0],revcomp(seq)[1] and revcomp(seq)[2], then translating them and finding the longest protein sounds like a pretty straightforward computation to me - am I missing something here?
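
A minimal sketch of that six-frame computation, with no probabilistic model involved, could look like the following. The function names are hypothetical, the standard genetic code is assumed, and an ORF is taken to run from the first methionine in a stop-free stretch to the next stop codon (or to the end of the sequence if no stop is reached); other definitions of an ORF would need small changes.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def longest_orf(seq):
    """Return (protein, strand, frame) for the longest M-initiated ORF in 6 frames."""
    seq = seq.upper()
    best = ("", None, None)
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            # Stop codons ("*") split the frame into stop-free segments.
            for segment in translate(s[frame:]).split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[0]):
                    best = (segment[start:], strand, frame)
    return best

if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC"))
```

This finds every candidate ORF deterministically; it says nothing about whether the longest one is likely to be a real coding sequence, which is the part that trained models try to estimate.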

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by RamRS21k
1

@Farbod is trying to reduce the number of experiments that need to be done (as best as I can tell).

At some point (bio)informatic options cease to provide useful hypotheses and one has to go back to the experimental bench to test/discredit hypotheses at hand. @Farbod seems to be having a hard time reconciling with that fact.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

Dear genomax, Hi

I cannot run PCR for 200 transcripts right now, so I need to choose some of them wisely. I am starting by checking the coding ability of these 200 nucleotide sequences.

And your hypothesis about "having a hard time reconciling with that fact" is not true; I just do not intend to spend money for nothing.

thank you anyway.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k

Hi Farbod: The following will sound unpleasant, but there is no other way to say this. You are at a point where you need to go ahead, choose as many PCRs as you can afford to do, and start some experiments. You should quickly get an answer to your question: is it worth going forward with the rest?

I suspect there is not much useful left to do on the informatics end (after all the work you have put in) that would help you narrow the selection to candidates with guaranteed success.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax67k

Makes sense. As an aside, is there any reason one would need an HMM to find ORFs?

ADD REPLYlink written 2.5 years ago by RamRS21k

Hi,

Unfortunately I am not familiar with this "seq[0],seq[1],seq[2],revcomp(seq)[0],revcomp(seq)[1] and revcomp(seq)[2]" approach.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Farbod3.2k

What is an approach to finding ORFs that you are familiar with? Algorithm-wise, that is?

ADD REPLYlink written 2.5 years ago by RamRS21k