Question: Distinguishing Between Real DNA Strings vs Random DNA Strings
Zakir Naik (30) wrote 7.1 years ago:

I have come up with an in silico method to differentiate between a real DNA string (originating from a real genome) and a computer-generated random DNA string. The program does not perform any alignments against sequences in existing databases and takes no more than a minute to complete the job on a 2.33 GHz desktop with 2 GB of RAM.

The only condition: the input strings should be at least 5000 characters long.

Are there any practical applications of this method?

genomics dna • 1.8k views
modified 7.1 years ago by brentp (23k) • written 7.1 years ago by Zakir Naik (30)

Did you just do it for fun? Just curious.

written 7.1 years ago by Madelaine Gogol (5.0k)

I guess it was fun to do. But did you try your method on structured English text vs randomly generated alphanumeric strings? I'm just curious too.

written 7.1 years ago by Monzoor (290)

Yes, I did start this for fun. But the nearly 100% accurate results I got made me wonder whether the overall method had any practical value. I now have to modify the technique so that it can work with smaller strings. Though it may not be easy, the modified method may be useful for weeding out sequencing artefacts in NGS data sets.

written 7.1 years ago by Zakir Naik (30)

Thanks, Monzoor. I will check whether I can tune it for English text vs random text. BTW, what are the applications of such a program (if it ever works for English vs random text)?

written 7.1 years ago by Zakir Naik (30)
brentp (23k), Salt Lake City, UT, wrote 7.1 years ago:

I assume this is some sort of supervised learning?

I can think of some questions I'd try to answer (which ignore any biology):

Can you extract from your classifier something that tells you what it is that differentiates real sequences from generated sequences?

How much can you permute a real sequence before it is classified as "fake"? Does it take 10 mutations? 100 mutations? 1 inversion?

Does the classifier report a probability? If so, can you find sequences that have low/high probability?

How far can you reduce the size of the learning set and still get good results? If you only use sequences from fungus, can it still recognize real human sequences?

written 7.1 years ago by brentp (23k)
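The perturbation experiment suggested above (how many mutations, or a single inversion, before a real sequence is called "fake") could be set up roughly as follows. This is only a sketch: the classifier itself is not described in the thread, so `is_real` is a hypothetical callable standing in for it, and the helper names `mutate`, `invert` and `mutations_until_flip` are made up for illustration.

```python
import random

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mutate(seq, n_mutations, rng):
    """Return a copy of seq with n_mutations substitutions at distinct random positions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_mutations):
        # Always pick a base different from the current one, so every hit is a real change.
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)

def invert(seq, start, end):
    """Replace seq[start:end] by its reverse complement (a simple inversion)."""
    segment = seq[start:end].translate(_COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[end:]

def mutations_until_flip(seq, is_real, steps=(10, 100, 1000), seed=0):
    """Return the smallest tested mutation count at which is_real() turns False, else None.

    is_real is a placeholder for whatever classifier is being probed.
    """
    rng = random.Random(seed)
    for n in steps:
        if n > len(seq):
            break
        if not is_real(mutate(seq, n, rng)):
            return n
    return None
```

Each tested count is applied to the original sequence independently, so the result says "somewhere at or below this many random substitutions the call flips", not that every set of that many substitutions will flip it.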

Your comments were very interesting.

1. Yes, it is supervised learning. The calculations were based on the overall entropy of the given sequences. It had nothing to do with biological features like GC content, codon usage, etc.
2. The permutation threshold is something I will try out and let you know.
3. The idea of adding a probability score is the best suggestion I have received so far. I will do that right away.
4. Both fungus and human sequences are real biological sequences. My method as of now cannot distinguish between them. But this observation of yours has given me a lot of food for thought.

written 7.1 years ago by Zakir Naik (30)
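The thread does not say how the entropy calculation mentioned above was actually done. Purely as an illustration of the general idea, the sketch below computes the Shannon entropy of the overlapping k-mer distribution of a string. The names `kmer_entropy` and `random_dna`, the choice k = 3, and the repetitive test string are all assumptions made for this example and are not the poster's method.

```python
import math
import random
from collections import Counter

def kmer_entropy(seq, k=3):
    """Shannon entropy, in bits, of the overlapping k-mer distribution of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def random_dna(length, seed=None):
    """Uniform random string over A, C, G, T (the 'computer generated' case)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

if __name__ == "__main__":
    uniform = random_dna(5000, seed=0)
    # A highly repetitive string, used here only as a stand-in for
    # non-random structure; it is not real genomic data.
    repetitive = "ACGTTTGACA" * 500
    print("uniform random:", kmer_entropy(uniform))
    print("repetitive:    ", kmer_entropy(repetitive))
```

For k = 3 there are 64 possible k-mers, so the entropy cannot exceed 6 bits, and a long uniform random string should come out close to that ceiling. Any structure in the input pulls the value down. Whether the gap between real genomic sequence and random sequence is large enough to classify on would have to be checked on actual data.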
Philippe (1.9k), Barcelona, Spain, wrote 7.1 years ago:

Hi,

So far I don't see any practical application. That does not mean there are none, of course. We generally have to deal with biological contamination rather than "digital" contamination.

The only thing I can think of is checking whether a sequencing machine generated an artefact/random sequence while writing its output. However, the interest is still limited because:

  • this is highly unlikely to happen
  • in a high-throughput experiment you can tolerate some reads not being used (i.e. they won't map to the reference genome, or they are unlikely to strongly support a non-existing contig in de novo assembly)
  • so far, no technology can sequence 5000 nucleotides in a row, so your method won't be usable. Oxford Nanopore Technologies claim they can, but so far no data has been publicly released
  • the generation of a non-existing sequence by a machine does not mean it has been randomly generated; it can be the result of the concatenation/shuffling of actual reads.

The algorithm can still be useful, for example as an illustration of machine learning (even though I don't know which method you used), but in my opinion it does not seem to have any immediate practical application.

written 7.1 years ago by Philippe (1.9k)

Regarding your point 4, I don't think he is checking whether the sequence exists (he doesn't do any alignment). I suppose he looks at GC content/placement, codon bias, etc., which you don't expect in random sequences. I have no idea when anyone would have a totally random sequence and not know that it isn't DNA, though.

written 7.1 years ago by Niek De Klein (2.5k)

I did not make this assumption. I wanted to say that if you want to detect a machine's erroneous output, his method will only be able to detect whether the machine wrote something totally random, and not whether the resulting erroneous sequence is actually a mix of other (biological) sequences. Sorry if that was not clear.

written 7.1 years ago by Philippe (1.9k)

Thanks to all for the comments/perspectives. One more naive question: assuming I tweak the method to work for, say, 2000 bp strings, does it have any value?

written 7.1 years ago by Zakir Naik (30)