Question

Blog:Exceptional Symmetry of DNA Sequences

7.4 years ago

kevberm • 0

Codondex Blog reports a new sequence symmetry. In one example, a TP53 intron transcript of 400 nucleotides (letters) produced more than 77,000 subsequences containing a total of 10,735,712 letters. From these we made a striking observation of sequence symmetry that could not have been made without generating the subsequences.

RNA-Seq next-gen-sequencing • 1.8k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 7.4 years ago by kevberm • 0

Umm, OK, so this seems to have something to do with kmers of various sizes or something...but it's difficult to parse out exactly what. Mostly, this looks like gibberish.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

These are like k-mers, but of variable length, starting from 7-mers. Assuming a sequence of length L=400, we compute the 399-mers, 398-mers, 397-mers ... down to the 7-mers. Then we query every k-mer against all k-mers and count its recurrence.

ADD REPLY • link 7.4 years ago by kevberm • 0
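
The numbers in the original post can be checked against this description with a little arithmetic. Below is a minimal sketch, assuming that "subsequence" means every contiguous window at every start position; the function name `tally` is made up for illustration. For L=400, lengths 7 through 399 give 77,814 windows and 10,738,070 letters, while lengths 8 through 400 reproduce the quoted 10,735,712 letters exactly. Which range the Codondex tool actually uses is not stated in the thread.

```python
def tally(seq_len, kmin, kmax):
    """Return (number of windows, total letters) for window lengths kmin..kmax."""
    count = 0
    letters = 0
    for k in range(kmin, kmax + 1):
        n_windows = seq_len - k + 1  # start positions available for length k
        count += n_windows
        letters += n_windows * k
    return count, letters


print(tally(400, 7, 399))  # (77814, 10738070)
print(tally(400, 8, 400))  # (77421, 10735712)
```

Either range is consistent with "more than 77,000 subsequences"; only the letter total tells them apart.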

I tried, before and after drinking coffee, but I can't really see what the message is behind your blog post. Can you explain?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

Coffee? You'll need scotch for this! These are like k-mers, but of variable length, starting from 7-mers. Assuming a sequence of length L=400, we compute the 399-mers, 398-mers, 397-mers ... down to the 7-mers. Then we query every k-mer against all k-mers and count its recurrence. We found that around 20% of k-mer instances share an identical recurrence count with exactly one other k-mer. You can fiddle with this symmetry here - http://www.codondex.com/analysis/iscore/demo - Also, we count recurrence in two ways, in bigger-length k-mers only and ignoring length, and developed this ranking algorithm, 'iScore': [(ignore-length k-mer recurrence - bigger-length k-mer recurrence) / Length]. Despite length normalization, k-mer recurrence retains length ordering for more than 95% of the k-mers of the sequence. So far we have found that the ranking pinpoints specifics in existing research: miRNAs, LINE/SINE elements, splicing junctions or other...

ADD REPLY • link 7.4 years ago by kevberm • 0
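
For anyone trying to follow the description above, here is a brute-force sketch of one possible reading of it. It assumes "recurrence" means the number of generated windows that contain a given k-mer as a substring, counted once over all windows ("ignore length") and once over strictly longer windows ("bigger length"). All function names are hypothetical, and this is not Codondex's implementation.

```python
from collections import defaultdict


def windows(seq, kmin, kmax):
    """All substrings of seq with length kmin..kmax, one per start position."""
    return [seq[i:i + k]
            for k in range(kmin, kmax + 1)
            for i in range(len(seq) - k + 1)]


def recurrence(seq, kmin, kmax):
    """Map each distinct k-mer to (count in all windows, count in longer windows)."""
    all_windows = windows(seq, kmin, kmax)
    counts = {}
    for q in set(all_windows):
        any_len = sum(1 for w in all_windows if q in w)
        longer = sum(1 for w in all_windows if len(w) > len(q) and q in w)
        counts[q] = (any_len, longer)
    return counts


def iscore(any_len, longer, length):
    """The formula as quoted in the post, applied to two given counts."""
    return (any_len - longer) / length


def identical_recurrence_groups(counts):
    """Group same-length k-mers whose longer-window recurrence is identical."""
    groups = defaultdict(list)
    for q, (_, longer) in counts.items():
        groups[(len(q), longer)].append(q)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


print(identical_recurrence_groups(recurrence("ACGT", 2, 3)))
# [['AC', 'GT'], ['ACG', 'CGT']]
```

Under this reading the numerator of the formula reduces to the number of same-length matches, since only windows of equal length are counted by "ignore length" but not by "bigger length". If the author counts something else, the numbers will differ. The brute-force loop is only practical for short sequences.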

It sounds rather like you've observed one of the many, many fingerprints of evolution.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

I interpret this as you finding k-mers that match repetitive sequences? Besides those, does this 'recurrence' show statistical significance? (I'm not sure how this is related to symmetry.) In other words, is the recurrence of some non-repetitive 7-mers higher than expected by chance?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k
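
As a rough baseline for "expected by chance": under the simplest null model (each letter drawn independently and uniformly from A, C, G, T, which real sequences do not follow), a specific k-mer is expected to occur (L - k + 1) / 4^k times in a sequence of length L. A small sketch, with a made-up function name:

```python
def expected_occurrences(seq_len, k, alphabet_size=4):
    """Expected count of one specific k-mer in a uniform i.i.d. random sequence."""
    return (seq_len - k + 1) / alphabet_size ** k


# For L=400 and k=7 this is 394 / 16384, roughly 0.024.
print(expected_occurrences(400, 7))
```

So in a 400-letter sequence, any particular 7-mer appearing more than once would already be unusual under this naive model.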

The k-mers are repetitive only by recurrence, not by sequence text. We will run some random assessments and get back to you, but I doubt this will be chance. k-mer image - look at col B (these are k-mer IDs) and col L (their recurrence count). These L=10 sequences each recur 913185 times in the bigger-length k-mers of this transcript.

ADD REPLY • link 7.4 years ago by kevberm • 0

I think I still don't get it. The transcript ENST00000576024 is 2724 bp long, so how do you fit 913185x CTGGAGTGCA in there? To be more precise, the k-mer 'CTGGAGTGCA' doesn't occur a single time in the sequence of that transcript.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

Presumably this is the recurrence in all transcripts/features/whatever that they looked at. But yeah, the presentation here isn't terribly clear.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

Our focus has been on the intron. You can access it here - http://www.codondex.com/sequences/ENST00000576024 - and you will see these two occurrences, of CTGGAGTGCA and CTGCCTCAGC, in the base sequence. Our pattern/amplification produced 1,408,680 k-mers (more than 150 million letters) from the base sequence. Notwithstanding the letter differences, the two k-mers in col B, #97022 and #127262, each recurred 913185x in the query of all bigger-length k-mers. As in this example, identical recurrence for pairs of different k-mers occurs in ~15-20% of k-mers for multiple gene transcripts. Hope this helps to clarify.

ADD REPLY • link 7.4 years ago by kevberm • 0

score 0 · Answer 1 · 2016-12-14

So we ran a random comparison on 50 transcripts, using two methods: 1. randomly selecting each letter from A, G, C, T (roughly a 25% distribution each); 2. reordering (scrambling) the letters of the sequence. Bottom line: random selection is significantly different, by a factor of 2x. Reordered letters are close enough that there is no distinction. We did some further analysis of the symmetry for a MEN1 transcript, T702. We found long interconnected symmetries, which we identify below as Group1-9. Between each group there is a gap of 1 or 2 non-symmetrical letters. We are investigating further.
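
The two controls described above could be generated along these lines. This is a minimal sketch, assuming method 1 draws each letter independently and uniformly, and method 2 is a plain shuffle that preserves letter composition. The function names are hypothetical and the post does not say how the controls were actually produced.

```python
import random
from collections import Counter


def uniform_random(length, rng):
    """Method 1: each letter drawn independently from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def shuffled(seq, rng):
    """Method 2: same letters as seq, in random order."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


rng = random.Random(0)  # seeded so a run can be repeated
original = "ACGTTTGGCA" * 5
control = shuffled(original, rng)
assert Counter(control) == Counter(original)  # composition is preserved
```

A shuffle keeps the base composition of the real transcript, while uniform sampling does not, which is one plausible reason the two controls behave differently.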

Group1 [firstSequenceStart=0, firstSequenceEnd=83, secondSequenceStart=597, secondSequenceEnd=680] [[84, 85], [595, 596]]
Group2 [firstSequenceStart=79, firstSequenceEnd=111, secondSequenceStart=569, secondSequenceEnd=601] [[112], [568]]
Group3 [firstSequenceStart=106, firstSequenceEnd=143, secondSequenceStart=537, secondSequenceEnd=574] [[144], [536]]
Group4 [firstSequenceStart=138, firstSequenceEnd=231, secondSequenceStart=449, secondSequenceEnd=542] [[232], [448]]
Group5 [firstSequenceStart=226, firstSequenceEnd=247, secondSequenceStart=433, secondSequenceEnd=454] [[248], [432]]
Group6 [firstSequenceStart=242, firstSequenceEnd=260, secondSequenceStart=420, secondSequenceEnd=438] [[261], [419]]
Group7 [firstSequenceStart=255, firstSequenceEnd=269, secondSequenceStart=411, secondSequenceEnd=425] [[270], [410]]
Group8 [firstSequenceStart=264, firstSequenceEnd=273, secondSequenceStart=407, secondSequenceEnd=416] [[274], [406]]
Group9 [firstSequenceStart=268, firstSequenceEnd=411, secondSequenceStart=269, secondSequenceEnd=412]
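
The coordinates listed above can be checked for a simple positional pattern. The sketch below copies the four coordinates of each group and adds the start of the first sequence to the end of the second, and the end of the first to the start of the second. In every group both sums come out to 680, which is what one would see if each pair sat at mirror positions about the midpoint of a sequence indexed 0 to 680. That sequence length is an inference from the numbers, not something the post states.

```python
# (firstSequenceStart, firstSequenceEnd, secondSequenceStart, secondSequenceEnd)
# copied from the Group1-9 listing
groups = {
    "Group1": (0, 83, 597, 680),
    "Group2": (79, 111, 569, 601),
    "Group3": (106, 143, 537, 574),
    "Group4": (138, 231, 449, 542),
    "Group5": (226, 247, 433, 454),
    "Group6": (242, 260, 420, 438),
    "Group7": (255, 269, 411, 425),
    "Group8": (264, 273, 407, 416),
    "Group9": (268, 411, 269, 412),
}


def mirror_sums(first_start, first_end, second_start, second_end):
    """Sums that are constant when the two spans mirror each other."""
    return first_start + second_end, first_end + second_start


sums = {name: mirror_sums(*coords) for name, coords in groups.items()}
for name, pair in sums.items():
    print(name, pair)  # every group prints (680, 680)
```

The gap positions follow the same pattern (for example 84 + 596 and 85 + 595 in Group1), so it may be worth checking whether the symmetry depends on the letters at all or only on position within the sequence.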