Question

Need some help to develop some python code to analyze kmer data

0

Entering edit mode

3 months ago

schlogl ▴ 160

I am working in a code for doing some statistical analysis in kmers. It is kind of over-underrepresented kmers in dna sequences. The statistical implementation takes in consideration all sub-words that are component inside the kmer/n-grams/n-tuples in analysis, like dimers (two letters words), trimers (three letters words), tetramers... I will called simply words1 and words2

I need help to implement a function or functions to help me out to:

1- receive as input two dna sequences (string1 and string2) and one alphabet (ACGT).

2- Compare the characters present in string2 that are present in string1 and get the index of each character present. I have this function for that:

  def get_shared_idxs(word1, word2):
    """
    > kmer = "ACGCAT"
    > get_shared_idxs(kmer, "AA")
    [0, 4]
    """
    shared = []
    for i, c in enumerate(word1):
        if c in word2:
            shared.append(i)
    return shared

2- Get a combinations of sub-words, for example: list_sub_words = ['AC','AG','AC','AA','AT','CG','CC','CA','CT', 'GC','GA','GT','CA','CT','AT']. I have this function to get all the subwords(words2) I need

def get_kmer_combo(kmer, alphabet, repeats=1):
    """
    > alphabet = "ACGT"
    > kmer = "ACGCAT"
    >get_kmer_combo(kmer, alphabet, repeats=1)
    ['AC', 'AG', 'AC', 'AA', 'AT', 'CG', 'CC', 'CA', 
    'CT', 'GC', 'GA', 'GT', 'CA', 'CT', 'AT']
    """
    return ["".join(c) for c in combinations(kmer, r=repeats)]

3- Then I compare both words (words1 and words2) and find in words2 the gap positions represented by index missing in words2 that are not in word1. For example, word2 = "AA", indices that are in word1 [0,4], so to AA x ACGCAT -> A CGC A It miss indices [1,2,3]. I have this function:

    def get_missing_idxs(word1, present_idxs):
        kmer_idxs = [i for i in range(len(word1))]
        return set(kmer_idxs).difference(set(present_idxs))

# for "AA"
    missing_index = [i for i in get_missing_idxs(kmer, [0,4]) if i < max([0,4])]
[1, 2, 3]

3- Then with the missing indexes I can generate generate combinations with the alphabet letters (according to the length of the missing positions in the case of "AA" = [1, 2, 3] ) and use it to fulfill the gaps or the missing positions in string2 (AA to get AnnnA = [AAAAA, AAACA, AAAGA, AAATA,AACAA...] until return the 4^3 combinations.

4- If for example the indices are AC = [0,1] I check for this consecutive indices

def is_consecutive(indices):
    """
    > is_consecutive([0,2,3]), is_consecutive([2,3,4])
    (False, True)
    """
    return sorted(indices) == list(range(min(indices), max(indices) + 1))

If the indices are consecutives I return the word2 as it is or I do not need fill it.

I could hard code some functions, but I want to generalize the code.Because I need all duplets, triplets, quadruplets and pentamers, from the word1 as show in the examples below. Take a look in some examples:

# number of combinations is the number of sub-words needed

**string1** = ACGCAT 
**indices** = [0,1,2,3,4,5]
**alphabet** = "ACGT"

**string2**    **indices**    `number of combinations`
  AC            [0,1]          1
  AG            [0,2]          4 # AAG ACG AGG ATG
  AC            [0,3]         16 # AAAC AACC AAGC AATC ...
  AA            [0,4]         64
  AT            [0,5]        256
  CA            [1,4]         16
  CT            [1,5]         64
  CG            [1,2]          1
  CC            [1,3]          4
  GC            [2,3]          1
  GA            [2,4]          4
  GT            [2,5]         16
  CA            [1,4]         16
  CT            [1,5]         64
  AC            [0,3]         16
  ACG           [0,1,2]        1
  ACC           [0,1,3]        4
  ACA           [0,1,4]       16
  ACT           [0,1,5]       64
  AGC           [0,2,3]        4
  AGA           [0,2,4]       16
  AGT           [0,2,5]       64
  ACA           [0,1,4]       16
  ACT           [0,1,5]       64
  AAT           [0,4,5]       64
  CGC           [1,2,3]        1
  CGA           [1,2,4]        4
  CGT           [1,2,5]       16
  CnA           [1,3,4]        4
  CCT           [1,3,5]       16
  CAT           [1,4,5]       16
  GCA           [2,3,4]        1
  GCT           [2,3,5]        4
  GAT           [2,4,5]        4
  CAT           [3,4,5]        1
  ACGC          [0,1,2,3]      1
  ACGA          [0,1,2,4]      4
  ACGT          [0,1,2,5]     16
  ACCA          [0,1,3,4]      4 
  ACCT          [0,1,3,5]     16
  ACAT          [0,1,4,5]     16
  AGCA          [0,2,3,4]      4
  AGCT          [0,2,3,5]     16
  AGAT          [0,2,4,5]     16
  ACAT          [0,1,4,5]     16
  CGCA          [1,2,3,4]      1
  CGCT          [1,2,3,5]      4
  CGAT          [1,2,4,5]      4
  CCAT          [1,3,4,5]      4
  GCAT          [2,3,4,5]      1
  ACGCA         [0,1,2,3,4]    1
  ACGCT         [0,1,2,3,5]    4
  ACGAT         [0,1,2,4,5]    4
  ACCAT         [0,1,3,4,5]    4
  AGCAT         [0,2,3,4,5]    4
  CGCAT         [1,2,3,4,5]    1

If you guys have some tips and want to share it, all help will be very well accepted.

Thank you for your time and help.

Observation: The "ns" were inserted in the string2 are just illustrative and they are not in the real sequences. They are the missing positions or gaps I need to fulfill with the letters of the alphabet.

string2    indices    number of combinations(4^n)
  AC        [0,1]          1 # consecutive indices return AC
  AnG       [0,2]          4 # AAG,ACG,AGG,ATG
  AnnC      [0,3]         16 #AAAC AACC AAGC AATC ACAC ACCC...
  AnnnA     [0,4]         64 # AAAAA AAACA AAAGA AAATA ...
  AnnnnT    [0,5]        256 # AAAAAT AAAACT AAAAGT AAAATT...
  CnnA      [1,4]         16
  CnnnT     [1,5]         64
  CG        [1,2]          1
  CnC       [1,3]          4
  GC        [2,3]          1
  GnA       [2,4]          4
  GnnT      [2,5]         16
  CnnA      [1,4]         16
  CnnnT     [1,5]         64
  AnnC      [0,3]         16
  ACG       [0,1,2]        1
  ACnC      [0,1,3]        4
  ACnnA     [0,1,4]       16
  ACnnnT    [0,1,5]       64
  AnGC      [0,2,3]        4
  AnGnA     [0,2,4]       16
  AnGnnT    [0,2,5]       64
  ACnnA     [0,1,4]       16
  ACnnnT    [0,1,5]       64
  AnnnAT    [0,4,5]       64
  CGC       [1,2,3]        1
  CGnA      [1,2,4]        4
  CGnnT     [1,2,5]       16
  CnCA      [1,3,4]        4
  CnCnT     [1,3,5]       16
  CnnAT     [1,4,5]       16
  GCA       [2,3,4]        1
  GCnT      [2,3,5]        4
  GnAT      [2,4,5]        4
  CAT       [3,4,5]        1
  ACGC      [0,1,2,3]      1
  ACGnA     [0,1,2,4]      4
  ACGnnT    [0,1,2,5]     16
  ACnCA     [0,1,3,4]      4 
  ACnCnT    [0,1,3,5]     16
  ACnnAT    [0,1,4,5]     16
  AnGCA     [0,2,3,4]      4
  AnGCnT    [0,2,3,5]     16
  AnGnAT    [0,2,4,5]    16
  ACnnAT    [0,1,4,5]    16
  CGCA      [1,2,3,4]     1
  CGCnT     [1,2,3,5]     4
  CGnAT     [1,2,4,5]     4
  CnCAT     [1,3,4,5]     4
  GCAT      [2,3,4,5]     1
  ACGCA     [0,1,2,3,4]   1
  ACGCnT    [0,1,2,3,5]   4
  ACGnAT    [0,1,2,4,5]   4
  ACnCAT    [0,1,3,4,5]   4
  AnGCAT    [0,2,3,4,5]   4
  CGCAT     [1,2,3,4,5]   1

Python R • 338 views

ADD COMMENT • link 3 months ago by schlogl ▴ 160

score 0 · Answer 1 · 2024-01-23

0

Entering edit mode

3 months ago

colindaven 6.4k

I think you can look at a known algorithm such as knuth-morris-pratt for some inspiration. Aho-corasick might be worth a look as well.

eg https://github.com/TheAlgorithms/Python/blob/master/strings/knuth_morris_pratt.py

You can look at over and under-representation of kmers using a chi-squared stats test.

ADD COMMENT • link 3 months ago by colindaven 6.4k

0

Entering edit mode

colindaven Hi there. Thanks for your replay. The statistics I am trying to emulate are the ones karlin and co-workers karlin, Elhai Elhai used a while ago and also Pevzners et al., where they take in account all sub-words in consideration.

ADD REPLY • link 3 months ago by schlogl ▴ 160