Question

How To Generate Multi-Nucleotide Occupancy Counts For Each Coordinate Of My Reads?

15.7 years ago

Biostar User ★ 1.0k

I need to generate nucleotide occupancy counts for each position of a given sequence, then sum them over all of the input sequences. An example of the desired output (for the dinucleotide AT):

[Image: dinucleotide occupancy]

python nucleotide-frequency • 3.0k views

Updated 19 months ago by Ram 45k

Ram · Answer 1 · 2009-10-05

The code snippet below populates the store dictionary, whose keys are the nucleotide patterns and whose values are lists holding the occupancy count at each index. (The updated answer now handles patterns of arbitrary length.)::

from itertools import count

def pattern_update(sequence, width=2, store=None):
    """
    Accumulates nucleotide patterns of a certain width with
    position counts at each index. Sequences added to the same
    store are assumed to have the same length.
    """
    # avoid sharing one mutable default dictionary across calls
    if store is None:
        store = {}

    # open intervals need a padding at end for proper slicing
    size = len(sequence) + 1

    def zeroes():
        "Generates an empty array that holds the positions"
        return [0] * (size - width)

    # these are the end indices
    ends = range(width, size)

    for lo, hi in zip(count(), ends):
        # upon encountering a missing key, initialize the value
        # for that key to the return value of the zeroes() function
        key = sequence[lo:hi]
        store.setdefault(key, zeroes())[lo] += 1

    return store

The code at multipatt.py demonstrates its use in a full program. Set the size to the maximum possible sequence length. A typical use case::

store = {}
seq1 = 'ATGCT'
pattern_update(seq1, width=2, store=store)    

seq2 = 'ATCGC'
pattern_update(seq2, width=2, store=store)    

print(store)

will print (key order may differ between Python versions)::

{'CG': [0, 0, 1, 0], 'GC': [0, 0, 1, 1], 'AT': [2, 0, 0, 0], 
'TG': [0, 1, 0, 0], 'TC': [0, 1, 0, 0], 'CT': [0, 0, 0, 1]}
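
If the input sequences differ in length, the position lists need a common size, which is presumably what the note about the maximal sequence size refers to. Below is a minimal sketch of that idea. The function name `pattern_update_fixed` and the `max_len` parameter are hypothetical, introduced here for illustration only; they are not taken from multipatt.py.

```python
def pattern_update_fixed(sequence, max_len, width=2, store=None):
    """
    Variant that sizes every position list from max_len (a hypothetical
    parameter) rather than from the current sequence, so sequences of
    different lengths can share one store.
    """
    if store is None:
        store = {}
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")

    # number of possible start positions for a pattern of this width
    slots = max_len - width + 1

    for lo in range(len(sequence) - width + 1):
        key = sequence[lo:lo + width]
        # missing keys start out as a list of zeroes, one per start position
        store.setdefault(key, [0] * slots)[lo] += 1

    return store


store = {}
for seq in ("ATGCT", "ATC", "GATGCTA"):
    pattern_update_fixed(seq, max_len=7, width=2, store=store)

print(store["AT"])  # [2, 1, 0, 0, 0, 0]
```

Here every list has `max_len - width + 1` entries, so positions past the end of a shorter sequence simply stay at zero.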