Mapping rate of random 20-mer DNA to human genome.
0
1
Entering edit mode
2.3 years ago
wm ▴ 550

How to calculate the mapping rate of random 20-mer DNA onto human reference (hg19)?

in theory, there are 4^20 = 1e13 kinds of 20-mer DNA, (ignore reverse-complement); and 2045683739 = 2e9 unique 20-mer DNA in human reference genome. So it is about 2/1000 mapping rate?

Using below python code to generate random 20-mer DNA sequence;

Using bowtie with default parameters to evaluate the mapping rate;

It turns out ~73%, using the above method (not 2/1000).

# count k-mers in hg19 reference genome

$ jellyfish --version
jellyfish 2.3.0

$ jellyfish count -m 20 -s 100M -t 10 -C ~/data/genome/hg19/bigZips/hg19.fa
$ jellyfish stats mer_counts.js 
Unique:    2045683739
Distinct:  2184581457
Total:     2897301035
Max_count: 911595
# generate random sequence

import random

def get_rand_mer(s):
    return ''.join(['ACGT'[random.randrange(0, 4)] for i in range(int(s))])


def get_rand_fa(s, n, f):
    with open(f, 'wt') as w:
        for i in range(int(n)):
            fa = '>{}\n{}'.format(i, get_rand_mer(s))
            w.write(fa+'\n')
alignment random • 423 views
ADD COMMENT

Login before adding your answer.

Traffic: 2842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6