Entering edit mode
                    3.9 years ago
        wm
        
    
        ▴
    
    570
    How to calculate the mapping rate of random 20-mer DNA onto human reference (hg19)?
in theory, there are 4^20 = 1e13 kinds of 20-mer DNA, (ignore reverse-complement); and 2045683739 = 2e9 unique  20-mer DNA in human reference genome. So it is about 2/1000 mapping rate?
Using below python code to generate random 20-mer DNA sequence;
Using bowtie with default parameters to evaluate the mapping rate;
It turns out ~73%, using the above method (not 2/1000).
# count k-mers in hg19 reference genome
$ jellyfish --version
jellyfish 2.3.0
$ jellyfish count -m 20 -s 100M -t 10 -C ~/data/genome/hg19/bigZips/hg19.fa
$ jellyfish stats mer_counts.js 
Unique:    2045683739
Distinct:  2184581457
Total:     2897301035
Max_count: 911595
# generate random sequence
import random
def get_rand_mer(s):
    return ''.join(['ACGT'[random.randrange(0, 4)] for i in range(int(s))])
def get_rand_fa(s, n, f):
    with open(f, 'wt') as w:
        for i in range(int(n)):
            fa = '>{}\n{}'.format(i, get_rand_mer(s))
            w.write(fa+'\n')