Step 1, generate all possible k-mers of the chosen size

Question

Finding 16 mer not present in GRCh38

2

Entering edit mode

5.4 years ago

Srw ▴ 60

I'm interested in creating list of kmers with length of 16 that are NOT present in the human reference GRCh38. I've attempted to use jellyfish but the lower limit for this approach is 1.

Any thoughts or examples on approaches would be greatly appreciated.

kmer genome • 2.8k views

ADD COMMENT • link updated 9 months ago by schlogl ▴ 160 • written 5.4 years ago by Srw ▴ 60

0

Entering edit mode

Any success with the proposed approaches?

ADD REPLY • link 5.3 years ago by ATpoint 82k

0

Entering edit mode

I found out 14-mers on my 48GB RAM PC, you can easily extend them to 16-mers. A: How to generate a short sequence that does not align to the RefSeq? .Email me if you need. shenwei356 at gmail

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

score 4 · Answer 1 · 2018-12-17

4

Entering edit mode

5.4 years ago

ATpoint 82k

Note: Edited the answer based on Joe's comment below.

Given that there are 4294967296 possible kmers (4^16) for your situation, standard pattern matching will probably be way too slow. I suggest you make a multifasta like

>kmer1
ATCG....
>kmer2
GATA....

...and pipe this directly into bowtie for alignment against hg38.

Write this python code to a file kmers.py (modified from a solution @ StackExchange and Joe's comment below),

import itertools
combinations = itertools.product(*itertools.repeat(['A','T','C','G'], 16))
for i in combinations:
    print ">"+''.join(i)
    print(''.join(i))

and then run:

./kmers.py | bowtie --sam --best --strata -v 0 -n 0 -l 16 -k 1 -m 1 -M 1 -f bowtie_index /dev/stdin | mawk '$1 ~ /^@/ {next} {if ($6 != "16M") print $10}' | tee >(pigz > kmers_notHG38.txt.gz) | wc -l /dev/stdin

It uses bowtie to align it against hg38, requiring end-to-end alignment and setting all mismatch penalties to maximum strict to only allow exact and full-length matches. Next, it filters for everything that has no perfect match CIGAR (16M) and then save the kmers that are not in hg38 to a file kmers_notHG38.txt.gz. It will also print the numbers of kmers in the output once the job is complete.

If you want to speed up things, bowtie can do multithreading with the -p option.

ADD COMMENT • link 5.4 years ago by ATpoint 82k

1

Entering edit mode

Not answering the question directly, but to generate the list of kmers this python solution should be considerably faster (but will still take a long time):

import itertools
combinations = itertools.product(*itertools.repeat(['A','T','C','G'], 16))
for i in combinations:
    print(''.join(i))

ADD REPLY • link 5.4 years ago by Joe 21k

1

Entering edit mode

A fast way to generate all k-mers for given k (k<=32) using unikmer:

# 4294967295 == ( 1<<(16*2) ) - 1
$ seq 0 4294967295 | unikmer decode -k 16 | head -n  3
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAG

FASTA format

$ seq 0 4294967295 | unikmer decode -k 16 -a | seqkit tab2fx | head -n 6
>0
AAAAAAAAAAAAAAAA
>1
AAAAAAAAAAAAAAAC
>2
AAAAAAAAAAAAAAAG

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

0

Entering edit mode

~~this answer had been deleted~~

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

2

Entering edit mode

Just fyi, the solution I posted above took about 2h on our HPC, using 8 threads for bowtie. Doing so, the bottleneck was the alignment rather than the kmer creation, so I would suggest not saving to disk and rather stream it, only keeping the non-matching kmers.

ADD REPLY • link 5.3 years ago by ATpoint 82k

0

Entering edit mode

Good to hear that, mapping is the fastest and most scalable solution.

ADD REPLY • link 5.3 years ago by shenwei356 8.4k

score 2 · Answer 2 · 2018-12-17

2

Entering edit mode

5.4 years ago

Pierre Lindenbaum 161k

brute force. It's slow and consummes I/O

I tested it with K=8 and a fasta file from E. Coli

https://gist.github.com/lindenb/866410c3fa9babe7e3f3ca5fcf53d6b5

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 161k

score 2 · Answer 3 · 2019-01-20

Thanks for the question - it has kept me busy this Sunday morning / afternoon. As implied by others, this poses a computational challenge but is not insurmountable.

For motif searching generally, I usually use AWK. My approach here was to:

generate all possible k-mers of the chosen size (run once only)
separately, tabulate k-mers of the chosen size in the hg38 sequence
get the difference between #1 and #2

I should also note that I have only done this for 12-mers and for just chr22. For 16-mers, I'd likely have to use a cluster environment. Over the entire genome, I would continue to break the analysis up by chromosome.

Step 1, generate all possible k-mers of the chosen size

# generate all possible 12-mer
# 'expected' must be set as maximum possible (for 12-mer, expected = 4^12)
awk -v atgc=ATGC -v expected=16777216 -v range=4 'BEGIN{
  srand(565447);
  do {
    seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
    if (!(seq in motifs)) {
      print seq
      motifs[seq] = 1
      count++
    }
  } while (count < expected)
}' > all.12mer.out

NB - each random base is produced by substr(atgc,int(1+rand()*range),1). In this case, for a 12-mer, there are 12 of these assigned to seq.

For 16-mers, you need 16:

    awk -v atgc=ATGC -v expected=4294967296 -v range=4 'BEGIN{
      srand(565447);
      do {
        seq = substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)substr(atgc,int(1+rand()*range),1)
        if (!(seq in motifs)) {
          print seq
          motifs[seq] = 1
          count++
        }
      } while (count < expected)
    }' > all.16mer.out

all.12mer.out looks like:

head all.12mer.out 
CACCCTGATCCT
ACAATGTTCGTT
GTGACGGTCGCT
TTAGGCACCGAT
GCTCATATTTAT
GCGGGTTAGATG
GTGGTTGCGCAA
GATCTAAAGTGG

There are 16,777,216 possible 12-mers. This step takes a while (20 minutes) and could be improved in efficiency. The output file is ~220MB. The good thing is that this step only needs to be run once only.

Step 2, separately, tabulate k-mers of the chosen size in the hg38 sequence (just chr22)

In this step, the entire target sequence is first flattened and converted to upper case. Then, each found 12-mer is added to an indexed AWK array, with the entry's value incremented when the same 12-mer is found again. It moves in a sliding window of 1bp:

 ATGCATGCATGCATGCATGCATGC
 ATGCATGCAT
 |_________|
   TGCATGCATG
   |_________|
     GCATGCATGC
     |_________|

samtools faidx hg38.fasta chr22 | \
awk '{if (NR>1) printf toupper($0)}' | \
awk '{
  for (i=1; i<=NF; i+=1)
    if (i+11<=NF) {
      m12[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)]+=1
    }
  } END {
    for (motif in m12)
      print motif","m12[motif]
  }' FS='' > hg38.12mer.out

hg38.12mer.out contains each identified 12-mer and its frequency

head hg38.12mer.out
GCTCTCCCGCAG,2
TCGCTATATTGT,1
GAAACACGCCCT,2
AACCCTGGGAGT,2
GGCAGTCAGGAT,7
TCCCACTCGGCA,2
GTGAATTCTTCT,5
TCCCACTCGGCC,5

For 16-mer, you would need to modify to:

samtools faidx hg38.fasta chr22 | \
awk '{if (NR>1) printf toupper($0)}' | \
awk '{
  for (i=1; i<=NF; i+=1)
    if (i+15<=NF) {
      m16[$(i)$(i+1)$(i+2)$(i+3)$(i+4)$(i+5)$(i+6)$(i+7)$(i+8)$(i+9)$(i+10)$(i+11)$(i+12)$(i+13)$(i+14)$(i+15)]+=1
    }
  } END {
    for (motif in m16)
      print motif","m16[motif]
  }' FS='' > hg38.16mer.out

Step 3, get the difference between #1 and #2

# identify difference between all possible 12-mers and those found in the target sequence
awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out

With that, there are, apparently, 7,228,791 12-mers that are not present in chr22 in hg38:

awk 'NR==FNR {hgmers[$1]; next} !($0 in hgmers)' FS="," hg38.12mer.out all.12mer.out | head
ACAATGTTCGTT
GTGACGGTCGCT
GCGGGTTAGATG
GTGGTTGCGCAA
GCACGACATTGA
GCACGAGTCAAG
GTCCGCGGGATT
GCCGTGTAAAAG
CGCTTGTCACAC
GGTACCGTTTGA

I originally checked for 8-mers but all 8-mers are accounted for in chr22 in hg38.

If I manage to do 16-mer, I'll post the solution...

Kevin

score 2 · Answer 4 · 2020-04-11

Can't believe I missed this question the first time around. Maybe my answer will be of use to others down the road.

I wrote C++ and Python versions of a tool called kmer-boolean a couple years ago that might be useful here: https://github.com/alexpreynolds/kmer-boolean

This tool enumerates all kmers for given length k and FASTA sequence from standard input. It returns whether kmers are "found" (--present), "not found" (--absent), or both cases: "found" or "not found" (i.e., --all).

The C++ version uses a custom bitset-like library, so as to reduce memory usage to 1 bit per unique kmer. Each kmer gets its own bit. If set to zero, that kmer is not found. If set to one, the kmer was found somewhere in the input FASTA.

A 16mer analysis will require 4^16 bits, or 2^29 bytes (if my math is right). Reserving that chunk of memory and parsing hg38 may take a little time.

Still, to build:

$ git clone https://github.com/alexpreynolds/kmer-boolean
$ cd kmer-boolean
$ make

Input example:

$ cat /tmp/foo.fa
> foo
GTTTCTTATTAATAATATCCTATATTTTCATTTCGG

Usage example for 4mers:

$ ./kmer-boolean --k=4 --absent < /tmp/foo.fa
AAAA not found
AAAC not found
...
GGGT not found
GGGG not found

Options --present and --all are available for the other two query cases.

If your computer has enough memory, in your case, you'd replace 4 with 16 and /tmp/foo.fa with the path to hg38.fa or its equivalent (e.g., available from UCSC goldenpath or other sources).

An important note, which I'm recopying from the Github README:

Note that searches are performed not on "canonical" DNA kmers, but all unique kmers. In other words, for example, a search for the AG 2mer will be treated separately from the reverse complement CT 2mer.

In the example above, for instance, AAAA is not found, although its reverse complement TTTT is found. This may or may not be important for you. A second-pass on the output may be needed to check if the reverse complement is in the --present or --absent set.