DNA composition - all k-mers and their frequency in some sequencing data
4
6
Entering edit mode
6.1 years ago
John 13k

Hi :)

A few weeks ago I was looking for a tool that would help me get "DNA composition" statistics for my sequencing. Something that would give me a dataset which which I could ask questions about GC bias, or over-represented sequences, motifs, etc. There are tools to answer each question specifically, but I was looking for something more general from which many analyses could be built on. This led me to k-mer counting, and all the down-stream tools which leverage k-mer result files.

k-mer counting tools are pretty cool, but all the ones I tried had some draw backs; like high RAM requirements, very long run-times (although the latests bloom-filter based tools seem to mitigate this somewhat), but most importantly requiring a specific k-mer size to be chosen. I really wanted all mers in the dataset so I could look at ''GC' and 'GCG' and 'GCCGACGGACGAC' without having to re-run any analyses. I couldn't find a tool like this after a brief search, so gave up and wrote my own in NumPy based off suffix-arrays.

Two weeks later I have a functional program in the sense that I get results, but before I invest any time making it usable for others, I thought I should investigate further if there are tools which already do this. Making a suffix array was a nice learning experience for me so I haven't lost anything if such tools already exist - and if they do i'd love to compare performance characteristics - but if not I might consider tidying up the code and making proper documentation. Does anyone know of such tools?

Thank you so much, and happy Diwali :)

sequencing • 5.7k views
1
Entering edit mode

It seems like no, there are no tools with the same functionality as yours. What is the range of kmers your tool covers?

0
Entering edit mode

Well theoretically all ranges; but practically a row of data (node in a trie) can only store so much DNA before it gets split into two or more rows/nodes. So with 64bits for storing DNA, the sweet-spot for DNA length is around 50-60bp. For longer DNA fragments you'd have to increase that, but i mean you could store a 1000mer in a 32bit output, it would just take 1000/16 rows to store it :)

1
Entering edit mode

did you finally share your code or not ? :D how and where to use that ? thanks

1
Entering edit mode

Yes and no. Yes it's out there (https://github.com/JohnLonginotto/ACGTrie) But it isn't really ready for anyone else to use unless they know some python. It will make the trie, and give you the tools to query it, but you would have to write your own scripts right now to do anything more than that (like compare two samples, etc).

1
Entering edit mode

thanks, seems i need to learn python to survive this days :))

0
Entering edit mode

This seems like a very specific application. Could you just write a bash script that launches Jellyfish several times with different k?

0
Entering edit mode

This would probably work. My only concern is that you cant merge files of different k-mer sizes in Jellyfish, so working with 100+ files per sample would be a pain.

I also worry that it would be kind of slow. I tried to see how long it would take to process 12800000 fragments (all < 20bp) which is roughly 0.25% of one seq file, and by the time it had done the 4mers the (much slower) python script had finished. It's not really a fair comparison for jelly because it had to initialize and read the whole file over and over again, but when it gave a memory leak at 6mer and 7mer sizes i gave up with the approach.

Also, I dont think the n-mer question is a specific application - rather I think it is the more general problem, of which k-mers for graph building/etc is the more specific application. Problem with the specific question I asked here on Biostars is that if no one posts a tool than can do it, that doesn't mean a tool doesn't exist - just no one posted :/

4
Entering edit mode
6.1 years ago
SES 8.5k

This is an interesting question because even though there are dozens of k-mer tools available they do all seem to have the same limitations: very short k-mer length requirements (usually <32bp) and fixed length search. It seems like there are some new tools reporting performance improvements but they still seem to have those features. Tallymer (part of GenomeTools) is the only tool, to my knowledge, that can compute statistics for an arbitrary length (or range) of k-mers. It sounds like you want the 'occratio' subcommand, which runs for a range of k-mers (as opposed to the 'mkindex' command which runs for a fixed length). Here is an example to get a range of unique/nonunique k-mers and get the relative amounts.

gt tallymer occratio -output unique nonunique -minmersize 10 -maxmersize 180 -esa db
gt tallymer occratio -output unique relative -minmersize 10 -maxmersize 180 -esa db

Note, you have to create 'db' first, but this was just an example. I can't go into the details, but I know that Tallymer uses a data structure called an enhanced suffix array. This is by the same author of Vmatch, and MUMmer (and many other tools). I should mention that Vmatch is also amazingly flexible for doing k-mer arithmetic and we could probably come up with an equivalent command using that tool. I agree with you about learning by doing, but this is a problem I would personally not try to solve since there are great tools available that allow you to answer virtually any question you may have about genomes and k-mers.

1
Entering edit mode

Here's a link to the Tallymer paper: http://www.biomedcentral.com/1471-2164/9/517

0
Entering edit mode

Ah, fantastic, i will definitely check out Tallymer right now!! :D Thank you SES!

3
Entering edit mode
6.1 years ago
lh3 32k

For fixed k-mer, use kmc2 and DSK, especially when you have a fast disk. They are faster and use less memory than jellyfish.

If you don't want to use fixed k-mer, the right data structure is suffix array, FM-index or something equivalent. You can query any sequences up to the read length. Nonetheless, for a particular k-mer size, it is usually not as efficient as k-mer based methods.

0
Entering edit mode

I'll have to look into FM-indexes because I dont know what they are, but you're right, the suffix array will never be as efficient as the hash table. Do you know any tools that use a suffix array or FM-index though?

2
Entering edit mode

3
Entering edit mode
6.1 years ago

BBMap has a very fast program for kmer-counting, KmerCountExact. Usage:

kmercountexact.sh in=reads.fq out=counts.fasta k=20


An optional flag prefilter=2, for example, will use Bloom filters to ignore kmers with counts of 2 or less. Unfortunately, it does require a fixed kmer length per run, but at least for short kmers, it uses only a trivial amount of memory. It does not use disk at all for temp files.

However, it's hard to give any more specific advice as I'm not sure what you're trying to do. If I want to look at GC characteristics of data, for example, I'd probably want to plot a gc content histogram of reads, or a gc-vs-coverage plot, or something like that... and text files full of kmers and their counts don't really seem overly useful for those applications, unless you are dealing with very short kmers for your motif-overrepresentation question..

1
Entering edit mode

You just keep suprising me with this toolset! Thanks for letting me know!

0
Entering edit mode

I will definitely check this out as although jellyfish is an excellent tool, I don't have the RAM to use all the time.
Re: GC characteristics, there are a number of analyses you can do with k-mer info which you're right you can also do in other ways (like a per-read analysis of GC%), but the nice thing about k-mer analyses is that you just generate your k-mer output once, and then lots of analyses can be run from it very quickly without the need to re-process the raw data. The tool KAT got me interested in this initially (https://github.com/TGAC/KAT), although its specific to jellyfish outputs. I think KAT just scratches the surface of what you could do with a full k-mer map.

1
Entering edit mode

If you are working with human whole-genome data – for in-memory k-mer counting, nothing beats jellyfish in terms of speed; for disk-based k-mer counting, nothing beats kmc2/dsk. BBmap is never shown in benchmarks, so I don't know.

1
Entering edit mode

Well, here's some benchmark data for you!

Methodology:

I copied 7.1GB of HiSeq 2x150bp reads (20 million) from a simple metagenome (26 bacteria) to ramdisk, on 128GB 32-core nodes. Then I ran these commands:

kmercountexact.sh t=32 k=31 in=reads.fq out=counts.fa
jellyfish count -m 31 -s 1000000 -t 32 --text reads.fq

KmerCountExact (v35.67):
735.16 user, 44.33 system, 95.75 elapsed

Jellyfish (v2.1.4):
2071.09 user, 31.52 system, 111.31 elapsed


Using 60-mers:

KmerCountExact:
1382.44 user, 60.86 system, 154.48 elapsed

Jellyfish:
2484.95 user, 47.24 system, 217.88 elapsed


So it looks to me like something DOES beat jellyfish in terms of speed, as well as being dramatically more CPU-efficient. If anyone wants the dataset, I can upload it to my gdrive. Note that 2/3rds of the time was spent dumping kmers, for KmerCountExact; the counting only took 30 seconds (in the 31-mer case). But I didn't see a way to disable that for Jellyfish, so I had to leave it turned on for both of them. Still, these are the numbers for plotting a kmer frequency histogram but not dumping kmers as text:

KmerCountExact:
690.17 user, 20.50 system, 33.81 elapsed


Edit - the fastq data is here

0
Entering edit mode

How did you calculate time that it takes? I want to rerun this analysis including kmergenie, another tool I have been using.

0
Entering edit mode

I used memtime.

It's not perfect - the memory is not calculated correctly when you launch a subprocess from a shellscript, but the times are still correct. The memory numbers are correct for C programs, though.

Usage:

memtime <command>


For example:

memtime jellyfish count -m 31 -s 1000000 -t 32 --text reads.fq


memtime kmercountexact.sh blah blah will correctly calculate the times for KmerCountExact, but under-report the memory. Still, you can't get a useful memory figure from a Java program anyway, due to the automatic garbage collection; it's non-deterministic. The only way to determine how much memory was actually needed is by progressively running with lower and lower -Xmx values until it crashes. Just like how you determine the load limit of a bridge.

0
Entering edit mode

Don't all these programs dynamically increase the memory usage as they read the files? My program just would not work if you had to specify upfront the amount of memory to use. Or rather, it would crash a lot :P

0
Entering edit mode

Yes, they dynamically increase memory usage as needed. But, KmerCountExact has a prealloc flag, and Jellyfish's -s parameter allows you to specify the initial size of the hash table, so with either one you can force them to start using all available memory, which is more efficient than growing dynamically if you are ultimately going to use most of the available memory.

1
Entering edit mode

Well, in terms of time, jellyfish and KmerCountExact were about the same for my dataset. Log:

@OHRI-crick ~/analysis/kmercounting $time ~/programs/bbmap_35.66/kmercountexact.sh t=16 k=23 in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq fastadump=f khist=wtCop_bbtools.hist java -ea -Xmx267370m -Xms267370m -cp /home/apelin/programs/bbmap_35.66/current/ jgi.KmerCountExact t=16 k=23 in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq fastadump=f khist=wtCop_bbtools.hist Executing jgi.KmerCountExact [t=16, k=23, in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq, fastadump=f, khist=wtCop_bbtools.hist] Set threads to 16 Initial: Memory: max=268676m, free=263069m, used=5607m Executing kmer.KmerTableSet [t=16, k=23, in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq, fastadump=f, khist=wtCop_bbtools.hist] Set threads to 16 Initial: Ways=41, initialSize=128000, prefilter=f, prealloc=f Memory: max=268676m, free=261667m, used=7009m Initialization Time: 0.292 seconds. Before loading: Memory: max=268676m, free=261667m, used=7009m Estimated kmer capacity: 10461215479 After loading: Memory: max=268676m, free=226622m, used=42054m Input: 3730537 reads 408465832 bases. Unique Kmers: 5748941 Load Time: 14.464 seconds. Reads Processed: 3730k 257.92k reads/sec Bases Processed: 408m 28.24m bases/sec Input: 3730537 reads 408465832 bases. For K=23 Unique Kmers: 5748941 Load Time: 14.467 seconds. Histogram Write Time: 0.134 seconds. real 0m16.209s user 1m15.025s sys 0m3.725s @OHRI-crick ~/analysis/kmercounting$ time jellyfish count -m 23 -s 1000000 -t 16 /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq

real    0m12.692s
user    3m0.485s
sys     0m0.913s

@OHRI-crick ~/analysis/kmercounting $time ~/programs/kmergenie-1.6982/kmergenie --one-pass -k 23 -l 23 -e 1 -t 16 -o wtCop_kmergenies_k23 /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq running histogram estimation Extrapolating number of distinct kmers 10000Linear estimation: ~40 M distinct 22-mers are in the reads K-mer sampling: 1/1 | processing | [going to estimate histograms for values of k: 23 ------------------------------------------------------------------------------------------------------------------------------Total time Wallclock 216.952 s fitting model to histograms to estimate best k Caught exception in fit_histogram worker thread (histfile = wtCop_kmergenies_k23-k23.histo): Traceback (most recent call last): File "/home/apelin/programs/kmergenie-1.6982/scripts/decide", line 62, in fit_histogram rc, stdout, stderr = run(command) File "/home/apelin/programs/kmergenie-1.6982/scripts/decide", line 43, in run process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE) File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__ errread, errwrite) File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child raise child_exception OSError: [Errno 2] No such file or directory Traceback (most recent call last): File "/home/apelin/programs/kmergenie-1.6982/scripts/decide", line 105, in <module> outputs = p.map(fit_histogram, list_histograms) File "/usr/lib64/python2.6/multiprocessing/pool.py", line 148, in map return self.map_async(func, iterable, chunksize).get() File "/usr/lib64/python2.6/multiprocessing/pool.py", line 422, in get raise self._value OSError: [Errno 2] No such file or directory Execution of 'scripts/decide' failed (return code 1). If this is a fresh Kmergenie install, try running 'make check'. real 3m37.254s user 3m40.573s sys 0m0.882s  However, the histogram produced by jellyfish was rather odd, and did not correspond to the one produced by kmergenie or kmercountexact ADD REPLY 1 Entering edit mode You need to use -C. Don't know why this is not the default. ADD REPLY 0 Entering edit mode Agreed. Also could explain why jelly is not as fast since its doing twice the work without -C. ADD REPLY 1 Entering edit mode I think you meant why it is the fastest (we need to be looking at the real value right?). If I do -C I do get a slight increase in time for jellyfish: @OHRI-crick ~/analysis/kmercounting$ time jellyfish count -m 23 -s 1000000 -t 16 -C /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq

real    0m15.005s
user    3m35.300s
sys    0m0.867s


Thanks for the remark, indeed adding -C makes it super similar to the other 2. I actually went through the manual and didn't catch that, this is embarrassing.

1
Entering edit mode

There are still two additional things. Firstly, -s. Usually you'd want to make it close to the actual k-mer counts. A proper starting -s will make jellyfish noticeably faster. Secondly, --text. Jellyfish doesn't output counts in text format. Printing k-mers is slow. Adding --text will make jellyfish slower. Anyway, to count 31/61-mer appearing >=2 times from my data (~30X c.elegans), jellyfish is still faster and takes far less RAM - I know it is hard to know how much RAM a java tool use. In addition, bbmap uses ~10-16 CPU cores when I asked it to use 8 only - I know this is another GC thing we can't control. In all, I am impressed that bbmap is that fast, but I would conclude jellyfish is better all around. KMC2 is faster than both, provided that you have a fast disk.

1
Entering edit mode

Here is my final benchmark, multiple thread counts on a 1.1 GB file. Previously I forgot to load the Java module.

# KmerCountExact - Threads 8
@OHRI-crick ~/analysis/kmercounting/t32 $time ~/programs/bbmap_35.66/kmercountexact.sh t=8 k=23 in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq fastadump=f khist=wtCop_bbtools_k23_t8.hist real 0m7.107s user 0m53.124s sys 0m2.530s # Jellyfish - Threads 8 @OHRI-crick ~/analysis/kmercounting/t32$ time jellyfish count -m 23 -s 1000000 -t 8 -C /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq; mv mer_counts.jf wtCop_jellyfish_k23_t8.jf
real    0m21.766s
user    2m43.030s
sys    0m0.966s

@OHRI-crick ~/analysis/kmercounting/t32 $time ~/programs/kmergenie-1.6982/kmergenie --one-pass -k 23 -l 23 -e 1 -t 8 -o wtCop_kmergenies_k23_t8 /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq real 3m34.020s user 3m37.423s sys 0m0.859s  # KmerCountExact - Threads 16 @OHRI-crick ~/analysis/kmercounting/t32$ time ~/programs/bbmap_35.66/kmercountexact.sh t=16 k=23 in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq fastadump=f khist=wtCop_bbtools_k23_t16.hist
real    0m6.227s
user    1m6.747s
sys    0m3.031s

@OHRI-crick ~/analysis/kmercounting/t32 $time jellyfish count -m 23 -s 1000000 -t 16 -C /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq; mvmer_counts.jf wtCop_jellyfish_k23_t16.jf real 0m15.679s user 3m47.499s sys 0m0.931s # Kmergenie - Threads 16 @OHRI-crick ~/analysis/kmercounting/t32$ time ~/programs/kmergenie-1.6982/kmergenie --one-pass -k 23 -l 23 -e 1 -t 16 -o wtCop_kmergenies_k23_t16 /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq
real    3m36.166s
user    3m39.676s
sys    0m0.883s


# KmerCountExact - Threads 32
@OHRI-crick ~/analysis/kmercounting/t32 $time ~/programs/bbmap_35.66/kmercountexact.sh t=32 k=23 in=/scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq fastadump=f khist=wtCop_bbtools_k23_t32.hist real 0m6.386s user 1m18.464s sys 0m3.281s # Jellyfish - Threads 32 @OHRI-crick ~/analysis/kmercounting/t32$ time jellyfish count -m 23 -s 1000000 -t 32 -C /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq; mv mer_counts.jf wtCop_jellyfish_k23_t32.jf
real    0m14.088s
user    5m55.872s
sys    0m1.121s

@OHRI-crick ~/analysis/kmercounting/t32 \$ time ~/programs/kmergenie-1.6982/kmergenie --one-pass -k 23 -l 23 -e 1 -t 32 -o wtCop_kmergenies_k23_t32 /scratch/apelin/wtSeq_VACV/MaxStrict_min50/wtCop_S15_maxstrict.fastq
real    3m34.357s
user    3m37.753s
sys    0m0.896s


Seems like none of these become much faster with increased core availability. All tools have been downloaded prior to the analysis so that the most recent version is used.

2
Entering edit mode

bbmap is showing a histogram only, not giving count per k-mer; jellyfish is dumping count of each k-mer in its own binary format – not comparable. For benchmark purpose alone, it'd better to let them output per k-mer count in text (fastadump=t for bbmap; --text for jellyfish). In addition, java may have a warmup time. Small input favors C/C++ programs. Preallocating hash tables may also have a significant effect (on c. elegans data, -s1G is twice as fast as -s1M with similar peak RAM).

1
Entering edit mode

OK, I've rerun them on a different, 20-core/40-thread machine, and am using the -C flag for jellyfish. Commands:

memtime java -Xmx80g -Xms80g jgi.KmerCountExact in=reads.fq out=dump.fa t=40
memtime jellyfish count -m 31 -s 1000000 -t 40 -C --text reads.fq


Results:

KCE:
703.62 user, 30.33 system, 68.42 elapsed

Jellyfish:
2674.70 user, 14.63 system, 81.86 elapsed


Then, I reran Jellyfish with a larger initial hashtable size of 1 billlion; @lh3's suggestion.

1941.39 user, 14.65 system, 56.29 elapsed


So, in my tests, with dynamic resizing, KmerCountExact is faster. If you have sufficient cores and preallocate so dynamic resizing is not needed, Jellyfish is faster.

1
Entering edit mode

Woah, that's surprising. How can memory reallocation take up so much time? I think that is definitely something Jellyfish could/should tune. Reallocating a very large struct for me takes around a second and only happens every 10-20 minutes (500Mb at a time). Either they are resizing too often with not enough new space, or they are doing something weird like writing the buffer to disk and reading it back in again so the total memory usage doesn't peak during reallocation..?

1
Entering edit mode

They are not using disk. But, it depends on how aggressively you resize - growing by 20% versus doubling, for example, take different amounts of time. And also, resizing in KmerCountExact, for example, requires rehashing everything - that's a lot of random memory accesses (one per kmer), which is much slower than a bulk memory allocation operation. Also, for the duration of reallocation, some part (or maybe all) of the data structure might be locked so threads can't write to it...

1
Entering edit mode

And... I released a new version of KmerCountExact (35.69). Inspired by this thread, I made the kmer dumping multithreaded and have retaken the speed record on that dataset:

826.60 user, 29.97 system, 33.67 elapsed


which beats Jellyfish's 56 seconds even with preallocation.

The specific time breakdown for KCE:

Load Time:                      24.018 seconds.
Kmer Dump Time:                 7.875 seconds.

0
Entering edit mode
6.1 years ago
chongchu.cs ▴ 10

I know kmc2 is fast and memory efficient, but do need a specific k before running.

1
Entering edit mode

No, you need to specify a k-mer size for kmc2 to work. It just has a default k.

0
Entering edit mode

Yeah, same for jellyfish and i think khmer. I think it's because they all use 2bit packing of DNA into binary, and store that in memory in either 32bit or 64bit blocks (which is why they say their tools are optimised for 16bp or 32bp kmers or lower).

If your k-mer length is fixed, a hash table and 2bit encoding is the way to go, but if not, i think you need a whole different approach (like the suffix array, although theres probably better ways also)

1
Entering edit mode

I remember kmc2 use a clever technique called minimizer, and avoid the merging steps like jellyfish doing. It's much faster and memory consuming is also low when I tried it, compared with jellyfish.

0
Entering edit mode

Oh ok, i'll give it a go now :)