Hi is there any way to count the frequency of kmers from fasta files for different values of k? My objective is to produce a plot of the kmer distribution. Thank you so much.
This one is fast and has lots of other functions in addition to k-mer counting:
From BBMap suite.
$ kmercountexact.sh
Written by Brian Bushnell
Last modified November 29, 2018
Description: Counts the number of unique kmers in a file.
Generates a kmer frequency histogram and genome size estimate (in peaks output),
and prints a file containing all kmers and their counts.
Supports K=1 to infinity, though not all values are allowed.
SEE ALSO: bbnorm.sh/khist.sh loglog.sh, and kmercountmulti.sh.
Usage: kmercountexact.sh in=<file> khist=<file> peaks=<file>
Input may be fasta or fastq, compressed or uncompressed.
Output may be stdout or a file. out, khist, and peaks are optional.
or
$ kmercountmulti.sh
Written by Brian Bushnell
Last modified December 30, 2016
Description: Estimates cardinality of unique kmers in sequence data.
Processes multiple kmer lengths simultaneously to produce a histogram.
Usage: kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>
@GenoMax Is this from BBmaptool? Do I need to install BBmap for this?
Yes. There is no installation needed. Just download the software and unzip the file. You are ready to go as long as you have a java run time installed.
@Genomax I know this is not the proper post to ask this so I can delete it if needed but I am having this error when I tried to run kmercountermulti.sh on my linux server
/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx500m -cp /home/bbmap/current/ jgi.KmerCountMulti in=/home/eColi_LR_SL.fasta sweep=16,32,64 out=hist1_eColi_LR
Executing jgi.KmerCountMulti [in=/home/eColi_LR_SL.fasta, sweep=16,32,64, out=hist1_eColi_LR]
Exception in thread "main" java.lang.ArrayStoreException: cardinality.LogLog2
at cardinality.MultiLogLog.<init>(MultiLogLog.java:30)
at cardinality.MultiLogLog.<init>(MultiLogLog.java:11)
at jgi.KmerCountMulti.<init>(KmerCountMulti.java:149)
at jgi.KmerCountMulti.main(KmerCountMulti.java:49)
My java version is :
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)
Just tried with E. coli genome.
$ kmercountmulti.sh in=GCF_000005845.2_ASM584v2_genomic.fna sweep=20,100,8
Input is being processed as unpaired
#K Count
20 4645838
28 4483466
36 4645838
44 4577093
52 4541564
60 4365048
66 4597298
75 4669509
84 4475877
90 4626988
100 4387290
Time: 1.603 seconds.
Reads Processed: 1 0.00k reads/sec
Bases Processed: 4641k 2.90m bases/sec
Unfortunately I am having the same error. Did you do any modifications to the calcmem.sh as it is showing some error at the 75th line?
No I did not make any changes. In the sweep=
option use the order start,stop,step
(in my example go from 20 to 100 in steps of 8). Wonder if you have an extra comma at the end of sweep=16,32,64,
.
No I do not have any extra comma there:
./kmercountmulti.sh in=/home/eColi_LR_SL.fasta sweep=16,64,16 out=hist1_eColi_LR
This is the command I am using and getting the error for. This is a fasta file for long reads of e Coli that I have simulated. I am wondering if the error if because of the input then.
Finding Kmers From Fasta File
Counting k-mers in a multifasta file
specific k-mer count in multi fasta file
It is three times the same tools being suggested though :)
@ATpoint Thank you for these. I had seen these. For some reason I am failing to install Jellyfish on my linux server. it is saying command not found whenever I am giving jellyfish command. kmer-counter is working but it is pretty slow. For ecoli short reads it took more than 6 hours and the server where I run the program can run for 6 hours at a stretch so the job got killed before producing any result. That's why I was looking for some other tool or script maybe awk or anythig.