How to count frequency of kmers for different values of k from fasta file
2
0
Entering edit mode
17 months ago
Ashi ▴ 20

Hi is there any way to count the frequency of kmers from fasta files for different values of k? My objective is to produce a plot of the kmer distribution. Thank you so much.

fasta kmers count frequency genome • 1.3k views
0
Entering edit mode
0
Entering edit mode

@ATpoint Thank you for these. I had seen these. For some reason I am failing to install Jellyfish on my linux server. it is saying command not found whenever I am giving jellyfish command. kmer-counter is working but it is pretty slow. For ecoli short reads it took more than 6 hours and the server where I run the program can run for 6 hours at a stretch so the job got killed before producing any result. That's why I was looking for some other tool or script maybe awk or anythig.

0
Entering edit mode
17 months ago
Mensur Dlakic ★ 20k

This one is fast and has lots of other functions in addition to k-mer counting:

https://github.com/dib-lab/khmer

0
Entering edit mode
17 months ago
GenoMax 117k

From BBMap suite.

$kmercountexact.sh Written by Brian Bushnell Last modified November 29, 2018 Description: Counts the number of unique kmers in a file. Generates a kmer frequency histogram and genome size estimate (in peaks output), and prints a file containing all kmers and their counts. Supports K=1 to infinity, though not all values are allowed. SEE ALSO: bbnorm.sh/khist.sh loglog.sh, and kmercountmulti.sh. Usage: kmercountexact.sh in=<file> khist=<file> peaks=<file> Input may be fasta or fastq, compressed or uncompressed. Output may be stdout or a file. out, khist, and peaks are optional.  or $ kmercountmulti.sh

Written by Brian Bushnell

Description:  Estimates cardinality of unique kmers in sequence data.
Processes multiple kmer lengths simultaneously to produce a histogram.

Usage:  kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>

0
Entering edit mode

@GenoMax Is this from BBmaptool? Do I need to install BBmap for this?

0
Entering edit mode

Yes. There is no installation needed. Just download the software and unzip the file. You are ready to go as long as you have a java run time installed.

0
Entering edit mode

@Genomax I know this is not the proper post to ask this so I can delete it if needed but I am having this error when I tried to run kmercountermulti.sh on my linux server

/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx500m -cp /home/bbmap/current/ jgi.KmerCountMulti in=/home/eColi_LR_SL.fasta sweep=16,32,64 out=hist1_eColi_LR
Executing jgi.KmerCountMulti [in=/home/eColi_LR_SL.fasta, sweep=16,32,64, out=hist1_eColi_LR]

Exception in thread "main" java.lang.ArrayStoreException: cardinality.LogLog2
at cardinality.MultiLogLog.<init>(MultiLogLog.java:30)
at cardinality.MultiLogLog.<init>(MultiLogLog.java:11)
at jgi.KmerCountMulti.<init>(KmerCountMulti.java:149)
at jgi.KmerCountMulti.main(KmerCountMulti.java:49)


My java version is :

openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)

0
Entering edit mode

Just tried with E. coli genome.

\$ kmercountmulti.sh in=GCF_000005845.2_ASM584v2_genomic.fna sweep=20,100,8

Input is being processed as unpaired

#K  Count
20  4645838
28  4483466
36  4645838
44  4577093
52  4541564
60  4365048
66  4597298
75  4669509
84  4475877
90  4626988
100 4387290

Time:                           1.603 seconds.
Bases Processed:       4641k    2.90m bases/sec

0
Entering edit mode

Unfortunately I am having the same error. Did you do any modifications to the calcmem.sh as it is showing some error at the 75th line?

0
Entering edit mode

No I did not make any changes. In the sweep= option use the order start,stop,step (in my example go from 20 to 100 in steps of 8). Wonder if you have an extra comma at the end of sweep=16,32,64,.

0
Entering edit mode

No I do not have any extra comma there:

./kmercountmulti.sh in=/home/eColi_LR_SL.fasta sweep=16,64,16 out=hist1_eColi_LR


This is the command I am using and getting the error for. This is a fasta file for long reads of e Coli that I have simulated. I am wondering if the error if because of the input then.

0
Entering edit mode

Possibly. How did you simulate the reads?