k-mer counters - presence/absence matrix
16 months ago
lizabe ▴ 10

Hi all, I need to compute a presence/absence matrix (binary) of k-mers present in a set of genomes (fasta files).

Could you please suggest a tool? I tried Jellyfish, but the --matrix option described in the tutorial (https://raw.githubusercontent.com/gmarcais/Jellyfish/master/doc/jellyfish.pdf) didn't work.

Thanks.

k-mers matrix • 1.3k views

Can you elaborate on "didn't work"? Jellyfish was the first thing that sprang to mind when I read your title, so I would say it's probably worth persisting with, since it's one of the best tools, if not the best, for k-mer work.


Hi Joe, thanks for your answer. Sorry I didn't explain my problem in the first message.

I installed jellyfish 2.3.0 and ran the command:

jellyfish count -m 256 -o jellyoutput -c 1 -s 100000000 -t 32 --matrix file.fasta

This was the error:

count: unrecognized option '--matrix'
Use --usage or --help for some help

The tutorial probably corresponds to an older version of the program. Do you know the correct command to generate a matrix like the one I need?


I am not entirely sure that Jellyfish can be readily used for this kind of comparative analysis. You may have to generate k-mer profiles for each sample/genome and then carry out the comparisons separately.

16 months ago
Rob 5.6k

Hi lizabe,

You're right that this tutorial is out of date. The --matrix option is no longer valid as an option to jellyfish count. However, I don't think its original intent was to do what you want anyway. It doesn't write out a binary presence/absence matrix. Rather, it specifies the binary matrix that is used to generate the universal hash function for hashing the k-mers. Jellyfish relies on a universal hash function, which can be generated using a random binary matrix. If you want to use the exact same hash function for other purposes, you need to know what that matrix is.
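To make the idea of a hash function defined by a random binary matrix concrete, here is a minimal Python sketch. It is only an illustration of the general technique (multiplying the k-mer's bit vector by a binary matrix over GF(2)); it is not Jellyfish's actual implementation, and the 2-bit base encoding, the function names, and the seed parameter are all assumptions made for the example.

```python
import random

# Assumed 2-bit encoding of bases (an illustrative choice, not taken from Jellyfish).
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_BITS[base]
    return value


def random_matrix(n_rows, n_cols, seed):
    """Random binary matrix, stored as one n_cols-bit integer per row."""
    rng = random.Random(seed)
    return [rng.getrandbits(n_cols) for _ in range(n_rows)]


def matrix_hash(matrix, x):
    """Multiply the matrix by bit vector x over GF(2).

    Each row contributes one output bit: the parity of (row AND x).
    """
    h = 0
    for row in matrix:
        parity = bin(row & x).count("1") & 1
        h = (h << 1) | parity
    return h


if __name__ == "__main__":
    k = 4
    matrix = random_matrix(n_rows=16, n_cols=2 * k, seed=42)
    print(matrix_hash(matrix, encode("ACGT")))
```

The same matrix always yields the same hash values, which is why anyone who wants to reproduce the hashing elsewhere needs the matrix itself.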

Anyway, to achieve what you want, I'm afraid you'll need to take a different approach. Essentially, what you want to do is count k-mers in a collection of different fasta files / genomes, and then determine which k-mers are present in each. With jellyfish, you could do this by running jellyfish separately on each input genome, then using the dump command to get the k-mer list for each in plain text, and then merging across the files to get the matrix. Alternatively, you could use a tool like mantis (disclosure: I'm a senior author of this method) or metagraph, which are designed explicitly to answer k-mer presence/absence queries over a large collection of k-mers coming from different sources (among other things).
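The merge step of the jellyfish route could be sketched in Python roughly as follows. This assumes each genome has already been counted and dumped in column format (for example with jellyfish dump -c, which writes one "KMER COUNT" pair per line); the file names, function names, and the tab-separated output layout are hypothetical choices for the example. It holds every k-mer in memory, so it is only reasonable for small genomes or short k.

```python
import csv
import sys
from pathlib import Path


def read_kmers(path):
    """Return the set of k-mers in one column-format dump (lines: 'KMER COUNT')."""
    kmers = set()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                kmers.add(fields[0])
    return kmers


def presence_absence(sample_kmers):
    """Build the matrix from a dict {sample name: set of k-mers}.

    Returns (samples, sorted k-mers, rows), one row of 0/1 values per k-mer.
    """
    samples = list(sample_kmers)
    all_kmers = sorted(set().union(*sample_kmers.values()))
    rows = [[int(kmer in sample_kmers[s]) for s in samples] for kmer in all_kmers]
    return samples, all_kmers, rows


def write_matrix(samples, kmers, rows, out):
    """Write a tab-separated matrix with a header line of sample names."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["kmer", *samples])
    for kmer, row in zip(kmers, rows):
        writer.writerow([kmer, *row])


if __name__ == "__main__":
    # Hypothetical usage: python merge_kmers.py genomeA.tsv genomeB.tsv > matrix.tsv
    dumps = {Path(p).stem: read_kmers(p) for p in sys.argv[1:]}
    if dumps:
        write_matrix(*presence_absence(dumps), sys.stdout)
```

Each column of the output is one genome and each row one k-mer, with 1 for present and 0 for absent. If both strands should be treated as the same k-mer, the counting step would need to use canonical k-mers consistently across all genomes.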


Thanks Rob!

16 months ago

Perhaps kmer-counter or kmer-boolean would be of use for kmers shorter than 31 characters.

The kmer-counter repo contains a script to demonstrate Python integration for quick filtering/querying. You could easily write out a presence/absence matrix from this result.

For kmers that are 32 characters and longer, a tool like Jellyfish would be appropriate.


Thanks Alex!
