Finding the most frequent k-mers in a FASTQ file
8.2 years ago
murat • 0

Hello everyone,

I have zero knowledge of bioinformatics, and I am sorry if this question seems obvious, but I've done a lot of research and couldn't find an answer.

Let's say I have a FASTQ file and I need to find the 25 most frequent k-mers (k=30) in it. What should the algorithmic approach be?

Let's say the file looks something like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID
GATTTGGGAGTAAATCCATTTGTTCAACTCACAGTTTGTTCAAAGCAGTATCGATCAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

What I do is read the sequence line after the first @SEQ_ID and extract every possible 30-mer (as a substring) from it, then move on to the sequence after the second @SEQ_ID and extract the 30-mers from there as well.

However, I couldn't find any information about whether I should concatenate these two strings and look for k-mers in the combined string.

In other words, should I count (for example) the last 10 characters of the first sequence plus the first 20 characters of the second sequence as a k-mer?

Thank you

kmer sequencing • 4.2k views
8.2 years ago

You should likely not be concatenating the sequences together. Each read is a substring of a larger DNA template that is either being read from one or both ends during sequencing. If you concatenate the reads together you'll be creating a junction that does not exist in the original DNA sequence.

Regarding k-mer counting, I would suggest looking at khmer.
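
For a small file, the per-read approach described in the question can be sketched in plain Python. This is only an illustration, not how khmer works internally: the function names (`read_sequences`, `count_kmers`, `top_kmers`) are made up for this example, and it assumes strict four-line FASTQ records with unwrapped sequence lines. K-mers are taken from each read separately, so nothing spans the junction between two reads.

```python
from collections import Counter


def read_sequences(lines):
    """Yield the sequence line of each FASTQ record.

    Assumption: every record is exactly four lines
    (header, sequence, '+', qualities), with no wrapped sequences.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip().upper()


def count_kmers(sequences, k):
    """Count k-mers within each read; k-mers never cross read boundaries."""
    counts = Counter()
    for seq in sequences:
        # A read shorter than k contributes no k-mers (empty range).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def top_kmers(path, k=30, n=25):
    """Return the n most frequent k-mers in the FASTQ file at `path`."""
    with open(path) as handle:
        return count_kmers(read_sequences(handle), k).most_common(n)
```

Calling `top_kmers("reads.fastq")` (the file name is a placeholder) would return a list of `(k-mer, count)` pairs. The `Counter` keeps every distinct k-mer in memory, so for real sequencing data a dedicated tool such as khmer is likely the better choice.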


Thanks for the answer, Matt. Every text editor on my Mac freezes when I try to open the output khmer produces. Any ideas?


man less?


It produces a binary file. Makes no sense.


If you tell me which script you ran and with what parameters, I might be able to help. If you're using the load-into-counting.py script, then you need to process the output k-mer graph (probably your binary file) with the abundance-dist.py script.


After reading the docs and this answer, it's clearer now. Thank you very much for the help, Matt!
