Question: K-Mer Counting From Single-End, Paired-End And Mate-Pair Reads
gravatar for Alice
5.4 years ago by
Alice270 wrote:

Hello, biostars!

I had one small issue, which grew up in a big question.

My issue is following: i'm going to count k-mers and it's frequency (may be depth) on mouse wgs by Illumina (in jellyfish software). Is it correct to combine 2 fastq files into one just by bash 'cut' command? And is it wise to state '-C' flag in jellyfish? Or i need to filter reads by quality before counting k-mers?

If i will do the same on SE reads, will i get similar or identical results? I understand, that for the genome it does not matter PE or SE we use. Correct me if i'm wrong, but as i think, between SE, PE and MP reads there will be difference in unique k-mers, because of ngs technology procedures? Thus, for k-mer counting the comparison of uniqueness value from SE and PE technologies is not correct.

ngs • 5.1k views
ADD COMMENTlink modified 5.4 years ago by JacobS890 • written 5.4 years ago by Alice270

I'm interested in this topic because I've been writing perl script for kmer counting recently. Can you please explain why you are wanting to count kmer frequency in NGS reads? I can share my script for easy kmer counting if you'd like, and if it would be helpful for your situation.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by JacobS890

Firstly why do you want to mix two files ? Keep them single and check the k-mers individually and compare it after combining if needed. I think PE or SE doesn't matter much why can't you use more handy FASTQC tool for the K-mer counting i think it would be better if you use that

ADD REPLYlink written 5.4 years ago by ancient_learner610

Good idea, may be i will do that, thank you!

ADD REPLYlink written 5.4 years ago by Alice270
gravatar for JacobS
5.4 years ago by
Cleveland, Ohio
JacobS890 wrote:

First, your simplest question: combine FASTQ files using cat instead of cut

 cat sample1.fastq sample2.fastq sample3.fastq > total.fastq

Second, I haven't used jellyfish or trinity very much at all, so maybe someone else can answer your question about the -C option. Same goes to whether you'll get different frequencies when counting kmers for SE vs PE vs MP reads. I would expect that the frequencies will be in the same ratio, but perhaps the best way to answer this question is empirically! Why don't you quickly count and review the kmer frequencies of R1/R2/SE/PE/MP and then discuss the results?

Here is a perl script I wrote for looking at frequencies of specific size kmers to get an idea about common kmers and read sequence complexity:

Give this a try and see how your files differ. If you need to check the metrics for all possible kmers instead of individual kmers, let me know and I can modify my script! Good luck.

Here is an example of my script's output using kmer size 7:

Used to find kmer size 7 metrics
Reads file: 1M.fastq

Total number of reads: 1000000
Total number of kmers: 17618

Top 10 kmers combined:       7898965/45000000:  17.55% of all kmers
Top 100 kmers combined:      22211648/45000000:  49.36% of all kmers
Top 1000 kmers combined:     34938588/45000000:  77.64% of all kmers
Top 10000 kmers combined:    43949053/45000000:  97.66% of all kmers

Most common 5 kmers:
863948  TGGAATT
783133  TTCTCGG
729782  CGGGTGC
456897  AAGGAAC
429322  GATTCAG

Least common 5 kmers:
1       NCGCTGG
1       NTCGAAC
1       CAGTCNN
1       GAACNCN
1       NCCCGAT
ADD COMMENTlink written 5.4 years ago by JacobS890

Thank you! Very useful script. I'm not familiar with perl (i prefer python usually), but i will try it!

ADD REPLYlink written 5.4 years ago by Alice270

Great, hopefully you'll like it. If you aren't too familiar with perl, no problem... download the linked file into your UNIX environment using something like wget, and then run the script like this: perl my_sample.fastq 7 (this checks for kmers of size 7)

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by JacobS890

Jacob, I counted kmers by your script and get interesting results

First, many N in HiSeq reads, I guess, for kmer counting I should delete them. Second: 951272 ACACACA 865399 CACACAC vs. 893841 ACACACA 877559 CTGCTTG Two different files and on the second place 2 different words. In my opinion it is the sign that i need wisely concatenate two files, not just by "cat".

ADD REPLYlink written 5.4 years ago by Alice270
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 773 users visited in the last hour