How To Extract Information From Fastq Pair End Files
4
0
Entering edit mode
8.4 years ago
Paul ★ 1.4k

dear BioStars users,

I would like to extract from my pair-end fastq files information how many times my read is occurring in my fastq file.

So output could look -

my read (sequence) - how many times I found it in fastq file :

CCGGCTCGC - 140x CTTCGCGCC - 2x

I tried to use awk to comparing all reads to each other, but it does not work very well :-(

Is there any tool or idea how to compare all reads to each other and extract how many times is occurring my reads in fastq file?

Thank you so much for any idea and help! I hope my question is clear..

Paul.

fastq paired-end • 5.0k views
3
Entering edit mode

Have you tried just using FastQC, which does kmer enrichment as part of its normal workflow?

0
Entering edit mode

Thank you very much for reply.

I tried FastQC and it works perfectly fine. But their modules - overrepresented sequences is perfect, but any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. And I would like to compare all read length (maybe with a few mismatches).

It is hard to say, if it is possible..

0
Entering edit mode

Ah, I hadn't realized that. Other options that you might look into would be this program or the QRQC package in bioconductor, which is supposed to also do k-mer calculations.

3
Entering edit mode
8.4 years ago
Pavel Senin ★ 1.9k

wordcount can help in this

$zcat test.fq.gz | head -5000 | seqtk fq2fa - >test.fa$ wordcount test.fa -wordsize=9


output:

$head test.wordcount AACAACAAC 44 CAACAACAA 44 ACAACAACA 44 TCACTCACT 20 CACTCACTC 19 TAGGGTTAG 17 TTAGGGTTA 17 CTCACTCAC 17 GGGTTAGGG 15 GTTAGGGTT 15  ADD COMMENT 0 Entering edit mode That's a quite nice solution! ADD REPLY 0 Entering edit mode Thank you so much for time and help!! P. ADD REPLY 0 Entering edit mode 8.4 years ago Assa Yeroslaviz ★ 1.7k try this one here. Haven't tried it myself yet, but this is something I also need and will try it soon. ADD COMMENT 0 Entering edit mode Thank you for link... I gonna try it and let you know... ADD REPLY 0 Entering edit mode 8.4 years ago This has already been answered here C: script to calculate length distribution of a collapsed fasta file The input you have is fastq which is different from fasta. so modify the solution accordingly. ADD COMMENT 0 Entering edit mode But If I am not mistaken, here the fastq file is not collapsed. ADD REPLY 0 Entering edit mode 8.4 years ago Perry • 0 I presume you want to count the frequency of the read you specified in your fastq files. Here is a little perl script (script.pl) you can take use:  #!/usr/bin/perl -w use strict; # if your fastq file is gzip compressed, you may need to add PerlIO::gzip module use PerlIO::gzip; # usage: perl script.pl <specified_seq> <R1.fastq R2.fastq | R1.fastq.gz R2.fastq.gz> my$seq = $ARGV[0]; my$count = 0;

foreach my $i (@ARGV[1..$#ARGV]){
open IN,"<:gzip",$i || open IN,$i || die $!; while(<IN>){ # read in four lines of each read in fastq format file. <IN>; my$read = <IN>;
<IN>;<IN>;
if ($read eq$seq){
$count++; } } close IN; } # output counts print "$seq occurs \$count times\n";