How To Extract Information From Fastq Pair End Files
4
0
Entering edit mode
8.4 years ago
Paul ★ 1.4k

dear BioStars users,

I would like to extract from my pair-end fastq files information how many times my read is occurring in my fastq file.

So output could look -

my read (sequence) - how many times I found it in fastq file :

CCGGCTCGC - 140x CTTCGCGCC - 2x

I tried to use awk to comparing all reads to each other, but it does not work very well :-(

Is there any tool or idea how to compare all reads to each other and extract how many times is occurring my reads in fastq file?

Thank you so much for any idea and help! I hope my question is clear..

Paul.

fastq paired-end • 5.0k views
ADD COMMENT
3
Entering edit mode

Have you tried just using FastQC, which does kmer enrichment as part of its normal workflow?

ADD REPLY
0
Entering edit mode

Thank you very much for reply.

I tried FastQC and it works perfectly fine. But their modules - overrepresented sequences is perfect, but any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. And I would like to compare all read length (maybe with a few mismatches).

It is hard to say, if it is possible..

ADD REPLY
0
Entering edit mode

Ah, I hadn't realized that. Other options that you might look into would be this program or the QRQC package in bioconductor, which is supposed to also do k-mer calculations.

ADD REPLY
3
Entering edit mode
8.4 years ago
Pavel Senin ★ 1.9k

wordcount can help in this

$zcat test.fq.gz | head -5000 | seqtk fq2fa - >test.fa

$ wordcount test.fa -wordsize=9

output:

$ head test.wordcount
AACAACAAC    44
CAACAACAA    44
ACAACAACA    44
TCACTCACT    20
CACTCACTC    19
TAGGGTTAG    17
TTAGGGTTA    17
CTCACTCAC    17
GGGTTAGGG    15
GTTAGGGTT    15
ADD COMMENT
0
Entering edit mode

That's a quite nice solution!

ADD REPLY
0
Entering edit mode

Thank you so much for time and help!!

P.

ADD REPLY
0
Entering edit mode
8.4 years ago
Assa Yeroslaviz ★ 1.7k

try this one here. Haven't tried it myself yet, but this is something I also need and will try it soon.

ADD COMMENT
0
Entering edit mode

Thank you for link... I gonna try it and let you know...

ADD REPLY
0
Entering edit mode
8.4 years ago

This has already been answered here C: script to calculate length distribution of a collapsed fasta file

The input you have is fastq which is different from fasta. so modify the solution accordingly.

ADD COMMENT
0
Entering edit mode

But If I am not mistaken, here the fastq file is not collapsed.

ADD REPLY
0
Entering edit mode
8.4 years ago
Perry • 0

I presume you want to count the frequency of the read you specified in your fastq files.

Here is a little perl script (script.pl) you can take use:


#!/usr/bin/perl -w
use strict;
# if your fastq file is gzip compressed, you may need to add PerlIO::gzip module
use PerlIO::gzip;

# usage: perl script.pl <specified_seq> <R1.fastq R2.fastq | R1.fastq.gz R2.fastq.gz>

my $seq = $ARGV[0];
my $count = 0;

foreach my $i (@ARGV[1..$#ARGV]){
    open IN,"<:gzip",$i || open IN,$i || die $!;
    while(<IN>){
        # read in four lines of each read in fastq format file.
        <IN>;
        my $read = <IN>;
        <IN>;<IN>;
        if ($read eq $seq){
            $count++;
        }
    }
    close IN;
}

# output counts
print "$seq occurs $count times\n";

ADD COMMENT

Login before adding your answer.

Traffic: 2566 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6