Question

How To Extract Information From Fastq Pair End Files

0

Entering edit mode

10.3 years ago

Paul ★ 1.5k

dear BioStars users,

I would like to extract from my pair-end fastq files information how many times my read is occurring in my fastq file.

So output could look -

my read (sequence) - how many times I found it in fastq file :

CCGGCTCGC - 140x CTTCGCGCC - 2x

I tried to use awk to comparing all reads to each other, but it does not work very well :-(

Is there any tool or idea how to compare all reads to each other and extract how many times is occurring my reads in fastq file?

Thank you so much for any idea and help! I hope my question is clear..

Paul.

fastq paired-end • 5.8k views

ADD COMMENT • link updated 10.3 years ago by Perry • 0 • written 10.3 years ago by Paul ★ 1.5k

3

Entering edit mode

Have you tried just using FastQC, which does kmer enrichment as part of its normal workflow?

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you very much for reply.

I tried FastQC and it works perfectly fine. But their modules - overrepresented sequences is perfect, but any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. And I would like to compare all read length (maybe with a few mismatches).

It is hard to say, if it is possible..

ADD REPLY • link 10.3 years ago by Paul ★ 1.5k

0

Entering edit mode

Ah, I hadn't realized that. Other options that you might look into would be this program or the QRQC package in bioconductor, which is supposed to also do k-mer calculations.

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

score 3 · Answer 1 · 2014-01-10

3

Entering edit mode

10.3 years ago

Pavel Senin ★ 1.9k

wordcount can help in this

$zcat test.fq.gz | head -5000 | seqtk fq2fa - >test.fa

$ wordcount test.fa -wordsize=9

output:

$ head test.wordcount
AACAACAAC    44
CAACAACAA    44
ACAACAACA    44
TCACTCACT    20
CACTCACTC    19
TAGGGTTAG    17
TTAGGGTTA    17
CTCACTCAC    17
GGGTTAGGG    15
GTTAGGGTT    15

ADD COMMENT • link 10.3 years ago by Pavel Senin ★ 1.9k

0

Entering edit mode

That's a quite nice solution!

ADD REPLY • link 10.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you so much for time and help!!

P.

ADD REPLY • link 10.3 years ago by Paul ★ 1.5k

score 0 · Answer 2 · 2014-01-09

0

Entering edit mode

10.3 years ago

Assa Yeroslaviz ★ 1.8k

try this one here. Haven't tried it myself yet, but this is something I also need and will try it soon.

ADD COMMENT • link 10.3 years ago by Assa Yeroslaviz ★ 1.8k

0

Entering edit mode

Thank you for link... I gonna try it and let you know...

ADD REPLY • link 10.3 years ago by Paul ★ 1.5k

score 0 · Answer 3 · 2014-01-09

0

Entering edit mode

10.3 years ago

Ashutosh Pandey 12k

This has already been answered here C: script to calculate length distribution of a collapsed fasta file

The input you have is fastq which is different from fasta. so modify the solution accordingly.

ADD COMMENT • link 10.3 years ago by Ashutosh Pandey 12k

0

Entering edit mode

But If I am not mistaken, here the fastq file is not collapsed.

ADD REPLY • link 10.3 years ago by Assa Yeroslaviz ★ 1.8k

score 0 · Answer 4 · 2014-01-10

I presume you want to count the frequency of the read you specified in your fastq files.

Here is a little perl script (script.pl) you can take use:


#!/usr/bin/perl -w
use strict;
# if your fastq file is gzip compressed, you may need to add PerlIO::gzip module
use PerlIO::gzip;

# usage: perl script.pl <specified_seq> <R1.fastq R2.fastq | R1.fastq.gz R2.fastq.gz>

my $seq = $ARGV[0];
my $count = 0;

foreach my $i (@ARGV[1..$#ARGV]){
    open IN,"<:gzip",$i || open IN,$i || die $!;
    while(<IN>){
        # read in four lines of each read in fastq format file.
        <IN>;
        my $read = <IN>;
        <IN>;<IN>;
        if ($read eq $seq){
            $count++;
        }
    }
    close IN;
}

# output counts
print "$seq occurs $count times\n";