Question

Fastq Quality Read And Score Length Check

5

Entering edit mode

11.5 years ago

thecuriousbiologist ▴ 550

I have a huge Fastq file and I want to check if the sequence length and the quality score length is the same for each read. That is, for each read in the Fastq file (a read has 4 lines as we are aware), I want to check if the length of the 2nd line (sequence) and the length of the 4th line (quality score) are the same.

What is the most efficient way to go about it ?

Any suggestions will be greatly appreciated.

fastq sequence quality length • 16k views

ADD COMMENT • link updated 11.5 years ago by Sukhi Singh 11k • written 11.5 years ago by thecuriousbiologist ▴ 550

0

Entering edit mode

Fastq With Format Errors answer to a similar question may be helpful if you do find formatting errors.

ADD REPLY • link 11.5 years ago by SES 8.6k

score 7 · Answer 1 · 2012-11-01

The answer to your question is using awk.

See, for counting the read length, use :

awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' file.fastq > readLength

Output:

 2    ATCTCATTTTTCACGTTTTTTAGTGATTTCATAATTTTTCAAGTCGTAAAGTGGATGTTTCTCATTTTTCATGATT    76
 6    GTGTTAGGTGCCAAGATCCTTCTGTGTTTAAAGTGCCTTCAGGGGTGAGTGGTTTTAGTGCCATGTACTGTGGAAA    76
10    TGTATGGAGTGTTTAGATGCTCCCTTCTAGCTTCTTTTCAGACAATTTACTAGGATGATTGACAGAGAATTGCTTT    76
14    ACTGTAGCCCCGCCACACTGCCACTGGTATAGGACACCAAAAAGCACTCTCTAATAAATACAGAAAGATTAGCAGA    76
18    ACGCTGGAAACTTAGACTAGCAGTTAGAATGTAGTGGTTGATAGAGCATCCAGCAACCAGGGGTAAACTCGGCCTC    76
22    TTTTTAGTGATTTCGTCATTTTTCAAGTCGTCAAGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCCTCGCC    76
26    CAGCAAGCGAAATAGTTATGGCCCATGTTTTCTGGATATCAATTACACAACGGGGAGATTTGGAAGGAGAACATGC    76
30    CATTTCACATGAGTGAGACTCAAGGCTTGGGATGTGTAATACAAGATCAACACTTGGGACAAATTCCAGCATGGGG    76

where first column is the line number in the original fastq file, 2nd is the read in light and third the length of this read

and for the quality length, use :

awk '{if(NR%4==0) print NR"\t"$0"\t"length($0)}' file.fastq > qualityLength

 4    1==DDBEEHHHHDHIHIIIIIFHEHAGGHGIHIGHIIIIGGIIIGAFIIIGDGIHIIHCIIIHEHGEIII@HHGC?    76
 8    BBBFFFFFFHHHHHIGIJJIJIJJJIJIJJIJIHHGHIIIIJJJJ?DFHDFHHIIJJJIIIIJJJJJJJHFHGFGC    76
12    ??@D?D?DFFFDFGGIGH@>IHG@FGAA+CFF>9CGH@9DDH@GGICH<B@FHIBEH<GG@BF<BB=CA;F)7@GC    76
16    @@@DDDDDF>FFHIIGIGEGG??BFEC?9CFFE9DDGAGDDDF<CBE@FEHHGGHEDGHGHBA>EEEDCCBCCCCA    76
20    CCCFFFFFGHHHHJJJJJJIHIJHHGJIIIJHIJFHGADHGHGHDGIIIIIIJJIJJJJJJJI@GIHHHHH=BE?@    76
24    BCCFFDFDHHHHHJJJJJJJJJJGIIJJGHIIJIJ?FHCEHDGIJJIJJJJJIJJJJIJJJJIGJIIJJJIIGIGB    76
28    @CCDDFFDHHFHHJFGEGIIIIJAHGEEDGHGICHEEIIJ@HEEGIIJDCHHIIJ>EACEHFF?@DCEDCB>ACCD    76
32    @@@FFDDDHHDDDFHIJGIIGCGGIIIIIJJGHHCF1:FGFGIGGGHHIJJJDHGHIGHIIIGIJIIJIGHGHHHH    76

and finally, use awk again to compare the columns, here the column number 3

awk 'NR==FNR{a[$3]++;next}!a[$3]' readLength qualityLength

This will output the lines where the length is not matched, the output will be from the 2nd file.

FastQC will tell you the overall picture but this will exactly tell you what you want.

HTH

Edit : The code is better now as no need to use diff and to calculate the line number back (as suggested by @Istvan)

score 4 · Answer 2 · 2012-11-01

simple Perl solution reporting problematic reads:

#!/usr/bin/perl
use strict; 
use warning;
my ($read, $seqLen, $qualLen);
my $nl = 0;
while (<>)  {
    chomp;
    $nl++;
    if ($nl == 1) { 
        $read = $_; 
    }
    elsif ($nl == 2) { 
        $seqLen = length $_; 
    }
    elsif ($nl == 4) { 
        $qualLen = length $_;
        print "$read\t$seqLen != $qualLen\n" if ($seqLen != $qualLen);
        $nl   = 0;
    }
}