Question: Fastq Quality Read And Score Length Check
4
gravatar for thecuriousbiologist
6.6 years ago by
United States
thecuriousbiologist430 wrote:

I have a huge Fastq file and I want to check if the sequence length and the quality score length is the same for each read. That is, for each read in the Fastq file (a read has 4 lines as we are aware), I want to check if the length of the 2nd line (sequence) and the length of the 4th line (quality score) are the same.

What is the most efficient way to go about it ?

Any suggestions will be greatly appreciated.

sequence fastq length quality • 10k views
ADD COMMENTlink modified 6.6 years ago by Sukhdeep Singh9.7k • written 6.6 years ago by thecuriousbiologist430

Fastq With Format Errors answer to a similar question may be helpful if you do find formatting errors.

ADD REPLYlink written 6.6 years ago by SES8.2k
5
gravatar for Sukhdeep Singh
6.6 years ago by
Sukhdeep Singh9.7k
Netherlands
Sukhdeep Singh9.7k wrote:

The answer to your question is using awk.

See, for counting the read length, use :

awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' file.fastq > readLength

Output:

 2    ATCTCATTTTTCACGTTTTTTAGTGATTTCATAATTTTTCAAGTCGTAAAGTGGATGTTTCTCATTTTTCATGATT    76
 6    GTGTTAGGTGCCAAGATCCTTCTGTGTTTAAAGTGCCTTCAGGGGTGAGTGGTTTTAGTGCCATGTACTGTGGAAA    76
10    TGTATGGAGTGTTTAGATGCTCCCTTCTAGCTTCTTTTCAGACAATTTACTAGGATGATTGACAGAGAATTGCTTT    76
14    ACTGTAGCCCCGCCACACTGCCACTGGTATAGGACACCAAAAAGCACTCTCTAATAAATACAGAAAGATTAGCAGA    76
18    ACGCTGGAAACTTAGACTAGCAGTTAGAATGTAGTGGTTGATAGAGCATCCAGCAACCAGGGGTAAACTCGGCCTC    76
22    TTTTTAGTGATTTCGTCATTTTTCAAGTCGTCAAGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCCTCGCC    76
26    CAGCAAGCGAAATAGTTATGGCCCATGTTTTCTGGATATCAATTACACAACGGGGAGATTTGGAAGGAGAACATGC    76
30    CATTTCACATGAGTGAGACTCAAGGCTTGGGATGTGTAATACAAGATCAACACTTGGGACAAATTCCAGCATGGGG    76

where first column is the line number in the original fastq file, 2nd is the read in light and third the length of this read

and for the quality length, use :

awk '{if(NR%4==0) print NR"\t"$0"\t"length($0)}' file.fastq > qualityLength

 4    1==DDBEEHHHHDHIHIIIIIFHEHAGGHGIHIGHIIIIGGIIIGAFIIIGDGIHIIHCIIIHEHGEIII@HHGC?    76
 8    BBBFFFFFFHHHHHIGIJJIJIJJJIJIJJIJIHHGHIIIIJJJJ?DFHDFHHIIJJJIIIIJJJJJJJHFHGFGC    76
12    ??@D?D?DFFFDFGGIGH@>IHG@FGAA+CFF>9CGH@9DDH@GGICH<B@FHIBEH<GG@BF<BB=CA;F)7@GC    76
16    @@@DDDDDF>FFHIIGIGEGG??BFEC?9CFFE9DDGAGDDDF<CBE@FEHHGGHEDGHGHBA>EEEDCCBCCCCA    76
20    CCCFFFFFGHHHHJJJJJJIHIJHHGJIIIJHIJFHGADHGHGHDGIIIIIIJJIJJJJJJJI@GIHHHHH=BE?@    76
24    BCCFFDFDHHHHHJJJJJJJJJJGIIJJGHIIJIJ?FHCEHDGIJJIJJJJJIJJJJIJJJJIGJIIJJJIIGIGB    76
28    @CCDDFFDHHFHHJFGEGIIIIJAHGEEDGHGICHEEIIJ@HEEGIIJDCHHIIJ>EACEHFF?@DCEDCB>ACCD    76
32    @@@FFDDDHHDDDFHIJGIIGCGGIIIIIJJGHHCF1:FGFGIGGGHHIJJJDHGHIGHIIIGIJIIJIGHGHHHH    76

and finally, use awk again to compare the columns, here the column number 3

awk 'NR==FNR{a[$3]++;next}!a[$3]' readLength qualityLength

This will output the lines where the length is not matched, the output will be from the 2nd file.

FastQC will tell you the overall picture but this will exactly tell you what you want.

HTH

Edit : The code is better now as no need to use diff and to calculate the line number back (as suggested by @Istvan)

ADD COMMENTlink modified 6.5 years ago • written 6.6 years ago by Sukhdeep Singh9.7k
1

just print out the NR variable on each line - that way one would not need to compute it at all

ADD REPLYlink written 6.6 years ago by Istvan Albert ♦♦ 80k

Good suggestion, edited the code and works better now, no need to use diff and line number calculation. Cheers

ADD REPLYlink written 6.6 years ago by Sukhdeep Singh9.7k

Unnecessary use of cat.

ADD REPLYlink written 6.6 years ago by SES8.2k

edited, late though :)

ADD REPLYlink written 6.5 years ago by Sukhdeep Singh9.7k

Please let me know how to plot all read length histogram??

ADD REPLYlink written 5.5 years ago by HG1.1k
3
gravatar for JC
6.6 years ago by
JC7.8k
Mexico
JC7.8k wrote:

simple Perl solution reporting problematic reads:

#!/usr/bin/perl
use strict; 
use warning;
my ($read, $seqLen, $qualLen);
my $nl = 0;
while (<>)  {
    chomp;
    $nl++;
    if ($nl == 1) { 
        $read = $_; 
    }
    elsif ($nl == 2) { 
        $seqLen = length $_; 
    }
    elsif ($nl == 4) { 
        $qualLen = length $_;
        print "$read\t$seqLen != $qualLen\n" if ($seqLen != $qualLen);
        $nl   = 0;
    }
}
ADD COMMENTlink modified 6.6 years ago • written 6.6 years ago by JC7.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 934 users visited in the last hour