Fastq Quality Read And Score Length Check
3
4
Entering edit mode
9.1 years ago

I have a huge Fastq file and I want to check if the sequence length and the quality score length is the same for each read. That is, for each read in the Fastq file (a read has 4 lines as we are aware), I want to check if the length of the 2nd line (sequence) and the length of the 4th line (quality score) are the same.

What is the most efficient way to go about it ?

Any suggestions will be greatly appreciated.

fastq sequence quality length • 13k views
ADD COMMENT
0
Entering edit mode

Fastq With Format Errors answer to a similar question may be helpful if you do find formatting errors.

ADD REPLY
6
Entering edit mode
9.1 years ago

The answer to your question is using awk.

See, for counting the read length, use :

awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' file.fastq > readLength

Output:

 2    ATCTCATTTTTCACGTTTTTTAGTGATTTCATAATTTTTCAAGTCGTAAAGTGGATGTTTCTCATTTTTCATGATT    76
6    GTGTTAGGTGCCAAGATCCTTCTGTGTTTAAAGTGCCTTCAGGGGTGAGTGGTTTTAGTGCCATGTACTGTGGAAA    76
10    TGTATGGAGTGTTTAGATGCTCCCTTCTAGCTTCTTTTCAGACAATTTACTAGGATGATTGACAGAGAATTGCTTT    76
14    ACTGTAGCCCCGCCACACTGCCACTGGTATAGGACACCAAAAAGCACTCTCTAATAAATACAGAAAGATTAGCAGA    76
18    ACGCTGGAAACTTAGACTAGCAGTTAGAATGTAGTGGTTGATAGAGCATCCAGCAACCAGGGGTAAACTCGGCCTC    76
22    TTTTTAGTGATTTCGTCATTTTTCAAGTCGTCAAGTGGATGTTTCTCATTTTCCATGATTTTCAGTTTTCCTCGCC    76
26    CAGCAAGCGAAATAGTTATGGCCCATGTTTTCTGGATATCAATTACACAACGGGGAGATTTGGAAGGAGAACATGC    76
30    CATTTCACATGAGTGAGACTCAAGGCTTGGGATGTGTAATACAAGATCAACACTTGGGACAAATTCCAGCATGGGG    76


where first column is the line number in the original fastq file, 2nd is the read in light and third the length of this read

and for the quality length, use :

awk '{if(NR%4==0) print NR"\t"$0"\t"length($0)}' file.fastq > qualityLength

 4    1==DDBEEHHHHDHIHIIIIIFHEHAGGHGIHIGHIIIIGGIIIGAFIIIGDGIHIIHCIIIHEHGEIII@HHGC?    76
8    BBBFFFFFFHHHHHIGIJJIJIJJJIJIJJIJIHHGHIIIIJJJJ?DFHDFHHIIJJJIIIIJJJJJJJHFHGFGC    76
12    ??@D?D?DFFFDFGGIGH@>IHG@FGAA+CFF>9CGH@9DDH@GGICH<B@FHIBEH<GG@BF<BB=CA;F)7@GC    76
16    @@@DDDDDF>FFHIIGIGEGG??BFEC?9CFFE9DDGAGDDDF<CBE@FEHHGGHEDGHGHBA>EEEDCCBCCCCA    76
20    CCCFFFFFGHHHHJJJJJJIHIJHHGJIIIJHIJFHGADHGHGHDGIIIIIIJJIJJJJJJJI@GIHHHHH=BE?@    76
24    BCCFFDFDHHHHHJJJJJJJJJJGIIJJGHIIJIJ?FHCEHDGIJJIJJJJJIJJJJIJJJJIGJIIJJJIIGIGB    76
28    @CCDDFFDHHFHHJFGEGIIIIJAHGEEDGHGICHEEIIJ@HEEGIIJDCHHIIJ>EACEHFF?@DCEDCB>ACCD    76
32    @@@FFDDDHHDDDFHIJGIIGCGGIIIIIJJGHHCF1:FGFGIGGGHHIJJJDHGHIGHIIIGIJIIJIGHGHHHH    76


and finally, use awk again to compare the columns, here the column number 3

awk 'NR==FNR{a[$3]++;next}!a[$3]' readLength qualityLength

This will output the lines where the length is not matched, the output will be from the 2nd file.

FastQC will tell you the overall picture but this will exactly tell you what you want.

HTH

Edit : The code is better now as no need to use diff and to calculate the line number back (as suggested by @Istvan)

ADD COMMENT
1
Entering edit mode

just print out the NR variable on each line - that way one would not need to compute it at all

ADD REPLY
0
Entering edit mode

Good suggestion, edited the code and works better now, no need to use diff and line number calculation. Cheers

ADD REPLY
0
Entering edit mode

Unnecessary use of cat.

ADD REPLY
0
Entering edit mode

edited, late though :)

ADD REPLY
0
Entering edit mode

Please let me know how to plot all read length histogram??

ADD REPLY
3
Entering edit mode
9.1 years ago
JC 12k

simple Perl solution reporting problematic reads:

#!/usr/bin/perl
use strict;
use warning;
my ($read,$seqLen, $qualLen); my$nl = 0;
while (<>)  {
chomp;
$nl++; if ($nl == 1) {
$read =$_;
}
elsif ($nl == 2) {$seqLen = length $_; } elsif ($nl == 4) {
$qualLen = length$_;
print "$read\t$seqLen != $qualLen\n" if ($seqLen != $qualLen);$nl   = 0;
}
}

ADD COMMENT

Login before adding your answer.

Traffic: 2113 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6