I think that I have a good answer but I haven't tested it on all of my data. I got a little lazy with the Casava 1.7 subroutine too. I hope it helps anyone else who is in my position, and I hope that anyone who uses it will report any problems with it.
usage: my $isPaired=is_fastqFE("file.fastq",{checkFirst=>20});
#!/usr/bin/env perl
# author: Lee Katz <lkatz@cdc.gov>
# See whether a fastq file is paired end or not. It must be in a velvet-style shuffled format.
# In other words, the left and right sides of a pair follow each other in the file.
# params: fastq file and settings
# fastq file can be gzip'd
# settings: checkFirst is an integer to check the first X deflines
sub is_fastqPE($;$){
my($fastq,$settings)=@_;
$$settings{checkFirst}||=20;
# it is paired end if it validates with either system
my $is_pairedEnd=_is_illuminaPECasava18($fastq,$settings) || _is_illuminaPECasava17($fastq,$settings);
die "paired: $is_pairedEnd";
return $is_pairedEnd;
}
sub _is_illuminaPECasava18{
my($fastq,$settings)=@_;
my $numEntriesToCheck=$$settings{checkFirst}||20;
my $numEntries=0;
my $fp;
if($fastq=~/\.gz$/){
open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
}else{
open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
}
while(<$fp>){
chomp;
s/^@//;
#my($flowcell,$tile,$x,$y,$index,$member)= # wrong: this is an older version of illumina and will not be used here
my($instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence)=split(/:/,$_);
my $discard;
$discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
my($y,$member)=split(/\s+/,$yandmember);
# if all information is the same, except the member (1 to 2), then it is still paired until this point.
my $secondId=<$fp>;
chomp $secondId;
$secondId=~s/^@//;
my($inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2)=split(/:/,$secondId);
$discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
my($y2,$member2)=split(/\s+/,$yandmember2);
if($instrument ne $inst2 || $runid ne $runid2 || $flowcellid ne $fcid2 || $tile ne $tile2 || $member!=1 || $member2!=2){
#logmsg "Failed!\n$instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence\n$inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2\n";
close $fp;
return 0;
}
$numEntries+=2;
last if($numEntries>$numEntriesToCheck);
}
close $fp;
return 1;
}
sub _is_illuminaPECasava17{
my($fastq,$settings)=@_;
# 20 reads is probably enough to make sure that it's shuffled
my $numEntriesToCheck=$$settings{checkFirst}||20;
my $numEntries=0;
my $fp;
if($fastq=~/\.gz$/){
open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
}else{
open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
}
while(my $read1Id=<$fp>){
my $discard;
$discard=<$fp> for(1..3);
my $read2Id=<$fp>;
$discard=<$fp> for(1..3);
if($read1Id!~/\/1$/ || $read2Id!~/\/2$/){
close $fp;
return 0;
}
$numEntries+=2;
last if($numEntries>=$numEntriesToCheck);
}
close $fp;
return 1;
}
Do you mean that you don't have information about the reads in their header?
I have those deflines, sure, but I am wondering if anyone has already coded all the logic. Looking at the wikipedia page, if it is casava 1.7 then pairs are indicated by \1 or \2 and if it is 1.8 then it is indicated by :1 or :2, etc. There seems to be a few fine points in deciding if it is paired end, and I am wondering if all of these fine points are already coded.
Edit:
I get it now as to what you want. To my knowledge, there are two possibilities, both of which you have already mentioned. The older#0/1
format and the new[12]:[NY]:barcode
format. A lazy way to check for which format it is would be to read in the first line and check for the existence of aspace
in the line. The older format shouldn't have one.Old answer:
I don't know what you mean by "all of these fine points are already coded"? But if it is this you want to check then you can just use awk as follows (for CASAVA 1.8):If you don't have 2, then the reads are much likely SE. You can pipe
| wc -l
to count the number of lines after the awk command to ensure that you have equal reads for both pairs. Is this what you're asking? If not, sorry!That's a pretty nifty awk script! But I am indeed looking for something more in-depth. Something that says, ok, the first read and second read in the file are from the same machine, etc but just differ by the member number (1 or 2)
Does your fastq file contain reads from more than 1 type of sequencer? (i.e., casava 1.8 and casava < 1.8)?
I have fastq files from each of these sequencers.
usually, one tends to sequence each pair in different steps, so the usual output would be a pair of fastq files, one file per tag. if at certain point these 2 are merged I would expect to see in the headers codes like F3 and F5 for instance, although I haven't come across any paired-end single fastq file to certify this.