Question

Paired End Vs Single End Detection

3

Entering edit mode

12.4 years ago

Lee Katz ★ 3.2k

Hi, I am wondering if anyone has a subroutine or function (hopefully in Perl or pseudocode) to detect whether a fastq file is a shuffled paired-end or not. I think I've seen a few different syntaxes on headers and so I feel like it requires a little more knowledge on their formats than what I have right now.

Yes, I am still looking into it, but I am wondering if this has already been done, to save me time...

illumina paired-end read perl • 7.8k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 12.4 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Do you mean that you don't have information about the reads in their header?

ADD REPLY • link 12.4 years ago by Arun 2.4k

0

Entering edit mode

I have those deflines, sure, but I am wondering if anyone has already coded all the logic. Looking at the wikipedia page, if it is casava 1.7 then pairs are indicated by \1 or \2 and if it is 1.8 then it is indicated by :1 or :2, etc. There seems to be a few fine points in deciding if it is paired end, and I am wondering if all of these fine points are already coded.

ADD REPLY • link 12.4 years ago by Lee Katz ★ 3.2k

1

Entering edit mode

Edit: I get it now as to what you want. To my knowledge, there are two possibilities, both of which you have already mentioned. The older #0/1 format and the new [12]:[NY]:barcode format. A lazy way to check for which format it is would be to read in the first line and check for the existence of a space in the line. The older format shouldn't have one.

Old answer: I don't know what you mean by "all of these fine points are already coded"? But if it is this you want to check then you can just use awk as follows (for CASAVA 1.8):

awk '(NR+3)%4 == 0 && $0 ~ /1:[NY].*/ {print}' myfile.fq # prints all reads first in your paired fastq file
awk '(NR+3)%4 == 0 && $0 ~ /2:[NY].*/ {print}' myfile.fq # prints all reads second in your paired fastq file

If you don't have 2, then the reads are much likely SE. You can pipe | wc -l to count the number of lines after the awk command to ensure that you have equal reads for both pairs. Is this what you're asking? If not, sorry!

ADD REPLY • link 12.4 years ago by Arun 2.4k

0

Entering edit mode

That's a pretty nifty awk script! But I am indeed looking for something more in-depth. Something that says, ok, the first read and second read in the file are from the same machine, etc but just differ by the member number (1 or 2)

ADD REPLY • link 12.4 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Does your fastq file contain reads from more than 1 type of sequencer? (i.e., casava 1.8 and casava < 1.8)?

ADD REPLY • link 12.4 years ago by Arun 2.4k

0

Entering edit mode

I have fastq files from each of these sequencers.

ADD REPLY • link 12.4 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

usually, one tends to sequence each pair in different steps, so the usual output would be a pair of fastq files, one file per tag. if at certain point these 2 are merged I would expect to see in the headers codes like F3 and F5 for instance, although I haven't come across any paired-end single fastq file to certify this.

ADD REPLY • link 12.4 years ago by Jorge Amigo 14k

Ram · Answer 1 · 2012-05-30

I think that I have a good answer but I haven't tested it on all of my data. I got a little lazy with the Casava 1.7 subroutine too. I hope it helps anyone else who is in my position, and I hope that anyone who uses it will report any problems with it.

usage: my $isPaired=is_fastqFE("file.fastq",{checkFirst=>20});

#!/usr/bin/env perl

# author: Lee Katz <lkatz@cdc.gov>
# See whether a fastq file is paired end or not. It must be in a velvet-style shuffled format.
# In other words, the left and right sides of a pair follow each other in the file.
# params: fastq file and settings
# fastq file can be gzip'd
# settings:  checkFirst is an integer to check the first X deflines
sub is_fastqPE($;$){
  my($fastq,$settings)=@_;
  $$settings{checkFirst}||=20;

  # it is paired end if it validates with either system
  my $is_pairedEnd=_is_illuminaPECasava18($fastq,$settings) || _is_illuminaPECasava17($fastq,$settings);

  die "paired: $is_pairedEnd";
  return $is_pairedEnd;
}

sub _is_illuminaPECasava18{
  my($fastq,$settings)=@_;
  my $numEntriesToCheck=$$settings{checkFirst}||20;

  my $numEntries=0;
  my $fp;
  if($fastq=~/\.gz$/){
    open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
  }else{
    open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
  }
  while(<$fp>){
    chomp;
    s/^@//;
    #my($flowcell,$tile,$x,$y,$index,$member)= # wrong: this is an older version of illumina and will not be used here
    my($instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence)=split(/:/,$_);
    my $discard;
    $discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
    my($y,$member)=split(/\s+/,$yandmember);

    # if all information is the same, except the member (1 to 2), then it is still paired until this point.
    my $secondId=<$fp>;
    chomp $secondId;
    $secondId=~s/^@//;
    my($inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2)=split(/:/,$secondId);
    $discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
    my($y2,$member2)=split(/\s+/,$yandmember2);

    if($instrument ne $inst2 || $runid ne $runid2 || $flowcellid ne $fcid2 || $tile ne $tile2 || $member!=1 || $member2!=2){
      #logmsg "Failed!\n$instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence\n$inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2\n";
      close $fp;
      return 0;
    }

    $numEntries+=2;
    last if($numEntries>$numEntriesToCheck);
  }

  close $fp;

  return 1;
}

sub _is_illuminaPECasava17{
  my($fastq,$settings)=@_;
  # 20 reads is probably enough to make sure that it's shuffled
  my $numEntriesToCheck=$$settings{checkFirst}||20;
  my $numEntries=0;
  my $fp;
  if($fastq=~/\.gz$/){
    open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
  }else{
    open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
  }
  while(my $read1Id=<$fp>){
    my $discard;
    $discard=<$fp> for(1..3);
    my $read2Id=<$fp>;
    $discard=<$fp> for(1..3);

    if($read1Id!~/\/1$/ || $read2Id!~/\/2$/){
      close $fp;
      return 0;
    }

    $numEntries+=2;
    last if($numEntries>=$numEntriesToCheck);
  }
  close $fp;

  return 1;
}

Ram · Answer 2 · 2015-01-14

2

Entering edit mode

9.8 years ago

Brian Bushnell 20k

I don't really understand why ancient threads get randomly bumped to the top by Biostar, but since it's here, I'll share my program for file format detection, which is fairly robust. It detects file format (fasta, fastq, sam, scarf), interleaved vs non-interleaved, and quality encoding.

testformat.sh reads.fastq

http://sourceforge.net/projects/bbmap/

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.8 years ago by Brian Bushnell 20k

2

Entering edit mode

There's a background script that will randomly bump posts. This is from Istvan in another thread (on the Galaxy instance of Biostars, since I can't find the one time he mentioned this here):

There is a so called "bump" feature that people on Biostars have requested. It is to remind them of older posts and freshen the view during slower days. The bumping takes place every six hours. It has a 50% chance to choose either one random question from the entire corpus or one unanswered random question from the past 10 weeks.

This latter does not take into account the accepted state, just whether the post has an answer or not. In general only people familiar with Q&A concepts do accept answers, it is not the most obvious interaction especially for newcomers.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Devon Ryan 104k

Ram · Answer 3 · 2012-05-31

Hi

I have a simple and reasonable idea.

1) if you have the reference.

mapping the reads back to the reference, calculate the insert distances between odd line number and even line number by catching the start mapping positions. plot all the insert size, if you do see it in a certain normal distribution then yes.

2) if you don't have a reference.

Do a very simple assembly using your "single end " reads, get the draft contigs. do the same as above.

This idea will bypass examining the header information step.

Good luck