Question: Paired End Vs Single End Detection
3
gravatar for Lee Katz
7.2 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

Hi, I am wondering if anyone has a subroutine or function (hopefully in Perl or pseudocode) to detect whether a fastq file is a shuffled paired-end or not. I think I've seen a few different syntaxes on headers and so I feel like it requires a little more knowledge on their formats than what I have right now.

Yes, I am still looking into it, but I am wondering if this has already been done, to save me time...

perl illumina paired-end read • 4.0k views
ADD COMMENTlink modified 4.6 years ago by Brian Bushnell16k • written 7.2 years ago by Lee Katz2.9k

Do you mean that you don't have information about the reads in their header?

ADD REPLYlink written 7.2 years ago by Arun2.3k

I have those deflines, sure, but I am wondering if anyone has already coded all the logic. Looking at the wikipedia page, if it is casava 1.7 then pairs are indicated by \1 or \2 and if it is 1.8 then it is indicated by :1 or :2, etc. There seems to be a few fine points in deciding if it is paired end, and I am wondering if all of these fine points are already coded.

ADD REPLYlink written 7.2 years ago by Lee Katz2.9k
1

Edit: I get it now as to what you want. To my knowledge, there are two possibilities, both of which you have already mentioned. The older #0/1 format and the new [12]:[NY]:barcode format. A lazy way to check for which format it is would be to read in the first line and check for the existence of a space in the line. The older format shouldn't have one.

Old answer: I don't know what you mean by "all of these fine points are already coded"? But if it is this you want to check then you can just use awk as follows (for CASAVA 1.8):

awk '(NR+3)%4 == 0 && $0 ~ /1:[NY].*/ {print}' myfile.fq # prints all reads first in your paired fastq file
awk '(NR+3)%4 == 0 && $0 ~ /2:[NY].*/ {print}' myfile.fq # prints all reads second in your paired fastq file

If you don't have 2, then the reads are much likely SE. You can pipe | wc -l to count the number of lines after the awk command to ensure that you have equal reads for both pairs. Is this what you're asking? If not, sorry!

ADD REPLYlink modified 7.2 years ago • written 7.2 years ago by Arun2.3k

That's a pretty nifty awk script! But I am indeed looking for something more in-depth. Something that says, ok, the first read and second read in the file are from the same machine, etc but just differ by the member number (1 or 2)

ADD REPLYlink written 7.2 years ago by Lee Katz2.9k

Does your fastq file contain reads from more than 1 type of sequencer? (i.e., casava 1.8 and casava < 1.8)?

ADD REPLYlink written 7.2 years ago by Arun2.3k

I have fastq files from each of these sequencers.

ADD REPLYlink written 7.2 years ago by Lee Katz2.9k

usually, one tends to sequence each pair in different steps, so the usual output would be a pair of fastq files, one file per tag. if at certain point these 2 are merged I would expect to see in the headers codes like F3 and F5 for instance, although I haven't come across any paired-end single fastq file to certify this.

ADD REPLYlink written 7.2 years ago by Jorge Amigo11k
3
gravatar for Lee Katz
7.2 years ago by
Lee Katz2.9k
Atlanta, GA
Lee Katz2.9k wrote:

I think that I have a good answer but I haven't tested it on all of my data. I got a little lazy with the Casava 1.7 subroutine too. I hope it helps anyone else who is in my position, and I hope that anyone who uses it will report any problems with it.

usage: my $isPaired=is_fastqFE("file.fastq",{checkFirst=>20});

#!/usr/bin/env perl

# author: Lee Katz <lkatz@cdc.gov>
# See whether a fastq file is paired end or not. It must be in a velvet-style shuffled format.
# In other words, the left and right sides of a pair follow each other in the file.
# params: fastq file and settings
# fastq file can be gzip'd
# settings:  checkFirst is an integer to check the first X deflines
sub is_fastqPE($;$){
  my($fastq,$settings)=@_;
  $$settings{checkFirst}||=20;

  # it is paired end if it validates with either system
  my $is_pairedEnd=_is_illuminaPECasava18($fastq,$settings) || _is_illuminaPECasava17($fastq,$settings);

  die "paired: $is_pairedEnd";
  return $is_pairedEnd;
}

sub _is_illuminaPECasava18{
  my($fastq,$settings)=@_;
  my $numEntriesToCheck=$$settings{checkFirst}||20;

  my $numEntries=0;
  my $fp;
  if($fastq=~/\.gz$/){
    open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
  }else{
    open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
  }
  while(<$fp>){
    chomp;
    s/^@//;
    #my($flowcell,$tile,$x,$y,$index,$member)= # wrong: this is an older version of illumina and will not be used here
    my($instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence)=split(/:/,$_);
    my $discard;
    $discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
    my($y,$member)=split(/\s+/,$yandmember);

    # if all information is the same, except the member (1 to 2), then it is still paired until this point.
    my $secondId=<$fp>;
    chomp $secondId;
    $secondId=~s/^@//;
    my($inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2)=split(/:/,$secondId);
    $discard=<$fp> for(1..3); # discard the sequence and quality of the read for these purposes
    my($y2,$member2)=split(/\s+/,$yandmember2);

    if($instrument ne $inst2 || $runid ne $runid2 || $flowcellid ne $fcid2 || $tile ne $tile2 || $member!=1 || $member2!=2){
      #logmsg "Failed!\n$instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence\n$inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2\n";
      close $fp;
      return 0;
    }

    $numEntries+=2;
    last if($numEntries>$numEntriesToCheck);
  }

  close $fp;

  return 1;
}

sub _is_illuminaPECasava17{
  my($fastq,$settings)=@_;
  # 20 reads is probably enough to make sure that it's shuffled
  my $numEntriesToCheck=$$settings{checkFirst}||20;
  my $numEntries=0;
  my $fp;
  if($fastq=~/\.gz$/){
    open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
  }else{
    open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
  }
  while(my $read1Id=<$fp>){
    my $discard;
    $discard=<$fp> for(1..3);
    my $read2Id=<$fp>;
    $discard=<$fp> for(1..3);

    if($read1Id!~/\/1$/ || $read2Id!~/\/2$/){
      close $fp;
      return 0;
    }

    $numEntries+=2;
    last if($numEntries>=$numEntriesToCheck);
  }
  close $fp;

  return 1;
}
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Lee Katz2.9k

I updated these subroutines, for whoever is using it.  Please give me feedback on the bioperl page or by email.

http://www.bioperl.org/wiki/Lee_Katz/detectPeReads

ADD REPLYlink written 5.1 years ago by Lee Katz2.9k
1
gravatar for jingtao09
7.2 years ago by
jingtao09110
jingtao09110 wrote:

Hi, I have a simple and reasonable idea. 1) if you have the reference. mapping the reads back to the reference, calculate the insert distances between odd line number and even line number by catching the start mapping positons. plot all the insert size, if you do see it in a certain normal distribution then yes. 2) if you dont have a reference. do a very simple assembly using your "single end " reads, get the draft contigs. do the same as above.

This idea will bypass examing the header information step. Good luck

ADD COMMENTlink written 7.2 years ago by jingtao09110
1
gravatar for Brian Bushnell
4.6 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I don't really understand why ancient threads get randomly bumped to the top by Biostar, but since it's here, I'll share my program for file format detection, which is fairly robust.  It detects file format (fasta, fastq, sam, scarf), interleaved vs non-interleaved, and quality encoding.

testformat.sh reads.fastq

http://sourceforge.net/projects/bbmap/

ADD COMMENTlink written 4.6 years ago by Brian Bushnell16k
1

There's a background script that will randomly bump posts. This is from Istvan in another thread (on the Galaxy instance of Biostars, since I can't find the one time he mentioned this here):

"There is a so called "bump" feature that people on Biostars have requested. It is to remind them of older posts and freshen the view during slower days. The bumping takes place every six hours. It has a 50% chance to choose either one random question from the entire corpus or one unanswered random question from the past 10 weeks.

This latter  does not take into account the accepted state, just whether the post has an answer or not. In general only people familiar with Q&A concepts do accept answers, it is not the most obvious interaction especially for newcomers."

ADD REPLYlink written 4.6 years ago by Devon Ryan91k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1344 users visited in the last hour