Question: Getting unmapped paired-end reads as fastq from unmapped.bam obtained with tophat2
0
gravatar for cris.kgs
8 weeks ago by
cris.kgs0
Canada
cris.kgs0 wrote:

Hi, I used tophat2 to align paired-end reads to Arabidopsis genome and I want to use the unmapped.bam file to obtain the unmapped reads so I can align to a transgene. The output gives the name of the reads without the paired-end information, so I tried several scripts to extract the reads from the fastq files present on the unmapped.bam, but they did not work. I tried using HISAT2 and bowtie2, but the statistics are completely different, such as no concordant alignment when tophat2 would give me above 95% alignment rate and above 90% of concordant alignment. I want to go back to tophat2, but I need the unmapped reads.

For me the best would be continue using tophat but having a way to filter the fastq files with the unmapped.bam. Does anyone have an idea of how to do that?

Here is the code I used for the alignments: tophat2

tophat -o Control/Control_5 --library-type fr-firststrand -r 60 --mate-std-dev 50 index/TAIR10 CNT-5P_R1.fastq CNT-5P_R2.fastq

hitsat2

hisat2 --no-mixed -X 310 -x sat_index/hisat_tair -1 CNT-5P_R1.fastq -2 CNT-5P_R2.fastq -S Control/Control_5.sam --summary-file Control/Control_5.txt

bowtie2

bowtie --no-mixed -I 210 -X 310 -x index/TAIR10 -1 CNT-5P_R1.fastq -2 CNT-5P_R2.fastq -S Control/Control_5.sam
ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by cris.kgs0

Hello,

First of all, do not use tophat, it was release in 2008 and Hisat2 does a better job

Then,

The output gives the name of the reads without the paired-end information

You mean the bam file ? From which software ?

so I tried several scripts...but they did not work

Please share what you try, that will help to understand what you want

I tried using HISAT2 and bowtie2, but the statistics are completely different, such as no concordant alignment when tophat2 would give me above 95% alignment rate and above 90% of concordant alignment

I don't get the point ? Could you add an example ?

-I 210 -X 310

Why did you involve these two parameters ?

ADD REPLYlink written 8 weeks ago by Bastien Hervé3.7k

I just started hearing about HISAT2 a few months ago, so I wasn't sure I wanted to try it. As I was talking about tophat, the output is the unmapped.bam from tophat.

I tried this script from http://seqanswers.com/forums/showthread.php?t=6847&referrerid=2547

#!/usr/bin/perl
use warnings;
use strict;

my ($fastq,$sam,$outfile) = @ARGV;

unless ($outfile) {
  die "Usage is filter_unmapped_reads.pl [FastQ file] [SAM File] [File for unmapped reads]\n";
}

if (-e $outfile) {
  die "Won't overwrite an existing file, delete it first!";
}

open (FASTQ,$fastq) or die "Can't open fastq file: $!";
open (SAM,$sam) or die "Can't open SAM file: $!";
open (OUT,'>',$outfile) or die "Can't write to $outfile: $!";

my $ids = read_ids();

filter_fastq($ids);

close OUT or die "Can't write to $outfile: $!";


sub filter_fastq {

  warn "Filtering FastQ file\n";

  my ($ids) = @_;

  while (<FASTQ>) {

    if (/^@(\S+)/) {
      my $id = $1;

      # Remove the end designator from paired end reads
      $id =~ s/\/\d+$//;

      my $seq = <FASTQ>;
      my $id2 = <FASTQ>;
      my $qual = <FASTQ>;


      unless (exists $ids->{$id}) {
    print OUT $_,$seq,$id2,$qual;
      }
    }
    else {
      warn "Line '$_' should have been an id line, but wasn't\n";
    }

  }

}


sub read_ids {

  warn "Collecting mapped ids\n";

  my $ids;

  while (<SAM>) {

    next if (/^@/);
    my ($id) = split(/\t/);
    $ids->{$id} = 1;
  }

  return $ids;
}

I tried another script on R, but I can't find it anymore, anyway it would return me the whole fastq, not filtered.

About the statistics: Tophat

Left reads:
          Input     :  17099594
           Mapped   :  15049990 (88.0% of input)
            of these:    878810 ( 5.8%) have multiple alignments (1543 have >20)
Right reads:
          Input     :  17099594
           Mapped   :  15367380 (89.9% of input)
            of these:    873975 ( 5.7%) have multiple alignments (1525 have >20)
88.9% overall read mapping rate.

Aligned pairs:  14559471
     of these:    850411 ( 5.8%) have multiple alignments
                  197722 ( 1.4%) are discordant alignments
84.0% concordant pair alignment rate.

Hisat2

17099594 reads; of these:   17099594 (100.00%) were paired; of these:
    16933415 (99.03%) aligned concordantly 0 times
    1306 (0.01%) aligned concordantly exactly 1 time
    164873 (0.96%) aligned concordantly >1 times
    ----
    16933415 pairs aligned concordantly 0 times; of these:
      14513851 (85.71%) aligned discordantly 1 time
85.85% overall alignment rate

As for the -I and -X they are the information about the fragment length, in tophat I informed the distance is of 60 and the standard deviation if of 50, in hisat it asks for the max and min fragment length, as my reads are of 100bp, the lengths are min 210 and max 310.

ADD REPLYlink written 8 weeks ago by cris.kgs0

Please try to use the add reply button to answer a comment.

As for the -I and -X they are the information about the fragment length

These two parameters do not stand for fragment length but for insert size, which is the distance between a read and its mate

Remove these 2 parameters from your command line and try again

ADD REPLYlink written 8 weeks ago by Bastien Hervé3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2444 users visited in the last hour