Question: Illumina Pair-End Library De-Novo Assembly - Broken Pairs?
4
gravatar for Martin A Hansen
7.9 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

When trimming paired-end sequences according to quality scores some reads are filtered out because they end up being too short. Thus you may end up with a sequence file where one of a pair is missing. I am looking for information on how different de-novo assemblers handle broken pairs. I am especially interested in Velvet, IDBA, and Ray.

illumina paired denovo assembly • 3.9k views
ADD COMMENTlink written 7.9 years ago by Martin A Hansen3.0k
2
gravatar for Benm
7.9 years ago by
Benm710
Benm710 wrote:

I don't think you need to input broken pairs. You can use this script below to classify them to normal pairs and single ends, then you can shuffle pairs and merge the single ends for velvet input. It is just available for Illumina reads.

#!/usr/bin/perl -w
use strict;

die "perl $0 <IN1:PE1.fastq> <IN2:PE2.fastq> <OUT1:out.PE1.fastq> <OUT2:out.PE2.fastq>\n" if (@ARGV!=4);

#@ILLUMINA-57021F:7:1:1020:11315#0/1
my ($PE1,$PE2,$out1,$out2)=@ARGV;
my %hash;
open (IN,$PE1) || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        $hash{$1}=$_;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
        $hash{$1}.=<IN>;
    }
}
close IN;

open (IN,$PE2) || die $!;
open (OUT1,">$out1") || die $!;
open (OUT2,">$out2") || die $!;
open (SE1,">$out1.single") || die $!;
open (SE2,">$out2.single") || die $!;
while(<IN>)
{
    if (/^\@(\S+)\#\S+\/[12]/)
    {
        if (exists $hash{$1})
        {
            print OUT1 $hash{$1};
            undef $hash{$1};
            delete $hash{$1};
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
            $_=<IN>;
            print OUT2 $_;
        }
        else
        {
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
            $_=<IN>;
            print SE2 $_;
        }
    }
}
foreach my $key(keys %hash)
{
    if ((defined $hash{$key})&&($hash{$key} ne ""))
    {
        print SE1 $hash{$key};
    }
}

close IN;
close OUT1;
close OUT2;
close SE1;
close SE2;
ADD COMMENTlink written 7.9 years ago by Benm710
1

I don't think Velvet is a robot, and it doesn't need to check this for your input, because it has already defined multiple input parameter, this problem should be dealt with before in data preprocessing. It won't tell you "oh, this is broken - let ... brabrabra" by current version as I know.

ADD REPLYlink written 7.9 years ago by Benm710
1

Hi, I use your script, but i found that this script use huge memory, I have two library, 8.3G and 7.8G, but the memory requirement seems to be larger than 128G.

ADD REPLYlink written 7.2 years ago by Plantae380

I know that you can input multiple libraries to Velvet. But I was wondering how Velvet (and the other assemblers) treat broken reads if supplied as paired-end reads? It would make everything a lot simpler if these assemblers automagically used the unbroken paired-end reads as paired-end reads while broken pairs were used as single end reads. I suspect this could be the case - but I fear that encountering broken pais may trigger a "oh, this is broken - lets make a single-end assembly only" response.

ADD REPLYlink written 7.9 years ago by Martin A Hansen3.0k

I'd be interested to know how BWA handles these scenarios also.

ADD REPLYlink written 7.9 years ago by Travis2.8k

The safest way would be to split the data into unbroken an broken pairs beforehand. Velvet supports explicit loading paired and unpaired reads, but IDBA and Ray don't. I'd say that a clever assembler reports the number of paired reads and broken pairs loaded - and assemble away according to this.

ADD REPLYlink written 7.9 years ago by Martin A Hansen3.0k

As Plantae already mentioned, I don't think that your script will work with most of the NGS data due to memory problem. I myself tried your script and it exceed 256GB available for our machine.

ADD REPLYlink written 7.0 years ago by Neil McCauley0
0
gravatar for Lhl
7.4 years ago by
Lhl730
United States
Lhl730 wrote:

"Note that it is not necessary to ensure that reads appear only once or that every read in the file is paired. Velvet detects which reads are unpaired and handles them as such." I read this in this paper by the author of velvet -- http://www.ncbi.nlm.nih.gov/pubmed/20836074

ADD COMMENTlink written 7.4 years ago by Lhl730
1

Yes, but that bit you quoted appears to refer specifically to providing reads in a SAM/BAM file, not FASTA/Q. The paragraph immediately following that one says, "Velvet requires that paired-end FASTA and FASTQ datasets come in a single merged file where each read is paired with the one directly above or the one directly below."

ADD REPLYlink written 7.4 years ago by Kmcarr10

Yes, Kmcarr. You are write i wrote to the author and he responded and said it is necessary to distinguish broken pairs from un-broken ones.

ADD REPLYlink written 7.4 years ago by Lhl730

Yes, Kmcarr. You are right. i wrote to the author and he responded and said it is necessary to distinguish broken pairs from paired ones.

ADD REPLYlink written 7.4 years ago by Lhl730
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 689 users visited in the last hour