Question: Should I Remove The Unmapped Reads From My Bam ?
7
gravatar for Pierre Lindenbaum
5.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

is it a good practice to remove all the unmapped reads with:

samtools view -F 4

after they've been mapped with bwa ? The bam files would be smaller and the remaining operations would be faster isn't it ?

or shall I regret it later ?

ADD COMMENTlink modified 12 weeks ago by Snijesh VP180 • written 5.8 years ago by Pierre Lindenbaum102k
6

GATK realigner will use some unmapped reads when doing local realignment.

ADD REPLYlink written 5.8 years ago by lh330k

What is the final answer for this question? For variant calling analysis should we remove unmapped regions in the .bam file?

ADD REPLYlink written 12 weeks ago by Snijesh VP180

. For one, future algorithms may do a better job and allow you to recover some of that data. For another, future genome builds may resolve poorly assembled regions and allow additional reads to be mapped

For variant calling analysis should we remove unmapped regions in the .bam file?

no

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum102k

why? Is there any use of unmapped bam file during variant calling analysis?

ADD REPLYlink written 12 weeks ago by Snijesh VP180
1

GATK realigner will use some unmapped reads when doing local realignment.

ADD REPLYlink written 12 weeks ago by Pierre Lindenbaum102k
10
gravatar for Chris Miller
5.8 years ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

To future-proof your data, it seems reasonable to hold on to the unmapped reads. For one, future algorithms may do a better job and allow you to recover some of that data. For another, future genome builds may resolve poorly assembled regions and allow additional reads to be mapped.

Neither of these improvements is likely to enable huge discoveries, but the cost you're paying in storage is pretty minimal, compared to the costs of sample collection and sequencing. The speed hit probably isn't as bad as you think either, since the bam is indexed. Smart algorithms will make use of that information and not even have to consider those unmapped reads.

ADD COMMENTlink written 5.8 years ago by Chris Miller19k

Heng's point above is good too. Some indel/SV algorithms create new contigs of the altered sequence and do local realignment. If you toss your mapped reads, you're losing all that incredibly useful info.

ADD REPLYlink written 5.8 years ago by Chris Miller19k
2
gravatar for Zev.Kronenberg
5.8 years ago by
United States
Zev.Kronenberg11k wrote:

There is a tradeoff. If you want to call variants getting rid of the unaligned reads mades things go a bit faster. However, having all the reads in one place is very convient if you ever need to go back to a project.

I usually discard the unaligned reads since dedup will throw away a bunch of reads one step downstream.

ADD COMMENTlink written 5.8 years ago by Zev.Kronenberg11k
1

To horde or not to horde that is the question :)

ADD REPLYlink written 5.8 years ago by Zev.Kronenberg11k
2
gravatar for Christof Winter
5.8 years ago by
Lund, Sweden
Christof Winter890 wrote:

To save space, you could as well delete your FASTQ files instead, and keep the BAM file with the unaligned reads. Now that would make Peter happy, wouldn't it?

ADD COMMENTlink written 5.8 years ago by Christof Winter890

I'd ask him about his backup strategy before deleting any raw data ;)

ADD REPLYlink written 5.8 years ago by Peter5.6k
1
gravatar for toni
5.8 years ago by
toni2.1k
Lyon
toni2.1k wrote:

What you can do also is to build a wrapper around the bwa sampe step. When this step generates the SAM file, check the flag on the fly and split into several files like a 'bam' and an 'unmapped.bam'. In Perl, it goes roughly like this :

my $command = 'bwa sampe -P -s ref.fa sai1 sai2 fq1 fq2'
my $pid = open my $COM, '-|', $command
    or croak "Could not exec $command : $!";

# Splitting output stream between several files
while( my $read = <$COM> ) {
    chomp $read;
    next if($read =~ m!^\@!); # Skip header lines

    if($read1) {
        # Second read in a pair
        $read2 = $read;

            # Process your read1 and read2 and split between several files
            # if you want. For instance pairs for which there is at least
            # one read mapped on one side, and unmapped pairs on the other
            # side. (By checking the flag)

        ($read1, $read2) = (undef,undef);  # Move to next pair
    }
    else {
        # First read in a pair
        $read1 = $read;
    }
}

close $COM or croak 'Failed to close command : ' . $command;

This way you keep all the reads of your sample (in several files) but you can process only the interesting reads if you want to.

ADD COMMENTlink written 5.8 years ago by toni2.1k
1
gravatar for johnblue81
3.7 years ago by
johnblue8150
johnblue8150 wrote:

You can also use the remapper of segemehl and try to map the unmapped reads.

I found a manual and a presentation about the remapper in the internet:

ADD COMMENTlink written 3.7 years ago by johnblue8150
0
gravatar for Toni
4.5 years ago by
Toni10
Toni10 wrote:

In order to remove all the unmapped reads, shouldn't we use the above command? :

samtools view -F12 samfile

ADD COMMENTlink modified 4.4 years ago • written 4.5 years ago by Toni10

some tools like picard MardDuplicates removes the dup reads at the same time.

ADD REPLYlink written 4.5 years ago by Pierre Lindenbaum102k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1496 users visited in the last hour