Question

Should I Remove The Unmapped Reads From My Bam ?

12

12.1 years ago

Pierre Lindenbaum 161k

Is it good practice to remove all the unmapped reads with:

samtools view -F 4

after they've been mapped with bwa? The BAM files would be smaller and the remaining operations would be faster, wouldn't they?

Or will I regret it later?

samtools next-gen sequencing bam short • 13k views

ADD COMMENT • link updated 6.6 years ago by Karma ▴ 310 • written 12.1 years ago by Pierre Lindenbaum 161k

6

GATK realigner will use some unmapped reads when doing local realignment.

ADD REPLY • link 12.1 years ago by lh3 33k

0

What is the final answer for this question? For variant calling analysis should we remove unmapped regions in the .bam file?

ADD REPLY • link 6.6 years ago by Karma ▴ 310

0

For one, future algorithms may do a better job and allow you to recover some of that data. For another, future genome builds may resolve poorly assembled regions and allow additional reads to be mapped.

For variant calling analysis should we remove unmapped regions in the .bam file?

no

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

0

Why? Is there any use for the unmapped reads in the BAM file during variant calling analysis?

ADD REPLY • link 6.6 years ago by Karma ▴ 310

1

GATK realigner will use some unmapped reads when doing local realignment.

ADD REPLY • link 6.6 years ago by Pierre Lindenbaum 161k

4

12.1 years ago

Christof Winter ★ 1.0k

To save space, you could as well delete your FASTQ files instead, and keep the BAM file with the unaligned reads. Now that would make Peter happy, wouldn't it?

ADD COMMENT • link 12.1 years ago by Christof Winter ★ 1.0k

0

I'd ask him about his backup strategy before deleting any raw data ;)

ADD REPLY • link 12.1 years ago by Peter 6.0k

2

12.1 years ago

Zev.Kronenberg 12k

There is a tradeoff. If you want to call variants, getting rid of the unaligned reads makes things go a bit faster. However, having all the reads in one place is very convenient if you ever need to go back to a project.

I usually discard the unaligned reads since dedup will throw away a bunch of reads one step downstream.

ADD COMMENT • link 12.1 years ago by Zev.Kronenberg 12k

1

To hoard or not to hoard, that is the question :)

ADD REPLY • link 12.1 years ago by Zev.Kronenberg 12k

1

12.1 years ago

toni ★ 2.2k

Another option is to build a wrapper around the bwa sampe step. As this step generates the SAM output, check the flag on the fly and split the records into several files, such as a 'bam' and an 'unmapped.bam'. In Perl, it goes roughly like this:

use strict;
use warnings;
use Carp qw(croak);

my $command = 'bwa sampe -P -s ref.fa sai1 sai2 fq1 fq2';
my $pid = open my $COM, '-|', $command
    or croak "Could not exec $command : $!";

my ($read1, $read2);  # The two SAM records of the current pair

# Splitting output stream between several files
while ( my $read = <$COM> ) {
    chomp $read;
    next if $read =~ m!^\@!;  # Skip header lines

    if ($read1) {
        # Second read in a pair
        $read2 = $read;

        # Process your read1 and read2 and split between several files
        # if you want. For instance pairs for which there is at least
        # one read mapped on one side, and unmapped pairs on the other
        # side. (By checking the flag)

        ($read1, $read2) = (undef, undef);  # Move to next pair
    }
    else {
        # First read in a pair
        $read1 = $read;
    }
}

close $COM or croak 'Failed to close command : ' . $command;

This way you keep all the reads of your sample (in several files) but you can process only the interesting reads if you want to.
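
If Perl is not a requirement, the same split can be sketched in plain shell with awk. This is only a sketch, assuming a pair counts as unmapped when both the read-unmapped (0x4) and mate-unmapped (0x8) bits are set in the FLAG column. The function name split_sam and the output file names are made up for illustration. Because both records of a fully unmapped pair carry both bits, each line can be routed on its own without buffering the pair.

```shell
#!/bin/sh
# split_sam MAPPED_OUT UNMAPPED_OUT   (hypothetical helper, reads SAM text on stdin)
# Pairs where both ends are unmapped go to UNMAPPED_OUT, everything else
# (including pairs with only one mapped end) goes to MAPPED_OUT.
split_sam() {
    awk -F '\t' -v mapped="$1" -v unmapped="$2" '
        /^@/ {                                   # header line: copy to both files
            print > mapped
            print > unmapped
            next
        }
        {
            flag = $2
            read_unmapped = int(flag / 4) % 2    # bit 0x4: this read is unmapped
            mate_unmapped = int(flag / 8) % 2    # bit 0x8: its mate is unmapped
            if (read_unmapped && mate_unmapped)
                print > unmapped
            else
                print > mapped
        }
    '
}

# Possible usage (file names are placeholders):
#   bwa sampe -P -s ref.fa sai1 sai2 fq1 fq2 | split_sam mapped_pairs.sam unmapped_pairs.sam
```

Both outputs are still SAM text, so they would need to be converted and sorted afterwards like any other bwa output.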

ADD COMMENT • link 12.1 years ago by toni ★ 2.2k

1

10.1 years ago

johnblue81 ▴ 50

You can also use the remapper of segemehl and try to map the unmapped reads.

I found a manual and a presentation about the remapper on the internet:

Example of how to use it: http://www.bioinf.uni-leipzig.de/supplements/13-008
Presentation: http://www.molgen.mpg.de/919530/bernhartows.pdf

ADD COMMENT • link 10.1 years ago by johnblue81 ▴ 50

0

10.8 years ago

Toni ▴ 10

In order to remove all the unmapped reads, shouldn't we use the command below?

samtools view -F12 samfile
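
For what it's worth, the two masks are not equivalent. 12 is 4 (read unmapped) plus 8 (mate unmapped), and -F drops a record if any bit of the mask is set, so -F 12 also drops mapped reads whose mate is unmapped, while -F 4 keeps them. A small sketch of that bit test in plain shell (the helper name keeps is made up, it is not part of samtools):

```shell
#!/bin/sh
# keeps FLAG MASK   (hypothetical helper)
# Mimics the test behind "samtools view -F MASK": a record is dropped
# when ANY bit of MASK is set in its FLAG, and kept otherwise.
keeps() {
    flag=$1
    mask=$2
    if [ $(( flag & mask )) -eq 0 ]; then
        echo keep
    else
        echo drop
    fi
}

keeps 73 4     # keep: 73 = 1 + 8 + 64, the read itself is mapped
keeps 73 12    # drop: its mate is unmapped (bit 8)
keeps 77 4     # drop: 77 = 1 + 4 + 8 + 64, the read is unmapped
keeps 99 12    # keep: 99 = 1 + 2 + 32 + 64, both ends are mapped
```

So -F 4 removes exactly the unmapped reads, and -F 12 keeps only reads from pairs where both ends are mapped.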

ADD COMMENT • link 10.8 years ago by Toni ▴ 10

0

Some tools, like Picard MarkDuplicates, remove the duplicate reads at the same time.

ADD REPLY • link 10.8 years ago by Pierre Lindenbaum 161k

score 15 · Accepted Answer · 2012-03-05

15

12.1 years ago

Chris Miller 22k

To future-proof your data, it seems reasonable to hold on to the unmapped reads. For one, future algorithms may do a better job and allow you to recover some of that data. For another, future genome builds may resolve poorly assembled regions and allow additional reads to be mapped.

Neither of these improvements is likely to enable huge discoveries, but the cost you're paying in storage is pretty minimal, compared to the costs of sample collection and sequencing. The speed hit probably isn't as bad as you think either, since the bam is indexed. Smart algorithms will make use of that information and not even have to consider those unmapped reads.
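
To see how much of a given file the unmapped reads really account for, the per-reference counts reported by samtools idxstats can be summed. A rough sketch, assuming the usual four tab-separated columns (reference name, length, mapped count, unmapped count, with a final "*" line for reads that have no coordinates); the function name unmapped_fraction and the file name in the usage comment are placeholders:

```shell
#!/bin/sh
# unmapped_fraction   (hypothetical helper, reads "samtools idxstats" output on stdin)
# Prints unmapped / (mapped + unmapped), or NA when the input has no reads.
unmapped_fraction() {
    awk -F '\t' '
        { mapped += $3; unmapped += $4 }         # sum columns 3 and 4 over all lines
        END {
            total = mapped + unmapped
            if (total > 0)
                printf "%.4f\n", unmapped / total
            else
                print "NA"
        }
    '
}

# Possible usage on an indexed BAM (file name is a placeholder):
#   samtools idxstats sample.bam | unmapped_fraction
```

If the fraction is small, dropping those reads will not buy much in either disk space or run time.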

ADD COMMENT • link 12.1 years ago by Chris Miller 22k

0

Heng's point above is good too. Some indel/SV algorithms create new contigs of the altered sequence and do local realignment. If you toss your unmapped reads, you're losing all that incredibly useful info.

ADD REPLY • link 12.1 years ago by Chris Miller 22k