Question

How to extract all reads but concordantly paired from bam file?

0

Entering edit mode

7.5 years ago

valerie ▴ 100

Hi guys,

I have bam files for reads mapped to reference genome. I want to find all reads that correspond to the sequences that are not presented in the reference genome. As far as I understand, these are all reads that are either not mapped to the reference or mapped not concordantly. I found here http://broadinstitute.github.io/picard/explain-flags.html that my FLAG should be 12 (read unmapped, mate unmapped), could someone please tell me whether this is correct?

Thanks!

alignment sequence samtools • 1.9k views

ADD COMMENT • link updated 7.5 years ago by d-cameron ★ 2.9k • written 7.5 years ago by valerie ▴ 100

score 1 · Answer 1 · 2016-11-06

1

Entering edit mode

7.5 years ago

d-cameron ★ 2.9k

all reads that correspond to the sequences that are not presented in the reference genome ... (read unmapped, mate unmapped)

Are not the same. It is possible to have a read pair in which only one read maps to the reference. What would you like to do with those?

If you just want unmapped reads then flag 4 (read unmapped) will extract all the reads not aligned to the reference which sounds like your initial requirement.

NB: If you want to extract all reads pair in which at least one of the reads contains sequence not found in the reference then it's more complicated as not only as you need to extract the read pairs in which only one read in the pair maps, you'd also need to extract partially aligned ("soft clipped") reads and insertions (although, in general, you can't tell whether a small insertion is reference sequence due to the short length) based on the CIGAR string of the alignment.

ADD COMMENT • link 7.5 years ago by d-cameron ★ 2.9k

0

Entering edit mode

You are right, I want all read pair in which at least one of the reads is not mapped. Maybe the easiest way is to generate SAM file and filter the reads with flags 83,89,147,163, that correspond to situations when both reads are mapped? Is it correct strategy? I am looking for long sequences only, that are not presented in genome, so I believe it is ok not to take CIGAR into account.

ADD REPLY • link 7.5 years ago by valerie ▴ 100

1

Entering edit mode

In that case, something like the following should work:

samtools merge \
    <(samtools view -u -f 4 -F 8 in.bam) \
    <(samtools view -u -F 4 -f 8 in.bam) \
    <(samtools view -u -f 12 in.bam)

It extracts the reads with only the read mapped, then only then mate mapped, then neither of them mapped and merges then back into a single output.

Edit:

samtools view -b -o /dev/null -U unmappped.bam -F 12 in.bam

Is a simpler solution. It selects all reads in which both are mapped, discards the output, then writes the remaining reads to unmapped.bam

ADD REPLY • link 7.5 years ago by d-cameron ★ 2.9k