Pull all unmapped reads from BAM file?
1
0
Entering edit mode
8.8 years ago
dnlsmy ▴ 10

Hi,

I'm looking to get all the unmapped regions from a sequenced BAM file. The purpose of this is that I'm trying to find genetic recombination of a gene/region, so I'm guessing that a successful recombination would result in the read being marked as unmapped "trash" and that I could search through the unmapped regions and find it. The basic idea is

Ref Genome w/ Mapped Reads
AAA*****************************BBB
Unmapped Read
AAA**BBB

A couple of problems I'm having is that while I can use

samtools view -b -f 4 ORIGINAL.bam > UNMAPPED.bam

to presumably get all the unmapped regions in the bam file, I am unsure on how to go about searching through the reads individually and looking for recombination (where AAA and BBB would be closer than normal, present on the same read). I have a java program to take in reads as strings and identify recombination if the AAA and BBB regions are sufficiently close, recording the distance as well.

I'm basically stuck on extracting the reads, and also I've been wondering if this is the best way to go about finding recombination. I haven't been in the field long enough but I would assume that this or a similar workflow would be pretty common.

Thanks!

samtools • 3.6k views
ADD COMMENT
0
Entering edit mode

How divergent are the sequences that you suspect to be recombining? If divergence is low (below 10% identity), the read will map randomly between them.

ADD REPLY
0
Entering edit mode

I'm looking at VDJ recombination specifically, so I'm guessing the divergence would be high for the DJ sequence to move very close to the V region. I guess I could pull the mapped regions as well to make sure I get the reads that have low divergence but what I'm really having a hard time doing is getting the unmapped regions.

ADD REPLY
0
Entering edit mode

dnlsmy Have you tried running sensitive SV detection software?

ADD REPLY
0
Entering edit mode

I have not, would recombination be detected using that software? My main worry is that since the BAM file is already sequenced, recombination of VDJ regions would not be able to be mapped on the reference sequence (since the regions would be much larger than a simple SNV). My understanding is that in that case, the BAM file would store those reads as unmapped and searching through those for the V regions would allow me to check the degree of recombination (by checking to see how far along the D/J regions were upstream from the read).

ADD REPLY
2
Entering edit mode

If you mapped with BWA mem, there is a good chance these reads have split read mappings. Here is a little test to see if your hypothesis is correct.

samtools view -f 4 your.bam vdjregion | wc
samtools view your.bam vdjregion | wc
samtools view -f 4 your.bam non-vdj-region | wc
samtools view your.bam non-vdj-region | wc

This will tell you if there are extra unmapped reads in the regions.

HINT: UNMAPPED READS ARE ASSIGNED A POSTION and will be returned in the region.

getting the split reads:

samtools view your.bam region | grep "SA:"
ADD REPLY
0
Entering edit mode

Yes our lab develops SV software and we always have to filter out Ab parts and TCRs for our blood samples.

This is true for an SV caller which is sensitive for NAHR and another which is more sensitive to NHEJ.

https://en.wikipedia.org/wiki/Structural_variation#Software has a decent list however I do recommend Lumpy too.

SV callers are full of noise and the raw results are difficult to interpret, however for your question it seems that an SV caller will help you get an idea of probable sites of VDJ recombination.

Delly and Lumpy are two SV callers that can detect complex breaks and translocations. Delly was heavily used in the upcoming 1000 genomes phase 3 SV paper.

ADD REPLY
3
Entering edit mode
8.8 years ago

As Zev.Kronenberg mentioned before:

If you want to test recombination, your "unmapped" reads are actually mapped. You want to parse out split reads (reads with secondary alignments)

Meaning that the first 50bp of your read maps to chr1:500 and the second half of the read maps to chr2:6000

Finding these reads depends on the version of BWA you use, I'm not sure which version you need though. But you can tell if you have split reads by using the method Zev mentioned.

$ samtools view in.bam chr1:100-1000 | grep "SA:"

However, expect a lot of noise in your data!

You may want to limit the informative reads to where they map. For CNVs you would want to make sure the secondary alignment maps to the same chromosome. I also find that a mapping quality filter of 20 removes a lot of noise, but depends on your experiment.

http://samtools.github.io/hts-specs/SAMv1.pdf ( search for "Chimeric alignment")

Summary:

SA:Z:5,-111391793,91M34S,45,0

Indicates a secondary alignment, it is mapped to chr5 on the minus strand at position 111,391,793 with a mapping quality of 45; 91M34S is the cigar string and you can see that 34 nucleotides have a soft clip and align to another locus.

This is the command I use to get the counts. I am interested in CNVs and the first window represents the probable start site and the second the end window.

$ samtools view -q 20 in.bam 5:111391382-111392382 5:111397620-111398620 | sort | uniq | wc -l
ADD COMMENT

Login before adding your answer.

Traffic: 2888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6