Isolating reads from specific region from bam file
1
1
Entering edit mode
4.1 years ago
Sara ▴ 160

I have RNA-seq data and would like to get the reads from specific region for example -/+ 50 reads before and after stop site. do you know how I can get it from bam file

RNA-Seq • 3.7k views
1
Entering edit mode

Hello Sara, You may find some info in this post: C: alignment containing sequences from position a to b

4
Entering edit mode
4.1 years ago
bruce.moran ▴ 870

Easiest way is SAMtools view, you can specify a single region:

samtools view -h <your.bam> chr1:1000-2000


or you can give a BED file (I think in tab-delimited format):

samtools view -h -L <your.bed> <your.bam>


Make sure to use same chromosomal naming (either with 'chr' or without) as in the BAM. You can check with

samtools  view -H <your.bam>

0
Entering edit mode

Hey Bruce, not sure why this wasn't upvoted (I upvoted just now), as it's the best and most concise answer I've seen on this particular question.

1
Entering edit mode

Thanks, I find it particularly useful for IGV or other viewers, instead of trying to load a whole BAM locally.

0
Entering edit mode

will this include only the reads the fully lie in the position? i.e. 900-1200 no, and 1100-1400 yes?

1
Entering edit mode

No, any that overlap the interval. You can, nevertheless, still use the above command and then pipe the output to, e.g., awk, in order to further filter for reads only in each desired interval.

0
Entering edit mode

could please give me an example how to do this? Or a link to something similar?

0
Entering edit mode

There is likely some 'random' script out there that can do this. Perhaps even BEDTools has, these days, some way to do this.

What you need to do is start with the POS field in the BAM/SAM file, which represents the left-most co-ordinate to which your read aligns / maps. Then, you have to parse the CIGAR string to understand how many more bases are involved in the alignment.

...or, if all of your reads are 50bp and your interval of interest is 100bp, then you just need to keep any read whose POS value starts between base 1 and 50.