Question: Extract Reads From A Bam File That Fall Within A Given Region
11
gravatar for abi
5.4 years ago by
abi180
abi180 wrote:

Hi all

This question could be considered as a follow up of this discussion. http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 What I need is to extract reads from bam file that fall only within a given region (not overlap the given region), the region being in the form of a gff file or bed file. Overlapping reads could be extracted by several methods (as in the discussion mentioned or BEDTools). The idea is to try to be pretty sure of excluding 5´ UTRs in the process of detecting intergenic transcripts. I saw a tool in BamUtil (http://genome.sph.umich.edu/wiki/BamUtil) called "writeRegion" which would pretty much do what I want. Somehow could not get this running for my dataset. Was wondering if you guys might have an "R" or some other solution for this. Thanks in advance Abi

bam • 31k views
ADD COMMENTlink written 5.4 years ago by abi180

I think the options have changed since this reply was written if you want the default headerless sam file then use the command as per Damian Kao

samtools view input.bam "Chr10:18000-45500" > output.sam

if it's a bam output that you want you will need a '-b'

samtools view -b input.bam "Chr10:18000-45500" > output.bam
ADD REPLYlink written 4 months ago by KevinL0
13
gravatar for Damian Kao
5.4 years ago by
Damian Kao14k
USA
Damian Kao14k wrote:

You can extract mappings of a sam/bam file by reference and region with samtools. For example:

samtools view input.bam "Chr10:18000-45500" > output.bam

That would output all reads in Chr10 between 18000-45500 bp.

ADD COMMENTlink modified 5.4 years ago • written 5.4 years ago by Damian Kao14k

Thanks Dk for your answer, in your example the output will be all reads starting anywhere between 18000-45500. What I want is to get out only the reads which also end in this region. For example starting at 18000 or higher and ending at 45500 or lower. Abi

ADD REPLYlink written 5.4 years ago by abi180

That is confusing. Would you please explain little bit more, which reads do you want in output?

ADD REPLYlink written 5.4 years ago by Vikas Bansal2.3k

I am sorry if I am confusing you guys. Lets say I have a 4 reads with the following start and end coordinates aligned on to the chromosome.

start coordinates of the reads: 900, 1000, 1100, 1900

end coordinates of the reads: 1200, 1500,1700, 2300

My region of interest is 1000:2000

I want my output to be in this case the second and third reads starting and ending within my region of interest.

The first and last reads should be excluded because even though they overlap my region of interest, they extend away from my region (1000:2000). I hope my question is clearer now. Abi

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by abi180

do you have spliced reads as well?

ADD REPLYlink written 5.4 years ago by Arun2.2k

he wants the read with overlaps "any" at the left-end and "within" at the right end. If you consider the position 18000-45000, he needs all reads with overlap of at least 1 base to 18000, but all reads should end less than or equal to 45000.

ADD REPLYlink written 5.4 years ago by Arun2.2k

Yep Arun thats what I want and BTW, I do not have spliced reads!

ADD REPLYlink written 5.4 years ago by abi180
2

abi, this should be possible with R and IRanges. For loading a bam file, refer to this post. If you have questions about using IRanges, let me know. I'll post an example.

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Arun2.2k

hi kao , can u pls tell me how to index this bam file because wen I tried to make index it showed error like: Bam header is absent. Also can u suggest any tool r the way to merge the overlapping region in bam file and make into single fasta file.

ADD REPLYlink written 19 months ago by Bioblazer30

Use "index" from samtools, described here.

samtools index test.bam
ADD REPLYlink written 18 months ago by Tom Koch100

How can you randomly extract reads from two bam files to simulate sample contaminations? I am trying to simulate sample contamination for different level of dilution for NGS samples. Suppose I have two bam files for SampleA and SampleB. I want to generate 5 contaminated samples at dilution of 10%, 20%, 30%,40% and 50% of those two samples. I understand that I should extract reads from one of the two bam files at the given dilution percentage and reassign to the other bam file, but I don't know exactly how to do this. Could you please explain me the procedure? Thanks

ADD REPLYlink modified 17 months ago • written 17 months ago by MAPK1.1k
5
gravatar for abi
5.4 years ago by
abi180
abi180 wrote:

Thanks guys, Actually the answer was there in Michael´s explanation in http://www.biostars.org/post/show/3407/how-to-extract-reads-from-bam-that-overlap-with-specific-regions/#3414 In his example, countOverlaps(genes, reads, type = "within") would do what I want. If type is within, the query interval must be wholly contained within the subject interval.

Abi

ADD COMMENTlink written 5.4 years ago by abi180

yes, that's right. This is what I thought you'd want to go and look up and come back with questions. Seems the usage is already shown here. Nice!

ADD REPLYlink written 5.4 years ago by Arun2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 958 users visited in the last hour