Question: Extracting Genomic Coverage Information Across Different Samples
gravatar for empyrean999
6.9 years ago by
empyrean999180 wrote:


I have 3 bam files that i wanted to compare against each other. For example i have reference file with 10,000 sequences. I have paired end reads sequenced for 3 different samples.

1) Sample 1 is 100% same as reference so we expect all reads to map to it 2) Sample 2 is 80% similar to reference so 20% of reference sequences wont have any reads 3) Sample 3 is 60% similar to reference and 40% of reference wont have any reads.

Now my goal is to identify what reference sequences doesnot have any reads mapped in Sample 2 and 3.I need to identify the 20% reference sequences from Sample 2 and 40% from Sample 3.

Also in some cases in a reference which is approx 10kb long, sample 1 maps to entire 10kb, sample 2 maps to first 5kb and sample 3 maps to last 3kb. so i need to identify the partial regions for those reference sequences as well.

I have the mapped sorted bam files for all these three samples. I am looking in to using bedtools but not sure what in bedtools will give the answer i needed.

i have the following commands which might do similar but it ouputs differences at every base.

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph

genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph

unionBedGraphs -header -i sample1.bedgraph sample2.bedgraph -names sample1 sample2 -g reference.fai -empty > samples1and2.txt
ADD COMMENTlink modified 6.9 years ago by Istvan Albert ♦♦ 86k • written 6.9 years ago by empyrean999180

The title of your question is incorrect, you are not comparing two bam files. You are trying to find which genomic regions do not have read coverages in various samples. (btw having an incorrect title makes it less likely to get a good answer)

ADD REPLYlink modified 16 months ago by Ram32k • written 6.9 years ago by Istvan Albert ♦♦ 86k

Any suggestions on my question?

ADD REPLYlink written 6.9 years ago by empyrean999180
gravatar for Istvan Albert
6.9 years ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

I think the way to solve this is to create sets of covered regions for each sample, then perform set operations as needed.

You can use bedtools intersect -c to produce the number of counts a gene is covered by reads.

Then use sort and use the uniq utility to count how many times does each gene appear.

ADD COMMENTlink written 6.9 years ago by Istvan Albert ♦♦ 86k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1586 users visited in the last hour