Question

Extracting Genomic Coverage Information Across Different Samples

0

Entering edit mode

10.1 years ago

empyrean999 ▴ 180

Hello,

I have 3 bam files that I wanted to compare against each other. For example I have reference file with 10,000 sequences. I have paired end reads sequenced for 3 different samples.

1) Sample 1 is 100% same as reference so we expect all reads to map to it 2) Sample 2 is 80% similar to reference so 20% of reference sequences wont have any reads 3) Sample 3 is 60% similar to reference and 40% of reference wont have any reads.

Now my goal is to identify what reference sequences doesnot have any reads mapped in Sample 2 and 3.I need to identify the 20% reference sequences from Sample 2 and 40% from Sample 3.

Also in some cases in a reference which is approx 10kb long, sample 1 maps to entire 10kb, sample 2 maps to first 5kb and sample 3 maps to last 3kb. so I need to identify the partial regions for those reference sequences as well.

I have the mapped sorted bam files for all these three samples. I am looking in to using bedtools but not sure what in bedtools will give the answer I needed.

I have the following commands which might do similar but it outputs differences at every base.

genomeCoverageBed -bg -ibam sample1.bam > sample1.bedgraph
genomeCoverageBed -bg -ibam sample2.bam > sample2.bedgraph
unionBedGraphs -header -i sample1.bedgraph sample2.bedgraph -names sample1 sample2 -g reference.fai -empty > samples1and2.txt

perl samtools sequencing bedtools • 2.2k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 10.1 years ago by empyrean999 ▴ 180

0

Entering edit mode

The title of your question is incorrect, you are not comparing two bam files. You are trying to find which genomic regions do not have read coverages in various samples. (btw having an incorrect title makes it less likely to get a good answer)

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 10.1 years ago by Istvan Albert 100k

0

Entering edit mode

Any suggestions on my question?

ADD REPLY • link 10.1 years ago by empyrean999 ▴ 180

score 0 · Answer 1 · 2014-03-24

I think the way to solve this is to create sets of covered regions for each sample, then perform set operations as needed.

You can use bedtools intersect -c to produce the number of counts a gene is covered by reads.

Then use sort and use the uniq utility to count how many times does each gene appear.