Question

Counting The Whole Insert Size From Paired-End Reads As Coverage

2

12.1 years ago

Alastair Kerr 5.3k

We have updated our workflows for per-base sequence coverage to use genomeCoverageBed on BAM files. However, for paired-end data it seems that the regions between the paired reads are not counted.

To be clear, I am not talking about using -split to avoid counting introns within a single read of a pair. Instead, I want to count the probable whole insert when the insert size is greater than the combined length of the paired reads.

We've also looked at using IRanges from Bioconductor, but cannot tell whether it would do what we want.

Is there a hidden flag in genomeCoverageBed to count the whole insert as coverage, not just the sequenced ends? Is there another program out there that would work on BAM files?

I know I can alter the SAM file before BAM conversion, but this seems like something that should already be implemented somewhere.

paired bedtools • 7.1k views

ADD COMMENT • link updated 2.0 years ago by Ram 43k • written 12.1 years ago by Alastair Kerr 5.3k

2

@Sean Davis - this measures the "physical" (i.e., not the "nucleotide") coverage of a sample genome. It is often used to measure the ability to detect structural variant breakpoints by paired-end mapping (PEM), as PEM relies upon the breakpoint lying in the unsequenced interstice between the sequenced ends.

ADD REPLY • link 12.1 years ago by Aaronquinlan 12k

0

Why would you want to know this? What is the use case for having this information?

ADD REPLY • link 12.1 years ago by Sean Davis 26k

0

Thanks Aaron. I see. We haven't quantified things this way, but it makes perfect sense to want to do so.

ADD REPLY • link 12.1 years ago by Sean Davis 26k

0

In a word: MeDIP-seq. Nucleosomes are ~147 bp, while early NGS reads were 33 bp. If you optimise for the whole nucleosome, then chromatin rearrangements are more obvious. There are other associated analyses as well.

ADD REPLY • link 12.1 years ago by Alastair Kerr 5.3k

0

Did you come up with a final solution to this problem? I realize that GATK seems to have a walker for this (http://gatkforums.broadinstitute.org/discussion/1494/computereadspancoverage), but in the latest version 3.3.0 I downloaded from the Broad Web site it does not recognize the command -T ComputeReadSpanCoverage.

ADD REPLY • link updated 2.0 years ago by Ram 43k • written 9.2 years ago by Christian ★ 3.0k

score 5 · Answer 1 · 2012-03-07

5

12.1 years ago

Aaronquinlan 12k

It's an admittedly imperfect solution, but if you sort your BAM file by query name, you can convert the pairs into bedtools' BEDPE format, whose first six columns are the chrom, start and end for each end of the pair (chrom1 start1 end1 chrom2 start2 end2).

In this format, you can use the chrom, start1 and end2 columns to define a simple BED format representing the full span of the pair. Something like this:

bedtools bamtobed -i aln.qsort.bam -bedpe | \
   cut -f 1,2,6 | \
   bedtools genomecov -i - -g chrom.sizes \
   > aln.span.cov

ADD COMMENT • link 12.1 years ago by Aaronquinlan 12k

0

This looks promising, thanks.

ADD REPLY • link 12.1 years ago by Alastair Kerr 5.3k

0

One obvious note - this works only for intrachromosomal pairs. You may want to create a separate file of the nucleotide coverage for interchromosomal pairs.

ADD REPLY • link 12.1 years ago by Aaronquinlan 12k
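
The split suggested in the note above could be sketched roughly as follows. This is only an illustration, not a bedtools feature: the helper name `split_bedpe` and the output file names are hypothetical, and it assumes BEDPE input as written by `bedtools bamtobed -bedpe`, with unmapped ends carrying `.` in the chromosome column.

```shell
# Hypothetical helper (not part of bedtools): read BEDPE on stdin and write
#   $1 : one BED interval per same-chromosome pair, covering the full span
#   $2 : the two sequenced ends of each different-chromosome pair
split_bedpe() {
  awk -v span="$1" -v ends="$2" 'BEGIN { FS = OFS = "\t" }
    # Assumption: an unmapped end has "." as its chromosome; skip such pairs.
    $1 == "." || $4 == "." { next }
    # Same chromosome: span from the smaller start to the larger end.
    $1 == $4 {
      s = ($2 + 0 < $5 + 0) ? $2 : $5
      e = ($3 + 0 > $6 + 0) ? $3 : $6
      print $1, s, e > span
      next
    }
    # Different chromosomes: keep each end as its own interval.
    {
      print $1, $2, $3 > ends
      print $4, $5, $6 > ends
    }'
}
```

Each output file would then be sorted by chromosome and passed to `bedtools genomecov -i` separately, giving physical coverage for the first file and nucleotide coverage for the second.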

1

I think you forgot a sort -k1,1 in there after the cut.

And just for any other readers: The samtools command to sort bam files by name is samtools sort -n infile outfile

ADD REPLY • link 11.4 years ago by Maximilian Haeussler ★ 1.6k
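
Putting the answer and the comments above together, the whole pipeline might look something like the sketch below. The function names `sorted_spans` and `span_coverage` and the derived file name are placeholders introduced here. The `samtools sort -n -o` form assumes samtools 1.x (the `infile outfile` form quoted above is the older syntax), and the sketch assumes every pair maps within a single chromosome.

```shell
# Reduce BEDPE (stdin) to chrom, start1, end2 and group the intervals by
# chromosome, which `bedtools genomecov -i` expects.
# Assumption: all pairs are intrachromosomal, so end2 is on the same chrom.
sorted_spans() {
  cut -f 1,2,6 | sort -k1,1 -k2,2n
}

# Sketch of the full pipeline. $1 = coordinate- or unsorted BAM,
# $2 = genome file (chrom<TAB>size), as used by `bedtools genomecov -g`.
span_coverage() {
  bam=$1
  genome=$2
  qsort="${bam%.bam}.qsort.bam"   # placeholder name for the name-sorted BAM

  # Name-sort so that both ends of a pair are adjacent (samtools 1.x syntax).
  samtools sort -n -o "$qsort" "$bam" &&
    bedtools bamtobed -i "$qsort" -bedpe |
      sorted_spans |
      bedtools genomecov -i - -g "$genome"
}
```

A call such as `span_coverage aln.bam chrom.sizes > aln.span.cov` would then write the histogram of physical coverage, matching the output file in the original answer.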

score 1 · Answer 2 · 2012-03-06

1

12.1 years ago

Zev.Kronenberg 12k

Might be useful?

How To Get The "Library Insert Size" Out Of A Sam/Bam File?

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.1 years ago by Zev.Kronenberg 12k

0

Almost, but I do not want summary-style data, just regular coverage extended to the theoretical maximum (the whole insert).

ADD REPLY • link 12.1 years ago by Alastair Kerr 5.3k