Counting The Whole Insert Size From Paired-End Reads As Coverage
2
2
Entering edit mode
12.1 years ago

We have updated our workflows for per base sequence coverage to use genomeCoverageBed from BAM files. However for pair-end data it seems as though the regions between pair-end reads are not counted.

To be clear I am not talking about using -split for not counting introns in a single read of a paired-end, instead I am looking to count the probable whole insert when the insert size is greater than the combined read length of the paired reads.

We've looked at using iRanges from BioConductor as well but cannot tell if this would do what we want.

Is there is hidden flag in genomeCoverageBed to count the whole insert as coverage, not just the sequenced ends? Is there another program out there what would work on BAM files?

I know I can alter the SAM file before BAM conversion but this seems like something that should be coded somewhere already.

paired bedtools • 7.1k views
ADD COMMENT
2
Entering edit mode

@Sean Davis - this measures the "physical" (i.e., not the "nucleotide") coverage of a sample genome. It is often used to measure the ability to detect structural variant breakpoints by paired-end mapping, as PREM relies upon the breakpoint lying in the unsequenced interstice between the sequenced ends.

ADD REPLY
0
Entering edit mode

Why would you want to know this? What is the use case for having this information?

ADD REPLY
0
Entering edit mode

Thanks Aaron. I see. We haven't quantified things this way, but it makes perfect sense to want to do so.

ADD REPLY
0
Entering edit mode

In a word: MeDIP-seq. Nucleosomes are ~147bp. Early NGS = 33bp. If you optimise for the whole nucleosome then chromatin rearrangements are more obvious. There are other associated analysis as well.

ADD REPLY
0
Entering edit mode

Did you come up with a final solution to this problem? I realize that GATK seems to have a walker for this (http://gatkforums.broadinstitute.org/discussion/1494/computereadspancoverage), but in the latest version 3.3.0 I downloaded from the Broad Web site it does not recognize the command -T ComputeReadSpanCoverage.

ADD REPLY
5
Entering edit mode
12.1 years ago

It's an admittedly imperfect solution, but if you sort your BAM file by query name, you can convert the pairs into bedtools' BEDPE format, whose first six columns are the chrom, start and end for each end of the pair (chr1 start1 end1 chrom2 start2 end2).

In this format, you can use the chrom, start1 and end2 columns to define a simple BED format representing the full span of the pair. Something like this:

bedtools bamtobed -i aln.qsort.bam -bedpe | \
   cut -f 1,2,6 | \
   bedtools genomecov -i - -g chrom.sizes \
   > aln.span.cov
ADD COMMENT
0
Entering edit mode

this looks promising thanks

ADD REPLY
0
Entering edit mode

One obvious note - this works for intrachromosomal pairs. You may want to create a separate file of the nucleotide coverage for inter chromososomal pairs.

ADD REPLY
1
Entering edit mode

I think you forgot a sort -k1,1 in there after the cut.

And just for any other readers: The samtools command to sort bam files by name is samtools sort -n infile outfile

ADD REPLY
1
Entering edit mode
ADD COMMENT
0
Entering edit mode

almost, but I do not want summary style data. Just regular coverage but extended to the theoretical max

ADD REPLY

Login before adding your answer.

Traffic: 2698 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6