Question

Bedtools Coverage Read Counts

7.3 years ago

gtasource ▴ 60

Using bedtools makewindows, I generated a file that splits the genome into 5000-kilobase windows. Using bedtools coverage, I then counted how many reads from a specific BAM file fell into each of these 5000 kb windows. Looking at the bedtools coverage results, I see that the following pieces of information are reported for each window:

1.) The number of features in B that overlapped (by at least one base pair) the A interval.
2.) The number of bases in A that had non-zero coverage from features in B.
3.) The length of the entry in A.
4.) The fraction of bases in A that had non-zero coverage from features in B.

For example, for chromosome 1, positions 0 to 1000, I might see the following output:

CHR1 0 1000 3  300  1000 0.3000000

Here:
3 is the number of features in B that overlapped (by at least one base pair) the A interval.
300 is the number of bases in A that had non-zero coverage from features in B.
1000 is the length of the entry in A.
0.3000000 is the fraction of bases in A that had non-zero coverage from features in B.

If I only care about the number of reads that fall into a specific window, should I focus only on #1 (the number of features in B that overlapped the A interval by at least one base pair), which in this case is the number 3?
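To make the column layout concrete, here is a minimal Python sketch that splits one line of default bedtools coverage output into named fields. It assumes the A file has exactly three columns (chrom, start, end), so the four coverage values land in columns 4 to 7. The `CoverageRow` and `parse_coverage_line` names and the example line are made up for illustration, not real data.

```python
from typing import NamedTuple


class CoverageRow(NamedTuple):
    chrom: str
    start: int
    end: int
    n_reads: int         # column 4: B features overlapping the window by >= 1 bp
    covered_bases: int   # column 5: bases in A with non-zero coverage
    window_len: int      # column 6: length of the A entry
    covered_frac: float  # column 7: covered_bases / window_len


def parse_coverage_line(line: str) -> CoverageRow:
    """Parse one line of coverage output, assuming a 3-column A (windows) file."""
    f = line.split()
    return CoverageRow(
        f[0], int(f[1]), int(f[2]),
        int(f[3]), int(f[4]), int(f[5]), float(f[6]),
    )


# Hypothetical example line (not real data).
row = parse_coverage_line("chr1\t0\t1000\t3\t300\t1000\t0.3000000")
print(row.n_reads)  # the per-window read count is the fourth column
```

If the windows file carries extra columns (a name, for instance), the coverage fields shift to the right by the same number of columns, so the indices above would need adjusting.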

bedtools • 5.0k views

updated 7.3 years ago by Kevin Blighe 89k • written 7.3 years ago by gtasource ▴ 60

score 2 · Accepted Answer · 2018-03-17

Yes, for your situation, you want the first number, i.e., 3 features of B overlap the A feature (chr1:0-1000) by at least 1 base. You can, of course, require a larger overlap, for example with the -f (minimum overlap as a fraction of A) and -F (minimum overlap as a fraction of B) options of bedtools coverage. Would it make sense to count a read that overlaps a 5000 bp window by just a single base, for example? This is also where the final figure (0.3) can be useful: it indicates that only 30% of the A feature was covered by B features. Together, the two numbers could form something like a 2-pass filtering procedure: first filter on the read count, then on the covered fraction.
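A 2-pass filter along those lines could be sketched as follows in Python. This is only an illustration: the `two_pass_filter` name, the thresholds, and the example rows are assumptions, and sensible cut-offs depend on your read length and window size.

```python
def two_pass_filter(rows, min_reads=1, min_frac=0.1):
    """Keep windows that pass both a read-count and a covered-fraction cut-off.

    rows: iterable of (chrom, start, end, n_reads, covered_bases, length, frac),
    i.e. the default coverage columns for a 3-column windows file.
    """
    kept = []
    for chrom, start, end, n_reads, covered, length, frac in rows:
        # Pass 1: enough reads overlap the window at all.
        if n_reads < min_reads:
            continue
        # Pass 2: enough of the window is actually covered.
        if frac < min_frac:
            continue
        kept.append((chrom, start, end, n_reads))
    return kept


# Hypothetical rows (not real data).
rows = [
    ("chr1", 0, 1000, 3, 300, 1000, 0.3),       # passes both filters
    ("chr1", 1000, 2000, 1, 1, 1000, 0.001),    # one read, touching by 1 base
    ("chr1", 2000, 3000, 0, 0, 1000, 0.0),      # no reads
]
kept = two_pass_filter(rows, min_reads=1, min_frac=0.1)
print(kept)
```

With these made-up thresholds, only the first window survives: the second has a read but almost no covered bases, and the third has no reads at all.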

This simple logic is essentially the same as that used by, for example, featureCounts, which counts reads over the features in a GTF/GFF file. Colleagues and I have used BEDTools coverage in the past to produce raw counts from Cufflinks / StringTie-generated GTFs and BAMs. For particular RNA-seq experiments, BEDTools coverage does exactly the same job as featureCounts.
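The "overlaps by at least one base" rule itself can be sketched in a few lines of Python. This is a toy version for a single chromosome using BED-style half-open coordinates, not how bedtools or featureCounts is implemented; `count_overlaps` and the coordinates are invented for illustration.

```python
def count_overlaps(windows, reads):
    """Count, for each window, the reads that overlap it by at least 1 base.

    windows, reads: lists of (start, end) on one chromosome, using 0-based,
    half-open coordinates as in BED.
    """
    counts = []
    for w_start, w_end in windows:
        # Two half-open intervals share at least one base exactly when
        # each one starts before the other ends.
        n = sum(1 for r_start, r_end in reads
                if r_start < w_end and r_end > w_start)
        counts.append(n)
    return counts


# Hypothetical coordinates (not real data).
windows = [(0, 1000), (1000, 2000)]
reads = [(10, 60), (990, 1040), (1000, 1050), (1999, 2049)]
counts = count_overlaps(windows, reads)
print(counts)
```

Note that in this sketch the read at 990-1040 spans the window boundary and is therefore counted in both windows, which is worth keeping in mind if you later sum counts across windows.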

Kevin