Question

quantify counts from sorted bam files using a bed file containing regions of interest

0

Entering edit mode

2.1 years ago

bioyas ▴ 10

Hi everyone,

I have generated sorted bam files for multiple samples. I also have generated a particular BED file (tab limited file with Chr, start and end regions of interest). I would like to quantify reads mapped to these regions and generate a count matrix.

I used the following command to generate bam files for the desired regions stored in my.bed file (using original sample.bam)

samtools view -h -L my.bed sample.bam

The above command generates the new bam files. I sorted the bam file using sam sort. Now I would like to quantify the amount of reads mapped to each desired region as counts from the new bam file. I tried to use featureCount from subRead as below:

featureCounts -A my.Bed -o SampleCounts.txt sample.Sorted.bam

This command does not generate the counts. Am I supposed to only pass gtf file as annotation to feature count? Or my steps are wrong? Any idea is really appreciated.

Thanks

counts featureCount bam Bed • 1.9k views

ADD COMMENT • link updated 2.1 years ago by Malcolm.Cook ★ 1.5k • written 2.1 years ago by bioyas ▴ 10

score 1 · Answer 1 · 2022-03-07

For overlap and counting purposes based on a 3 column bed,

Bedtools intersect should help. Link

Please do see the different options like -c, -u, -wa, -wb which are very helpful under various contexts.

Also related are

https://bedtools.readthedocs.io/en/latest/content/tools/multicov.html https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

But if the purpose is differential expression input count or so, I believe it is recommended use featureCounts or HTSeq etc.

And on featurecounts query, I am not sure if gtf is 'mandatory' but mostly gtf is seen used with feature counts.To slightly detail, in use cases like RNA-Seq the exon-gene mapping info from gtf are quite very relevant with respect to the counting.

score 0 · Answer 2 · 2022-03-07

featureCounts does not work with .bed files.

If you are intent upon using some of the capabilities of featureCounts, you would want to first create a SAF or GFF formatted version of your .bed data.

This is pretty much a rearrangement of columns, with the possible addition of a unique first column to be the identifier of the region.

If your bed file file also has a name column (sounds like yours does not, but I thought I'd check), you might want to use it as the identifier. The unix cut command can easily reorder columns.

If your bed file does NOT already have a unique identifier, you need an approach to creating one. Converting from BED to SAF/GFF gives some approaches for doing this. If I were to use the awk based approach, the only thing I would change about it would be to use a colon to separate the chromosome from the start position instead of a period, yielding identifiers that you can use to navigate your favorite genome browser (e.g. IGV, UCSC).