Question

How To Count Strand-Specific Paired-End RNA-Seq Reads Overlapping Known Protein-Coding Genes?

10.4 years ago

biorepine ★ 1.5k

Dear Biostars

Does anyone know how to overlap strand-specific paired-end RNA-Seq reads (BAM) with known protein-coding genes (BED)?

I tried the following, but I don't think it is the correct way. I would appreciate your help!

bamToBed -i ES.bam > ES.bed
intersectBed -a ES.bed -b Ensembl_mm9.bed -wa -s | awk '!a[$4]++' | wc -l

paired-end rna-seq overlap • 3.7k views

updated 7.7 years ago by Biostar 20 • written 10.4 years ago by biorepine ★ 1.5k

Why don't you just make your life easier and use featureCounts or htseq-count? BTW, intersectBed can take a BAM file as input (use -abam instead of -a).

10.4 years ago by Devon Ryan 104k

I think both of the packages you mentioned take GFF format, not BED.

10.4 years ago by biorepine ★ 1.5k

Exactly, just download the GTF or GFF file for mm9 (or the Ensembl annotation, since it's unclear which you're using) instead of making a BED file out of things.

10.4 years ago by Devon Ryan 104k

But I have my own custom-made BED files, e.g. for novel transcripts.

10.4 years ago by biorepine ★ 1.5k

score 1 · Answer 1 · 2013-12-17

If you must use a BED file and intersectBed:

 intersectBed -a ES.bed -b Ensembl_mm9.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'

Keep in mind that intersectBed shouldn't be used to count RNAseq reads, since it increments the counter for every feature a read overlaps, even when the read overlaps more than one feature. As an example:

>cat foo.bed
chr1    0    100    read1    .    +
chr1    0    100    read2    .    -
chr1    50    150    read3    .    +
chr1    100    200    read4    .    +

>cat bar.bed
chr1    0    100    target1    .    +
chr1    0    100    target2    .    -
chr1    20    120    target3    .    +

>intersectBed -a foo.bed -b bar.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'
target1 2
target2 1
target3 3

Only target2 should have a count (or target1 should have 1 and target3 should have 2, depending on how you overlap things). You could script around this, but it's faster to just use a different tool.
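If you do want to script around it, one possible rule is to keep only reads that overlap exactly one same-strand feature and drop the rest as ambiguous. The sketch below is just one way to do that, not what featureCounts or htseq-count do internally. `count_unambiguous` is a hypothetical helper name; it assumes the 12-column layout that intersectBed produces for two BED6 files (read name in column 4, feature name in column 10) and that paired-end mates carry `/1` and `/2` suffixes, as bamToBed writes them. The demo lines are hand-written to mimic that layout for foo.bed and bar.bed, so the block runs without bedtools.

```shell
# Hypothetical helper: reads 12-column intersectBed output (A = reads, B = features)
# on stdin and prints "feature count", counting only reads that hit a single feature.
count_unambiguous() {
    awk '
        {
            name = $4                      # read name from the A (reads) file
            sub(/\/[12]$/, "", name)       # assumption: mates are named name/1 and name/2
            feature = $10                  # feature name from the B (annotation) file
            if (!((name, feature) in seen)) {
                seen[name, feature] = 1
                nfeat[name]++              # number of distinct features for this read
                only[name] = feature       # meaningful only when nfeat[name] == 1
            }
        }
        END {
            for (name in nfeat)
                if (nfeat[name] == 1)
                    count[only[name]]++
            for (feature in count)
                print feature, count[feature]
        }
    ' | sort
}

# Demo input: hand-written lines in the layout of
#   intersectBed -a foo.bed -b bar.bed -wb -s
# (space-separated here for readability; real output is tab-separated).
demo_counts=$(printf '%s\n' \
    'chr1 0 100 read1 . + chr1 0 100 target1 . +' \
    'chr1 20 100 read1 . + chr1 20 120 target3 . +' \
    'chr1 0 100 read2 . - chr1 0 100 target2 . -' \
    'chr1 50 100 read3 . + chr1 0 100 target1 . +' \
    'chr1 50 120 read3 . + chr1 20 120 target3 . +' \
    'chr1 100 120 read4 . + chr1 20 120 target3 . +' \
    | count_unambiguous)
printf '%s\n' "$demo_counts"
```

On real data the helper would sit at the end of the pipe, e.g. `intersectBed -a ES.bed -b Ensembl_mm9.bed -wb -s | count_unambiguous`. Under this particular rule read1 and read3 are dropped, read2 goes to target2, and read4 is kept for target3 because target3 is the only same-strand feature it touches; other assignment rules will give different numbers.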

Edit: I can recommend featureCounts (from Subread) and also htseq-count for this. Making a GTF should be relatively straightforward. Just increment the second column (BED starts are 0-based, GTF starts are 1-based), set the 4th as the gene_id and transcript_id, and shuffle around the remainder (adding some "." columns).
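As a rough illustration of that BED-to-GTF conversion, here is a minimal awk sketch. It assumes a plain BED6 file (chrom, start, end, name, score, strand) with one interval per feature; `bed6_to_gtf` is a hypothetical helper name and `custom` is just a placeholder for the GTF source column. Every interval is written as an `exon` line, since that is the feature type featureCounts and htseq-count look for by default. Multi-block BED12 records would need extra handling that is not shown.

```shell
# Hypothetical helper: print a minimal GTF (one "exon" line per BED6 interval).
bed6_to_gtf() {
    awk 'BEGIN { OFS = "\t" }
         NF >= 6 {
             # BED starts are 0-based and GTF starts are 1-based, so add 1.
             attrs = "gene_id \"" $4 "\"; transcript_id \"" $4 "\";"
             # "custom" is a placeholder source name; score and frame are ".".
             print $1, "custom", "exon", $2 + 1, $3, ".", $6, ".", attrs
         }' "$1"
}

# Demo on the bar.bed targets from the example above, written to a temp dir.
workdir=$(mktemp -d)
printf 'chr1\t0\t100\ttarget1\t.\t+\n'  >  "$workdir/bar.bed"
printf 'chr1\t0\t100\ttarget2\t.\t-\n'  >> "$workdir/bar.bed"
printf 'chr1\t20\t120\ttarget3\t.\t+\n' >> "$workdir/bar.bed"

bed6_to_gtf "$workdir/bar.bed" > "$workdir/bar.gtf"
cat "$workdir/bar.gtf"
```

The resulting file could then be handed to a counter, for instance `featureCounts -p -s 1 -a bar.gtf -o counts.txt ES.bam`. The right `-s` value (1 or 2) depends on the library protocol, and the flags are worth checking against the installed Subread version.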