Question: How To Count Stand-Specific Paired-End Rna-Seq Reads Overlapping Known Protein Coding Genes ?
gravatar for biorepine
5.7 years ago by
biorepine1.4k wrote:

Dear Biostars

Does any one how to overlap stand-specific paired-end RNA-Seq reads (BAM) with known protein coding genes (BED) ?

I tried the following but I think it is not the correct way ? Would appreciate your help!

bamTobed -i ES.bam > ES.bed 
intersectBed -a ES.bed -b Ensembl_mm9.bed -wa -s |awk '!a[$4]++' |wc -l
overlap paired-end rna-seq • 2.4k views
ADD COMMENTlink modified 2.9 years ago by Biostar ♦♦ 20 • written 5.7 years ago by biorepine1.4k

Why don't you just make your life easier and use featureCounts or htseq-counts? BTW, intersectBed can take a BAM file as input (use -abam instead of -a).

ADD REPLYlink written 5.7 years ago by Devon Ryan91k

I think both packages that you mentioned take gff format but not BED.

ADD REPLYlink written 5.7 years ago by biorepine1.4k

Exactly, just download the GTF or GFF file for mm9 (or the Ensembl annotation, since it's unclear which you're using) instead of making a BED file out of things.

ADD REPLYlink written 5.7 years ago by Devon Ryan91k

but i have my own BED files that custom made like novel transcripts.

ADD REPLYlink written 5.7 years ago by biorepine1.4k
gravatar for Devon Ryan
5.7 years ago by
Devon Ryan91k
Freiburg, Germany
Devon Ryan91k wrote:

If you must use a BED file and intersectBed:

 intersectBed -a ES.bed -b Ensembl_mm9.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'

Keep in mind that intersectBed shouldn't be used to count RNAseq reads, since it will increment the counter for a feature even if a read maps to not just it but another feature. As an example:

>cat foo.bed
chr1    0    100    read1    .    +
chr1    0    100    read2    .    -
chr1    50    150    read3    .    +
chr1    100    200    read4    .    +

>cat bar.bed
chr1    0    100    target1    .    +
chr1    0    100    target2    .    -
chr1    20    120    target3    .    +

>intersectBed -a foo.bed -b bar.bed -wb -s | awk '{a[$10]++}END{for(idx in a) {print idx,a[idx]}}'
target1 2
target2 1
target3 3

Only target2 should have a count (or target1 should have 1 and target3 should have 2, depending on how you overlap things). You could script around this, but it's faster to just use a different tool.

Edit: I can recommend featureCounts (from Subread) and also htseq-count for this. Making a GTF should be relatively straight-forward. Just increment the second column, set the 4th as the gene_id and transcript_id and shuffle around the remainder (adding some "." columns).

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Devon Ryan91k

i just voted your feature count answer. May be include that in answer as well. it could be useful to others. I will try to see whether i can convert my BED into GTF and run the tools you suggested. Great thanx anyways!!

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by biorepine1.4k

I've added that bit to the answer and provided some links. I've never needed to convert a BED file to GTF, but it should be relatively straight-forward (I mentioned briefly how to do it in my update). Keep in mind that BED files don't have any way of representing spliced features, so I hope you don't have anything that's spliced (if you do and each of the exons has the same label, then you can still deal with that, just use python or perl and sort the input).

ADD REPLYlink written 5.7 years ago by Devon Ryan91k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1504 users visited in the last hour