Question: How To Count The Reads Mapped To Exon And Exon-Exon Junctions?
gravatar for lwc628
7.3 years ago by
United States
lwc628220 wrote:

I want to make a list of reads and binary value whether it was mapped to exon or exon-exon junctions, with the ultimate purpose of counting how many reads are mapped to exons and exon-exon junctions.

I have read people reports this statistics in papers, but I couldn't find exactly how to do this anywhere online.

Let's say I have a bam file(alignment) and gff(annotations of exons).

Only way I could think of now is to manually parse the sam files( by sorting and filtering reads that are properly mapped) and merge the coordinates with the gff file using bedtools.

But Is this a right way to do this? How would you do it? Is there a standard way to do this task? Is there a standard script or program for doing this?

reads gff exon sam bam • 11k views
ADD COMMENTlink modified 15 months ago by m131131537810 • written 7.3 years ago by lwc628220
gravatar for Malachi Griffith
7.3 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

Since you are referring to exon-exon junctions I am going to assume you are analyzing RNA-seq data. If you generated your BAM file with tophat, you will already have a junctions.bed file. This file contains a summary of all the exon-exon junctions found and the number of reads that mapped to each.

As for generally getting exon/transcript counts by starting with a BAM file and GTF/GFF, you might consider the htseq package. As you mentioned, bedtools can also be used to solve this problem. The bamtobed function may allow you to avoid manually parsing the BAM file.

These discussions sounds relevant:

ADD COMMENTlink modified 6.2 years ago • written 7.3 years ago by Malachi Griffith18k
gravatar for Alex Reynolds
7.3 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Let's say I have a bam file(alignment) and gff(annotations of exons).

You can use the BEDOPS bedmap application and its --indicator operand or the --count operand to report any or count all reads which overlap exons.

First, use the BEDOPS conversion utilities to render your inputs into sorted BED files:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < exons.gff > exons.bed

Then perform the indicator function operation:

$ bedmap --echo --indicator exons.bed reads.bed \
    | awk -F '|' '($2 == 1)' - \
    > exons-with-overlapping-reads.bed

The file exons-with-overlapping-reads.bed is a sorted BED file that contains exons that have one or more overlapping reads. The awk statement just filters for results that meet this condition.

If you want the actual count of reads over an exon, use the --count operator and skip the awk test:

$ bedmap --echo --count exons.bed reads.bed \
    > exons-with-number-of-overlapping-reads.bed

The file exons-with-number-of-overlapping-reads.bed is a sorted BED file that contains exons and the number of reads which overlap each exon (by one or more bases).

(The --indicator operand is the same as --count, where the indicator result is 1, or "true" if the mapped-element count is greater than zero, and 0, or "false" otherwise.)

If you're interested in subsets of the exons, use bedops --range or grep or other approaches (awk, Perl, etc.) to filter or adjust the coordinates in your exons.bed file, as needed. Then just run the bedmap steps, as previously described.

If you'd like to learn more about this toolkit, I have posted a brief summary of BEDOPS Bedops: The Fastest, Most Scalable, And Easily-Parallelizable Genome Analysis Toolkit!.

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Alex Reynolds30k
gravatar for ewre
7.3 years ago by
United States
ewre220 wrote:

bedtools is the choice for me to cope with the task u are dealing with. usually I get multiple bam files' coverage info with

bedtools -multicov a.bam b.bam xxx > multicov.txt
ADD COMMENTlink written 7.3 years ago by ewre220

you have an extra - in your command. The correct way would be bedtools multicov a.bam b.bam annotation.bed > multicov.txt.

ADD REPLYlink modified 12 months ago • written 12 months ago by Dataman330
gravatar for m13113153781
15 months ago by
m131131537810 wrote:

here is a R package could do this ,

ADD COMMENTlink written 15 months ago by m131131537810
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 822 users visited in the last hour