How To Count The Reads Mapped To Exon And Exon-Exon Junctions?
4
3
Entering edit mode
8.5 years ago
lwc628 ▴ 220

I want to make a list of reads and binary value whether it was mapped to exon or exon-exon junctions, with the ultimate purpose of counting how many reads are mapped to exons and exon-exon junctions.

I have read people reports this statistics in papers, but I couldn't find exactly how to do this anywhere online.

Let's say I have a bam file(alignment) and gff(annotations of exons).

Only way I could think of now is to manually parse the sam files( by sorting and filtering reads that are properly mapped) and merge the coordinates with the gff file using bedtools.

But Is this a right way to do this? How would you do it? Is there a standard way to do this task? Is there a standard script or program for doing this?

exon sam bam reads gff • 12k views
ADD COMMENT
7
Entering edit mode
8.5 years ago

Since you are referring to exon-exon junctions I am going to assume you are analyzing RNA-seq data. If you generated your BAM file with tophat, you will already have a junctions.bed file. This file contains a summary of all the exon-exon junctions found and the number of reads that mapped to each.

As for generally getting exon/transcript counts by starting with a BAM file and GTF/GFF, you might consider the htseq package. As you mentioned, bedtools can also be used to solve this problem. The bamtobed function may allow you to avoid manually parsing the BAM file.

These discussions sounds relevant:

ADD COMMENT
1
Entering edit mode
8.5 years ago

Let's say I have a bam file(alignment) and gff(annotations of exons).

You can use the BEDOPS bedmap application and its --indicator operand or the --count operand to report any or count all reads which overlap exons.

First, use the BEDOPS conversion utilities to render your inputs into sorted BED files:

$ bam2bed < reads.bam > reads.bed
$ gff2bed < exons.gff > exons.bed

Then perform the indicator function operation:

$ bedmap --echo --indicator exons.bed reads.bed \
    | awk -F '|' '($2 == 1)' - \
    > exons-with-overlapping-reads.bed

The file exons-with-overlapping-reads.bed is a sorted BED file that contains exons that have one or more overlapping reads. The awk statement just filters for results that meet this condition.

If you want the actual count of reads over an exon, use the --count operator and skip the awk test:

$ bedmap --echo --count exons.bed reads.bed \
    > exons-with-number-of-overlapping-reads.bed

The file exons-with-number-of-overlapping-reads.bed is a sorted BED file that contains exons and the number of reads which overlap each exon (by one or more bases).

(The --indicator operand is the same as --count, where the indicator result is 1, or "true" if the mapped-element count is greater than zero, and 0, or "false" otherwise.)

If you're interested in subsets of the exons, use bedops --range or grep or other approaches (awk, Perl, etc.) to filter or adjust the coordinates in your exons.bed file, as needed. Then just run the bedmap steps, as previously described.

If you'd like to learn more about this toolkit, I have posted a brief summary of BEDOPS Bedops: The Fastest, Most Scalable, And Easily-Parallelizable Genome Analysis Toolkit!.

ADD COMMENT
1
Entering edit mode
8.5 years ago
ewre ▴ 240

bedtools is the choice for me to cope with the task u are dealing with. usually I get multiple bam files' coverage info with

bedtools -multicov a.bam b.bam xxx > multicov.txt
ADD COMMENT
0
Entering edit mode

you have an extra - in your command. The correct way would be bedtools multicov a.bam b.bam annotation.bed > multicov.txt.

ADD REPLY
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6