Density Of 5' End Of Reads From Bam Averaged Over Bed File
2
0
Entering edit mode
11.0 years ago

Hi guys,

I'm trying to plot the average density of 5' ends of reads from a BAM file, for regions from a BED file. I expect the density to be average over the middle point of BED files, and the 2 strands plotted separately.

So far it doesn't seem the standard BedTools, BEDOPS or other tools (ngs.plot) are doing what I need.

Anyone have an easy solution? Or even an R package I could look at the build something custom if needed?

Thanks!

bed bam chip-seq • 2.8k views
ADD COMMENT
4
Entering edit mode
11.0 years ago

Process your bam file into a sorted BED file with your 5' ends, using BEDOPS bam2bed:

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "+") { \
            print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
        } \
        else { \
            print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedFivePrimeEnds.bed

(If you're using BEDOPS v2.1, then it is not necessary to post-process with BEDOPS sort-bed. If you're using an older installation, pipe to sort-bed before piping to awk.)

Now use BEDOPS bedmap to count 5' ends within regions. We assume unsortedRegions.bed is an unsorted, four-column BED file containing the coordinates of the region and an ID describing the region. In case there are more columns, I use cut to select the first four columns:

$ sort-bed unsortedRegions.bed | cut -f1-4 - > sortedRegions.bed
$ bedmap --echo --count --delim '\t' sortedRegions.bed sortedFivePrimeEnds.bed \
    > countsOfFivePrimeEndsOverRegions.bed

If your regions do not have ID values, use awk to add a placeholder ID value (e.g., id-123, etc.) in the sorted output from sort-bed:

$ sort-bed unsortedRegions.bed \
    | awk ' \
        BEGIN { idx = 0; } \
        { \
            print $1"\t"$2"\t"$3"\tidx-"idx; \
            idx++; \
        } \
    ' - \
    > sortedRegions.bed

In any case, as sortedRegions.bed is a four-column BED file, and because we use the --delim '\t' operator with bedmap, the result file called countsOfFivePrimeEndsOverRegions.bed will contain the number of ends in the fifth (score) column.

Now use awk to measure the average count over each region:

$ awk ' \
    { \
        regionStop = $3; \
        regionStart = $2; \
        regionWidth = regionStop - regionStart; \
        endCount = $5; \
        averageEnds = endCount / regionWidth; \
        print $1"\t"$2"\t"$3"\t"$4"\t"averageEnds; \
    }' countsOfFivePrimeEndsOverRegions.bed \
    > averageEndsOverRegions.bed

The average ends result is in the fifth (score) column of averageEndsOverRegions.bed.

Once you have an understanding of the process, it is possible to glue together most of these commands into a pipeline with standard UNIX pipes, which will improve speed greatly by removing the expense of creating intermediate files.

If you need to do strand-separation, change the initial bam2bed step appropriately:

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "+") { \
            print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedForwardStrandFivePrimeEnds.bed

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "-") { \
            print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedReverseStrandFivePrimeEnds.bed

Then do the follow-up bedmap and awk processing steps on each file separately, as described.

ADD COMMENT
1
Entering edit mode
11.0 years ago

use bed tools bamtobed to convert your bam to bed and pipe into awk to only keep the 3' position

bamToBed -i reads.bam  | awk '{printf("%s\t%d\t%d\n",$1,int($3)-1,int($3) );}

to split by strand, use samtools view with option -f or -F and the sam FLAG=16 (read reverse strand)

ADD COMMENT

Login before adding your answer.

Traffic: 3832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6