Question

Density Of 5' End Of Reads From Bam Averaged Over Bed File

0

Entering edit mode

11.0 years ago

random5196648 • 0

Hi guys,

I'm trying to plot the average density of 5' ends of reads from a BAM file, for regions from a BED file. I expect the density to be average over the middle point of BED files, and the 2 strands plotted separately.

So far it doesn't seem the standard BedTools, BEDOPS or other tools (ngs.plot) are doing what I need.

Anyone have an easy solution? Or even an R package I could look at the build something custom if needed?

Thanks!

bed bam chip-seq • 2.8k views

ADD COMMENT • link updated 11.0 years ago by Alex Reynolds 35k • written 11.0 years ago by random5196648 • 0

Ram · Answer 1 · 2013-04-24

Process your bam file into a sorted BED file with your 5' ends, using BEDOPS bam2bed:

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "+") { \
            print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
        } \
        else { \
            print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedFivePrimeEnds.bed

(If you're using BEDOPS v2.1, then it is not necessary to post-process with BEDOPS sort-bed. If you're using an older installation, pipe to sort-bed before piping to awk.)

Now use BEDOPS bedmap to count 5' ends within regions. We assume unsortedRegions.bed is an unsorted, four-column BED file containing the coordinates of the region and an ID describing the region. In case there are more columns, I use cut to select the first four columns:

$ sort-bed unsortedRegions.bed | cut -f1-4 - > sortedRegions.bed
$ bedmap --echo --count --delim '\t' sortedRegions.bed sortedFivePrimeEnds.bed \
    > countsOfFivePrimeEndsOverRegions.bed

If your regions do not have ID values, use awk to add a placeholder ID value (e.g., id-123, etc.) in the sorted output from sort-bed:

$ sort-bed unsortedRegions.bed \
    | awk ' \
        BEGIN { idx = 0; } \
        { \
            print $1"\t"$2"\t"$3"\tidx-"idx; \
            idx++; \
        } \
    ' - \
    > sortedRegions.bed

In any case, as sortedRegions.bed is a four-column BED file, and because we use the --delim '\t' operator with bedmap, the result file called countsOfFivePrimeEndsOverRegions.bed will contain the number of ends in the fifth (score) column.

Now use awk to measure the average count over each region:

$ awk ' \
    { \
        regionStop = $3; \
        regionStart = $2; \
        regionWidth = regionStop - regionStart; \
        endCount = $5; \
        averageEnds = endCount / regionWidth; \
        print $1"\t"$2"\t"$3"\t"$4"\t"averageEnds; \
    }' countsOfFivePrimeEndsOverRegions.bed \
    > averageEndsOverRegions.bed

The average ends result is in the fifth (score) column of averageEndsOverRegions.bed.

Once you have an understanding of the process, it is possible to glue together most of these commands into a pipeline with standard UNIX pipes, which will improve speed greatly by removing the expense of creating intermediate files.

If you need to do strand-separation, change the initial bam2bed step appropriately:

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "+") { \
            print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedForwardStrandFivePrimeEnds.bed

$ bam2bed < foo.bam \
    | awk '{ \
        if ($6 == "-") { \
            print $1"\t"($3+1)"\t"($3+2)"\t"$4"\t"$5"\t"$6; \
        } \
    }' - \
    > sortedReverseStrandFivePrimeEnds.bed

Then do the follow-up bedmap and awk processing steps on each file separately, as described.

Ram · Answer 2 · 2013-04-24

1

Entering edit mode

11.0 years ago

Pierre Lindenbaum 161k

use bed tools bamtobed to convert your bam to bed and pipe into awk to only keep the 3' position

bamToBed -i reads.bam  | awk '{printf("%s\t%d\t%d\n",$1,int($3)-1,int($3) );}

to split by strand, use samtools view with option -f or -F and the sam FLAG=16 (read reverse strand)

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 11.0 years ago by Pierre Lindenbaum 161k