Question: Bed File Of Mapq Sliding Window On A Bam File?
4
gravatar for 14134125465346445
5.1 years ago by
United Kingdom
141341254653464453.4k wrote:

There may already be a recipe for this, so asking first before reinventing the wheel:

I would like to create a bed file where the score is the average mapQ from the reads of the input.bam file. I think bedtools or bedops are the way to go:

http://bedtools.readthedocs.org/en/latest/content/tools/bamtobed.html
http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/bam2bed.html

Other than simply running bamtobed/bam2bed, I would like to be able to define a sliding window size and step for the windows, of say, size=1000 and step=200.

I also would like to generate the bam2bed information only from a list of regions in regions.bed. E.g., something like:

mapq_sliding_windows --bam input.bam --wsize 1000 -wstep 200 --regions regions.bed > mapq_sliding_windows.bed

EDITED:

Thank you Aaron for you answer. I got it working but it's slow for my 30x WGS bams:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome    
bedtools makewindows -g hg19.genome -w 1000 -s 200 > hg19.windows.bed    
bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed

Last step is taking 6 hours and still running on an 86G bam. Any possible ways to make it faster?

bedtools bed bam • 3.1k views
ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by 141341254653464453.4k
4
gravatar for Aaronquinlan
5.1 years ago by
Aaronquinlan10k
United States
Aaronquinlan10k wrote:

Assuming your BAM file is sorted by chromosome and start position, and the chromosome ordering in the BAM and BED file are the same, the following bedtools solution will work.

Step 1. Make a BED file of sliding windows.

bedtools makewindows -g genomes/human.hg19.genome -w 1000 -s 200 > hg19.windows.bed
head hg19.windows.bed
chr1    0    1000
chr1    200    1200
chr1    400    1400
chr1    600    1600
chr1    800    1800
chr1    1000    2000
chr1    1200    2200
chr1    1400    2400
chr1    1600    2600
chr1    1800    2800

Step 2. Compute the mean MAPQ (column 5) for each window.

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i aln.bed) -c 5 -o mean
ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by Aaronquinlan10k

hmmm, where is the bam file here? Maybe aln.bed should be input.bam? I am getting lots of '.' instead of scores in the output of bedtools map...

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam) -c 5 -o mean | head
chr1 0 1000 .
chr1 200 1200 .
chr1 400 1400 .
chr1 600 1600 .
chr1 800 1800 .
chr1 1000 2000 .
chr1 1200 2200 .
chr1 1400 2400 .
chr1 1600 2600 .
chr1 1800 2800 .

bedtools --version bedtools v2.16.2

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by 141341254653464453.4k

Sorry, I thought aln.bed would be clear. Replaced with input.bed. The "." is the default NULL value for the map tool. If no reads align to a window, then there is no average MAPQ. The snippet you are showing is at the telomere.

ADD REPLYlink written 5.1 years ago by Aaronquinlan10k

Thanks, I think my problem was that a large part of the results were '.', but skipping the chrM seems to do the trick:

bedtools map -a hg19.windows.bed -b <(bedtools bamtobed -i input.bam | grep -v chrM) -c 5 -o mean > input.mapq.bed

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by 141341254653464453.4k

Adding the final commands in the question. It works but it's slow.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by 141341254653464453.4k

Not sure how fast you were expecting given the size of BAM files, but the 2.19.0 version of bedtools provides significant speed ups for the map tool. Based on a 86Gb file, I would still expect it to take on the order of an hour of two. https://github.com/arq5x/bedtools2/releases/tag/v2.19.0

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Aaronquinlan10k
1

Thanks for the info. Compiling with gcc 4.7.2, I get about 30,000 lines per minute with bedtools version 2.17.0, and about 100,000 lines per minute with bedtools2 version 2.19.0. This is the command-line (rounding up value with awk and compressing output):

/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools map -a /illumina/development/stripes/misc/hg19.windows.bed -b <(/illumina/thirdparty/bedtools/bedtools2-2.19.0/bin/bedtools bamtobed -i bad.bam | grep -v chrM) -c 5 -o mean | awk -F. '{print $1".0"}' | gzip -c > bad.mean.mapq.bed.gz

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by 141341254653464453.4k

That is about the speedup I would have expected. The forthcoming 2.20.0 version will provide another 2X speedup. Probably a week or two away.

ADD REPLYlink written 5.1 years ago by Aaronquinlan10k

Excellent! Looking forward to it!

ADD REPLYlink written 5.1 years ago by 141341254653464453.4k
3
gravatar for sjneph
5.1 years ago by
sjneph600
sjneph600 wrote:

bam2bed < your-file.bam | cut -f1-4,7 > your-file.bed
make-regions 1000 200 regions.bed | bedmap --echo --mean - your-file.bed > answer.bed

Where make-regions is an awk statement, perhaps buried in a shell script

awk -v winsize=$1 -v stepsize=$2 \
'{ \
  i = $2; \
  while ( i < $3 ) { \
    j = i+winsize; \
    if ( j > $3 ) { j = $3; } \
    print $1"\t"i"\t"j; \
    i += stepsize; \
  } \
}' $3

bam2bed will convert your bam file with no loss of information. The regions.bed file needs to be sorted with the sort-bed utility.

ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by sjneph600

updated to put MAPQ in field 5 of your-file.bed since that's what you want to average over.

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by sjneph600
2
gravatar for Zev.Kronenberg
5.1 years ago by
United States
Zev.Kronenberg11k wrote:

I wrote a tool (whammy) that calculates the average mapping quality of a position. With some window smoothing you have the answer you want.

https://github.com/jewmanchue/wham

A tab delimited text file. Columns:

Seqid.
Position.
Number of unmapped mates (target)
Number of unmapped mates (background)
Number of same strand mates (target)
Number of same strand mates (background)
Number of cross seqid mapped mates (target)
Number of cross seqid mapped mates (background)
Depth - Number of reads covering position (target)
Depth - Number of reads covering position (background)
Average insert length (target)
Average insert length (background)
Average mapping quality (target)
Average mapping quality (background)
Euclidean distances based on columns (3,4;5,6;7,8;13,14)

Since you probably don't have a target and background you can just put the same bam into option -t and option -b

ADD COMMENTlink written 5.1 years ago by Zev.Kronenberg11k

I actually want to compare two bams in certain situations, so great to see it does that :-)

ADD REPLYlink written 5.1 years ago by 141341254653464453.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1278 users visited in the last hour