Question: Bin chromosome every 1kb and get average value
1
gravatar for dp0b
2.5 years ago by
dp0b60
dp0b60 wrote:

Hi,

I have a large file of each chromosome for sequence data and I want to get the average score in 1kb bins? Is there an easy way to do to this?

chr1 123 127 1890
chr1 208 210 90
chr1 215 220 50
chr1 225 230 100
chr1 250 300 800

I have tried bedtools, it will create the windows but then how do I get the average of the fourth value in that window? It also seems to be quite slow.

bedtools makewindows -w 1000 -g chromosome.txt > windows.txt

Any help would be much appreciated.

ADD COMMENTlink modified 2.4 years ago by Biostar ♦♦ 20 • written 2.5 years ago by dp0b60

Have a look at wiggletools.

ADD REPLYlink modified 2.5 years ago by genomax89k • written 2.5 years ago by Devon Ryan96k
4
gravatar for Pierre Lindenbaum
2.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

for fun, using awk+sqlite3, window size=100:

 rm -f tmp.db && awk 'BEGIN {printf("create table T(C TEXT,S INT,E INT,V INT); BEGIN TRANSACTION;\n");}{printf("INSERT INTO T(C,S,E,V) VALUES(\"%s\",%s,%d,%s);\n",$1,$2,$3,$4);} END {printf("COMMIT; SELECT C,(S/100)*100 as G,((S/100)+1)*100,AVG(V) FROM T GROUP BY C,G;");}' input.bed | sqlite3 -separator $'\t' tmp.db && rm -f tmp.db

chr1    100 200 1890.0
chr1    200 300 260.0
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Pierre Lindenbaum130k
2

You are a madman, Pierre!

ADD REPLYlink written 2.5 years ago by Joe18k
2

Now this is a funny definition of fun.

ADD REPLYlink written 2.5 years ago by h.mon31k

Thanks so much but getting the following error when I try it. Any ideas?

Error: near line 1: table T already exists
Error: near line 1502: cannot commit - no transaction is active
ADD REPLYlink written 2.5 years ago by dp0b60

ah sorry, typo in my command I forgot to change the name of my db when I've copied+paste. it should be tmp.db instead of jeter.db . I'll fix;

ADD REPLYlink written 2.5 years ago by Pierre Lindenbaum130k

Thank you! Last question now I swear, piping that screen to an output file, the usual > output.txt after the code remains empty also tried |tee. Dont want to edit the script as its way above my capabilities!

ADD REPLYlink written 2.5 years ago by dp0b60
1

Where are you adding the redirect? It will need to be after tmp.db but before && rm -f...

ADD REPLYlink written 2.5 years ago by Joe18k

Got it!!Thank you!!!

ADD REPLYlink written 2.5 years ago by dp0b60

Seriously love this code,edit it for max,min,average values.Different distance. Works so fast too. Honestly thank you!

ADD REPLYlink written 2.5 years ago by dp0b60
4
gravatar for ATpoint
2.5 years ago by
ATpoint38k
Germany
ATpoint38k wrote:

Use bedtools map:

bedtools map -a windows.bed -b seqdata.bed -c 4 -o mean > bins_averaged.txt

-c 4 tells bedtools to perform the averaging operation (-o mean) on the 4th column of -b

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by ATpoint38k
1

Even though Pierre has created a strange little masterpiece up there, I think this is going to be the best answer for most people :)

ADD REPLYlink written 17 months ago by Chris Miller21k

Works very well indeed. If I may add, for others, that one can get 'smoothed' (overlapping) windows by doing the following:

bedtools makewindows \
  -w 1000 -s 500 \
  -g /ReferenceMaterial/1000Genomes/hg19/human_g1k_v37.fasta.fai > human_g1k_v37_windows.bed ;

bedtools map \
  -a human_g1k_v37_windows.bed \
  -b GSM3444988_K562_DMSO.bed \
  -c 4 -o mean > GSM3444988_K562_DMSO_Smoothed.bed ;
ADD REPLYlink written 5 months ago by Kevin Blighe65k
3
gravatar for Alex Reynolds
2.5 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

You can use Kent utilities fetchChromSizes to get the chromosome bounds of your genome of interest, e.g. hg38:

$ fetchChromSizes hg38 | awk '($0!~"_"){ print $1"\t0\t"$2; }' | sort-bed - > hg38.bed

You can then use BEDOPS bedops --chop and bedmap --mean to generate 1000-nt windows and score the mean signal of signal.bedgraph elements that overlap those windows, resp.:

$ bedmap --echo --mean --delim '\t' <(bedops --chop 1000 hg38.bed) <(awk -vOFS="\t" '{print $1, $2, $3, ".", $4; }' signal.bedgraph | sort-bed -) > answer.bed

There are several score operators, other than --mean, such as --median, --kth, --stdev, --wmean, etc. See bedmap --help for a full listing.

By using process substitutions, this should run pretty quickly. This can also be parallelized with bedops --chrom <chr> and bedmap --chrom <chr> if you are working with very large (whole-genome scale) datasets.

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Alex Reynolds30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1986 users visited in the last hour