Question: Bin chromosome every 1kb and get average value
1
gravatar for dp0b
20 months ago by
dp0b40
dp0b40 wrote:

Hi,

I have a large file of each chromosome for sequence data and I want to get the average score in 1kb bins? Is there an easy way to do to this?

chr1 123 127 1890
chr1 208 210 90
chr1 215 220 50
chr1 225 230 100
chr1 250 300 800

I have tried bedtools, it will create the windows but then how do I get the average of the fourth value in that window? It also seems to be quite slow.

bedtools makewindows -w 1000 -g chromosome.txt > windows.txt

Any help would be much appreciated.

ADD COMMENTlink modified 19 months ago by Biostar ♦♦ 20 • written 20 months ago by dp0b40

Have a look at wiggletools.

ADD REPLYlink modified 20 months ago by genomax75k • written 20 months ago by Devon Ryan93k
4
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

for fun, using awk+sqlite3, window size=100:

 rm -f tmp.db && awk 'BEGIN {printf("create table T(C TEXT,S INT,E INT,V INT); BEGIN TRANSACTION;\n");}{printf("INSERT INTO T(C,S,E,V) VALUES(\"%s\",%s,%d,%s);\n",$1,$2,$3,$4);} END {printf("COMMIT; SELECT C,(S/100)*100 as G,((S/100)+1)*100,AVG(V) FROM T GROUP BY C,G;");}' input.bed | sqlite3 -separator $'\t' tmp.db && rm -f tmp.db

chr1    100 200 1890.0
chr1    200 300 260.0
ADD COMMENTlink modified 20 months ago • written 20 months ago by Pierre Lindenbaum124k
2

You are a madman, Pierre!

ADD REPLYlink written 20 months ago by Joe15k
1

Now this is a funny definition of fun.

ADD REPLYlink written 20 months ago by h.mon28k

Thanks so much but getting the following error when I try it. Any ideas?

Error: near line 1: table T already exists
Error: near line 1502: cannot commit - no transaction is active
ADD REPLYlink written 20 months ago by dp0b40

ah sorry, typo in my command I forgot to change the name of my db when I've copied+paste. it should be tmp.db instead of jeter.db . I'll fix;

ADD REPLYlink written 20 months ago by Pierre Lindenbaum124k

Thank you! Last question now I swear, piping that screen to an output file, the usual > output.txt after the code remains empty also tried |tee. Dont want to edit the script as its way above my capabilities!

ADD REPLYlink written 20 months ago by dp0b40
1

Where are you adding the redirect? It will need to be after tmp.db but before && rm -f...

ADD REPLYlink written 20 months ago by Joe15k

Got it!!Thank you!!!

ADD REPLYlink written 20 months ago by dp0b40

Seriously love this code,edit it for max,min,average values.Different distance. Works so fast too. Honestly thank you!

ADD REPLYlink written 20 months ago by dp0b40
3
gravatar for ATpoint
20 months ago by
ATpoint26k
Germany
ATpoint26k wrote:

Use bedtools map:

bedtools map -a windows.bed -b seqdata.bed -c 4 -o mean > bins_averaged.txt

-c 4 tells bedtools to perform the averaging operation (-o mean) on the 4th column of -b

ADD COMMENTlink modified 20 months ago • written 20 months ago by ATpoint26k

Even though Pierre has created a strange little masterpiece up there, I think this is going to be the best answer for most people :)

ADD REPLYlink written 7 months ago by Chris Miller21k
2
gravatar for Alex Reynolds
20 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You can use Kent utilities fetchChromSizes to get the chromosome bounds of your genome of interest, e.g. hg38:

$ fetchChromSizes hg38 | awk '($0!~"_"){ print $1"\t0\t"$2; }' | sort-bed - > hg38.bed

You can then use BEDOPS bedops --chop and bedmap --mean to generate 1000-nt windows and score the mean signal of signal.bedgraph elements that overlap those windows, resp.:

$ bedmap --echo --mean --delim '\t' <(bedops --chop 1000 hg38.bed) <(awk -vOFS="\t" '{print $1, $2, $3, ".", $4; }' signal.bedgraph | sort-bed -) > answer.bed

There are several score operators, other than --mean, such as --median, --kth, --stdev, --wmean, etc. See bedmap --help for a full listing.

By using process substitutions, this should run pretty quickly. This can also be parallelized with bedops --chrom <chr> and bedmap --chrom <chr> if you are working with very large (whole-genome scale) datasets.

ADD COMMENTlink modified 20 months ago • written 20 months ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 996 users visited in the last hour