Question

sliding windows over genome

0

Entering edit mode

23 months ago

gubrins ▴ 290

Heys,

I am interested in doing 100K and 500K sliding windows along a reference genome I have.

In the past, I managed to do it selecting columns 1 and 2 from the fai file and later with bedtools makewindows:

cut -f1,2 reference.fai > sizes.genome
bedtools makewindows -g sizes.genome -w 1000000 > 1MB.txt

But this is not working, as it makes the windows at the chromosome level. The file I would like to obtain is like this one:

1-1000000
1000001-2000000
2000001-3000000
3000001-4000000
...
2325000001-2326000000
2326000001-2327000000
2327000001-2328000000
2328000001-2329000000

How can I do it? Thanks in advance!

bash • 1.8k views

ADD COMMENT • link 23 months ago by gubrins ▴ 290

0

Entering edit mode

windows at the chromosome level

why is this not what you want, or, what other way would you want it to work?

ADD REPLY • link 23 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

I would like to have a file with the last sliding windows being the size of all the genome.

ADD REPLY • link 23 months ago by gubrins ▴ 290

score 0 · Answer 1 · 2022-05-17

0

Entering edit mode

23 months ago

Alex Reynolds 35k

Assuming your reference genome is hg38 (replace it with the key for your assembly of choice), you can use bedops --chop and awk pretty easily:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 100000 - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.1based.txt

The last window for each nuclear chromosome will almost certainly be less than 100k nt in size.

If you don't want that straggler to be in your text file, you can excise it by adding -x to bedops --chop, i.e., :

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 100000 -x - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.1based.noStragglingBin.txt

Further, you will have bins for each chromosome, which will lead to duplicates. It is not clear from your question how you intend to handle that. If you just want something for the largest chromosome, use awk to filter on that chromosome:

$ fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '($1 == "chr1"){ print $1, "0", $2 }' | bedops --chop 100000 - | awk -v FS="\t" -v OFS="\t" '{ print $2+1"-"$3 }' > 100k.chr1.1based.txt

Learn to do things with standard input and output, it will make your life easier and your work faster!

Also, not to be pedantic, but these are not typically called sliding windows. Sliding would step over the genome at increments, leading to overlapping bins. These are disjoint windows. I'm mentioning this terminology as it can help you better communicate your experiment to others.

Hope this helps!

ADD COMMENT • link 23 months ago by Alex Reynolds 35k

0

Entering edit mode

Thanks for your answer! Is this method going to work with a non-model genome assembly? Not sure if my genome is as available as the human genome. Regarding the correction, disjoint windows is the same as non-overlapping windows? I usually use the later (it is true that I did not do it here), but now I am not sure if this term is also correct.

Thanks!

ADD REPLY • link 23 months ago by gubrins ▴ 290

0

Entering edit mode

Non-overlapping windows could have gaps between them. Disjoint windows do not overlap but do not have gaps. Your problem describes disjoint windows.

If you do not have a model assembly, you would need a chromosome sizes file from somewhere or you would make one from the data that you have. Feel free to describe what you have.

ADD REPLY • link 23 months ago by Alex Reynolds 35k

0

Entering edit mode

Thansk for your reply. I have this file:

NC_018723.3 242100913
NC_018724.3 171471747
NC_018725.3 143202405
NC_018726.3 208212889
NC_018727.3 155302638
NC_018728.3 149751809
NC_018729.3 144528695
NC_018730.3 222790142
NC_018731.3 161193150
NC_018732.3 117648028
NC_018733.3 90186660
NC_018734.3 96884206
NC_018735.3 96521652
NC_018736.3 63494689
NC_018737.3 64340295
NC_018738.3 44648284
NC_018739.3 71664243
NC_018740.3 85752456

And before with bedtools makewindows I was able to obtain disjoint windows from the first genomic position to the last (at the genome level) but now when I do it, I just obtain it at the chromosome level.

ADD REPLY • link 23 months ago by gubrins ▴ 290

score 0 · Answer 2 · 2022-05-17

0

Entering edit mode

23 months ago

cmdcolin ★ 3.8k

it seems like a workflow might be samtools faidx yourfile.fa; bedtools makewindows -w 1000 -g yourfile.fa.fai

ADD COMMENT • link 23 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

that's at the chromosome level, I want a file with 1 as the first position of the first chromosome and the last position being the total length of the genome.

ADD REPLY • link 23 months ago by gubrins ▴ 290