Question

Create 10,000bp windows for a SNP file and assign each SNP to its respective window

0

Entering edit mode

7 months ago

nitinra ▴ 50

Hello all,

I have a tab file with 700k SNPs (results from BayPass analysis of p-values for Genotype x environment association). I want to create 10k bp window intervals for each of the chromosome/contig and assign each SNP to its corresponding window.

This is the file I have (sample):

Chromosome     SNP        bf.db
Chr_1                 2007         -22.5
Chr_1                 7200        16.05
Chr_1                11500         12.05
Chr_1                 18000        61.05

This is what I need:

Chromosome     SNP        bf.db.     window
Chr_1                 2007         -22.5     0-10000
Chr_1                 7200        16.05     0-10000
Chr_1                11500         12.05  10001-20000
Chr_1                 18000        61.05  10001-20000

I can create windows using bedtools makewindows, but I do not know how to assign each SNP to it's corresponding window.

TIA!

genome snp vcf • 744 views

ADD COMMENT • link updated 7 months ago by Alex Reynolds 35k • written 7 months ago by nitinra ▴ 50

score 4 · Accepted Answer · 2023-09-18

4

Entering edit mode

7 months ago

Pierre Lindenbaum 161k

I can create windows using bedtools makewindows, but I do not know how to assign each SNP to it's corresponding window.

convert the first file to bed using awk something like /(no tested):

awk '/^Chromosome/ {next;} {printf("%s\t%d\t%s\t%s\n",$1,int($2)-1,$2,$3);}' < file1.tsv

and then combine both files using bedtools intersect

ADD COMMENT • link 7 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Pierre Lindenbaum im not ashamed to admit i came to this thread just so i could see how to do it in a one-liner with awk

ADD REPLY • link 7 months ago by LauferVA 4.2k

score 2 · Accepted Answer · 2023-09-18

Using BEDOPS, starting with a VCF-formatted file of SNP variants over hg38, this is a one-liner that writes data to answer.bed:

bedmap --echo --echo-map <(fetchChromSizes hg38 | awk -vFS="\t" -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 10000 -) <(vcf2bed < in.vcf) > answer.bed

In separate lines, here are the separated commands:

fetchChromSizes hg38 | awk -vFS="\t" -vOFS="\t" '{ print $1, "0", $2 }' | sort-bed - | bedops --chop 10000 - > windows.10kb.bed
vcf2bed < variants.vcf > variants.bed
bedmap --echo --echo-map windows.10kb.bed variants.bed > answer.bed

The one-liner will run faster, but the separated-command approach is easier to read and modify.

If you don't have variants in VCF format, get them into VCF format, or use an awk statement as described in another answer. If you use UCSC fetchChromSizes, you'll need to modify the format of the chromosome name to UCSC standard (chr1, chr2, etc.) for the mapping in bedmap to work correctly.