Mince a vcf into n bins of a given range
2
0
Entering edit mode
4 months ago
AndrMod • 0

I have a vcf file containing SNPs from alla across a genome and thanks to the CHROM field I can split it into its various chromosomes; each SNP's POS is already relative to its chromosome position (indeed).

I want to split each chromosome in bins (windows) of 100,000 bp, thus obtaining some bins more enriched of my SNPs while other almost empty or straight empty. eg. :

object1: CHROM1
CHROM 1, bin 1 [1 - 100,000 bp] = 0 SNPs
CHROM 1, bin 2 [100,001 - 200,000 bp] = 12 SNPs (names of the SNPs)
...

object2: CHROM 2
CHROM 2, bin 1 [1 - 100,00] = 4 SNPs (names of the SNPs)


I managed to concot a manual approach on R. It has the good feature that each binned chromosome is a list of bins, each containing the SNP name and POS; however, it's very slow: it requires individual manipulation both of the chromosomes and the ranges.

I wonder if there is a faster way,be it by R or other tools, to bin chromosomes by predefinite ranges and obtainig bin easily attributable to their chromosomes.

So far I investigated various tools, but mostly they proceed the other way - putting n SNPs per bin, so that very unlinked SNPs may be binned together.

I will appreciate any light cast on this topic. Thanks to everybody.

vcf binning • 488 views
ADD COMMENT
0
Entering edit mode

If you have range in bed format, I think bedtools intersect with count will help you. It takes vcf and bed format for input.

ADD REPLY
0
Entering edit mode

I haven't the ranges in any format, but I suppose I can automate-write one of such files via R. Thanks!

ADD REPLY
0
Entering edit mode
4 months ago

so a solution could be:

i=0; java -jar dist/vcf2intervals.jar -D 1m --bed indexed.vcf.gz | awk  '{printf("%s:%d-%s\n",$1,int($2)+1,$3);}' | while read R; do bcftools view -o "out.${i}.vcf.gz" -O z --regions "${R}" indexed.vcf.gz; i=$((i+1)); done

ADD COMMENT
0
Entering edit mode
4 months ago

You can make bins with BEDOPS bedops --chop and Kent utilities fetchChromSizes:

$fetchChromSizes hg38 | grep -v '_*_' | awk -v FS="\t" -v OFS="\t" '{ print$1, "0", $2 }' | sort-bed - | bedops --chop 100000 - > hg38.100kb.bed  Replace hg38 with your reference genome name, if different. Once you have bins, you can map them to your VCF file via vcf2bed: $ bedmap --echo --count --delim '\t' hg38.100kb.bed <(vcf2bed --sort-tmpdir=\${PWD} --max-mem=2G < variants.vcf) > answer.bg


The file answer.bg is a bedGraph file: four columns representing genomic position and counts of variants over each bin. This can be easily imported into R.

ADD COMMENT
0
Entering edit mode

Thanks, this seems to work for the overall binning! I just wonder if there is a way to recover the names of the SNPs stored in each bin.

ADD REPLY
0
Entering edit mode

In the bedmap command, add --echo-map-id after --echo or --count, depending on where you want the names in the output. See bedmap --help for a full listing of options.

ADD REPLY

Login before adding your answer.

Traffic: 2073 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6