Question: Plotting SNP density along a chromosome from VCF files
rc16955 wrote, 2.3 years ago:

Hi all,

I have generated VCF files from a range of samples of my study organism using a published reference genome. I now want to use these to assess where diversity is along a given chromosome; in other words I want to know the density of SNPs along chromosome 2.

One way I had thought of doing this was to split the chromosome into 10Kb chunks, count the number of SNPs in each one, and plot this in a chart. So what I want is basically:

Chr2:1-10000, SNPs = X
Chr2:10000-20000, SNPs = Y
etc.

But I can't figure out an efficient way of actually extracting this information from my VCF files. I am able to use the command

tabix -h myfile.vcf.gz chr2:[10Kregion] > myoutput.vcf

to extract a given region of any VCF file and then use bcftools stats to count the number of SNPs therein, but the chromosome is 16Mb long; there must be a more efficient way to do this than for me to manually extract 1,600 VCF files and examine them with bcftools stats.

I wonder if there's some software that could help with this. I have Integrated Genome Viewer, MEGA and SeaView, but if any of them have this kind of functionality, I haven't found it yet.

Otherwise, any help or advice would be much appreciated. It must be doable because I see plenty of published papers with SNP density plotted against chromosome position, but I can't figure out how to do it myself.

Thanks in advance!

Tags: snp, genome
sacha (France) wrote, 2.3 years ago:

I did something similar with the human genome here: https://github.com/dridk/snp_location
Briefly, I create my bins (10,000 bp) using bedtools makewindows:

bedtools makewindows -g hg19.sizes -w 10000 > windows.bed

Then I count the SNPs per bin using bedtools coverage:

bedtools coverage -a windows.bed -b yourdata.vcf -counts > coverage.txt

Finally, I plot the data in R with the IdeoViz package; see https://github.com/dridk/snp_location/blob/master/plot_chrom.r .
This gives a plot of SNP density per bin:

[Image: plot of SNP density per bin]
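
If bedtools is not at hand, the same per-window count can be sketched in plain awk. This is only an illustration of the binning idea, not sacha's method: the function name count_snps_per_window is made up here, the input is assumed to be a position-sorted VCF (plain or gzip/bgzip compressed), and "SNP" is taken to mean a record whose REF and ALT alleles are all single bases. Windows without any SNP are simply not printed.

```shell
#!/bin/sh
# count_snps_per_window VCF [WINDOW]
# Hypothetical helper (not from the thread). Prints one BED-like line per
# non-empty window: chrom <TAB> start (0-based) <TAB> end <TAB> SNP count.
count_snps_per_window() {
    vcf=$1
    win=${2:-10000}                            # window size in bp, default 10 kb
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;               # bgzip files are valid gzip
        *)    cat "$vcf" ;;
    esac | awk -F '\t' -v w="$win" '
        /^#/ { next }                          # skip header lines
        # keep SNPs only: one-base REF and one-base ALT allele(s)
        $4 !~ /^[ACGTNacgtn]$/ { next }
        $5 !~ /^[ACGTNacgtn](,[ACGTNacgtn])*$/ { next }
        {
            start = int(($2 - 1) / w) * w      # 0-based start of the window
            key = $1 "\t" start
            if (!(key in n)) order[++k] = key  # remember first-seen order
            n[key]++
        }
        END {
            for (i = 1; i <= k; i++) {
                split(order[i], f, "\t")
                printf "%s\t%d\t%d\t%d\n", f[1], f[2], f[2] + w, n[order[i]]
            }
        }'
}
```

For example, `count_snps_per_window myfile.vcf.gz 10000 > snp_counts.bed` would write a four-column table that can be read straight into R for plotting.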


I have tried using bedtools coverage and get errors at several lines of the file:

WARNING: line number 10210 of file chr21.vcf.gz has an imprecise variant, but no SVLEN specified. Defaulting to length 1.

The resulting output gives a count of 0 at every window. Can anyone help with this?

written 14 months ago by spiral0180
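
One common cause of a count of 0 in every window is a chromosome naming mismatch between the two files, for example "21" in windows.bed and "chr21" in the VCF. That is only a guess here, since the thread does not confirm it for this case. A minimal check, using a made-up helper name check_chrom_names, could look like this:

```shell
#!/bin/sh
# check_chrom_names WINDOWS_BED VCF
# Hypothetical helper (not from the thread). Prints every chromosome name
# that occurs in the VCF but not in the BED windows file.
# Empty output means the names agree.
check_chrom_names() {
    bed=$1
    vcf=$2
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | awk -F '\t' '
        NR == FNR { seen[$1] = 1; next }       # first input: names from the BED file
        /^#/ { next }                          # skip VCF header lines
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$bed" -
}
```

If `check_chrom_names windows.bed chr21.vcf.gz` prints anything, the windows file and the VCF do not share chromosome names, and one of them needs renaming before bedtools can find overlaps.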

Hi Sacha, I want to do something similar, but for each sample in a multi-sample VCF file. I tried extracting a single sample from the multi-sample VCF and then ran your commands, but the variant counts are the same for every sample (probably because the extracted file still lists every site where a variant was called in any of the samples). Could you guide me on how to extract the variants unique to each individual and then plot them per bin?

written 5 months ago by nagarsaggi0
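
For the multi-sample case, one possible approach is to read the genotype columns directly and count, per sample and per window, only the sites where that sample carries a non-reference allele. The sketch below makes several assumptions: the function name count_snps_per_sample is invented for illustration, GT is the first FORMAT subfield, "SNP" means single-base REF and ALT alleles, and "variants for a sample" means any site with a non-reference genotype in that sample rather than variants private to it.

```shell
#!/bin/sh
# count_snps_per_sample VCF [WINDOW]
# Hypothetical helper (not from the thread). Prints, for each sample and
# non-empty window: sample <TAB> chrom <TAB> start <TAB> end <TAB> count.
count_snps_per_sample() {
    vcf=$1
    win=${2:-10000}
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | awk -F '\t' -v w="$win" '
        /^##/ { next }                         # skip meta-information lines
        /^#CHROM/ {                            # sample names start at column 10
            for (i = 10; i <= NF; i++) name[i] = $i
            next
        }
        # keep SNPs only: one-base REF and one-base ALT allele(s)
        $4 !~ /^[ACGTNacgtn]$/ { next }
        $5 !~ /^[ACGTNacgtn](,[ACGTNacgtn])*$/ { next }
        {
            start = int(($2 - 1) / w) * w
            for (i = 10; i <= NF; i++) {
                split($i, fld, ":")            # assumes GT is the first subfield
                if (fld[1] ~ /[1-9]/) {        # at least one non-reference allele
                    key = name[i] "\t" $1 "\t" start
                    if (!(key in n)) order[++k] = key
                    n[key]++
                }
            }
        }
        END {
            for (j = 1; j <= k; j++) {
                split(order[j], f, "\t")
                printf "%s\t%s\t%d\t%d\t%d\n", f[1], f[2], f[3], f[3] + w, n[order[j]]
            }
        }'
}
```

The output is in first-seen order, so piping it through `sort -k1,1 -k2,2 -k3,3n` groups the rows by sample before plotting.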
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 2.3 years ago:

Using sqlite (here with 10 Mb bins):

 curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz" |\
gunzip -c | grep -v "^#" |\
awk -F '\t' 'BEGIN{printf("create table T(c text,p int); BEGIN TRANSACTION;\n");} {printf("insert into T(c,p) values(\"%s\",%d);\n",$1,$2);} END {printf("COMMIT; SELECT c,CAST(p/1E7  AS INT)*1E7, count(*) from T group by c,CAST(p/1E7  AS INT)*1E7 ; drop table T;");}' |\
sqlite3 -separator ' ' tmp.sqlite  && rm tmp.sqlite

1 0.0 245
1 10000000.0 208
1 20000000.0 191
1 30000000.0 197
1 40000000.0 211
1 50000000.0 198
1 60000000.0 159
1 70000000.0 240
1 80000000.0 194
1 90000000.0 165
1 100000000.0 220
1 110000000.0 206
1 120000000.0 15
1 140000000.0 101
1 150000000.0 202
1 160000000.0 179
1 170000000.0 181
1 180000000.0 231
1 190000000.0 260
(...)