Question: Common SNPs among all populations
0
gravatar for mostafarafiepour
2.2 years ago by
mostafarafiepour110 wrote:

Hi all Dear,

I have a VCF file that contains approximately 20 millions SNPs. And my VCF file has 5 populations (KHUZ, MAZ, WAZ, EAZ and GIL), each population has 10 samples. So, I want to know how many SNPs are common between the 5 populations?

Best Regard

snp vcf • 1.2k views
ADD COMMENTlink modified 2.2 years ago by zx87549.9k • written 2.2 years ago by mostafarafiepour110

Take a look at VCFtools

Usage guide

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Bastien Hervé4.9k

Many thanks for your reply,

I saw the page you shared. And the only thing I can find is the option --exclude-positions-overlap. Can you guide me more precisely?

ADD REPLYlink written 2.2 years ago by mostafarafiepour110

In the first link you have different tools, as vcf-isec

Creates intersections and complements of two or more VCF files. Given multiple VCF files, it can output the list of positions which are shared by at least N files, at most N files, exactly N files, etc

See also : Obtain one vcf file of shared SNPs from input files with different samples using vcf-isec (vcftools)

ADD REPLYlink written 2.2 years ago by Bastien Hervé4.9k
1
gravatar for finswimmer
2.2 years ago by
finswimmer14k
Germany
finswimmer14k wrote:

Assuming the Samples 1-10 are from population 1, 11-20 from population 2 and 21-30 from population 3, you can filter for variant sites, where at least one sample in each population has a non-reference allel, using bcftools view like this:

$ bcftools view -i 'GT[0-9]="alt" && GT[10-19]="alt" && GT[20-29]="alt"' input.vcf

fin swimmer

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by finswimmer14k

I run your script, but I encountered this error.

[filter.c:1379 filters_init1] Error: the tag "INFO/GT[1" is not defined in the VCF header

this is header my vcf file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  EAZ01   EAZ02   EAZ03   EAZ04   EAZ06   EAZ08   EAZ20   EAZ21   EAZ25   EAZ26   GIL01   GIL03   GIL04   GIL05   GIL06   GIL08   GIL11 GIL14    GIL15   GIL17   KHU01   KHU02   KHU04   KHU05   KHU06   KHU07   KHU09   KHU12   KHU15   KHU17   MAZ01   MAZ02   MAZ03   MAZ04   MAZ05   MAZ08   MAZ10   MAZ12   MAZ13   MAZ19   WAZ01   WAZ02   WAZ03 WAZ04    WAZ05   WAZ06   WAZ07   WAZ08   WAZ12   WAZ16
ADD REPLYlink written 2.2 years ago by mostafarafiepour110

Which version are you using? The syntax I described was introduced in v1.8. Current release is v1.9.

ADD REPLYlink written 2.2 years ago by finswimmer14k

I use bcftools/1.3.1 version. I only have this version on the server.

ADD REPLYlink written 2.2 years ago by mostafarafiepour110
2

Have a look at bioconda. This enables you to install the newest version (and a lot of other bioinformatic tools) even without root permissions. Also the first part of this tutorial I'v written might be useful for you.

fin swimmer

ADD REPLYlink written 2.2 years ago by finswimmer14k

I installed the suggested version, but when I run the script, it does not give me the output. And it only displays the content of my VCF file in a graphical and fast way. My expectation is that when i run the script, i will give the number of common PSNs between 5 populations.

i code run:

bcftools view -i 'GT[1-10]="alt" && GT[11-20]="alt" && GT[21-30]="alt" && GT[31-40]="alt" && GT[41-50]="alt"' Final.vcf
ADD REPLYlink written 2.2 years ago by mostafarafiepour110
1

What you see is the filtered vcf. One could save this output to a new file using > file.vcf at the end of the command.

If you don't want this and you are only interested in the number of variants you can count the lines excluding the header lines.

$ bcftools view -i 'GT[0-9]="alt" && GT[10-19]="alt" && GT[20-29]="alt" && GT[30-39]="alt" && GT[40-50]="alt"' Final.vcf|grep -v "^#"|wc -l

fin swimmer

PS: Be aware of the counting of the sample numbers. It's starting from 0!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by finswimmer14k

This code works well but with a small change.

$ bcftools view -i 'GT[0-9]="alt" && GT[10-19]="alt" && GT[20-29]="alt" && GT[30-39]="alt" && GT[40-49]="alt"' Final.vcf|grep -v "^#"|wc -l
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by mostafarafiepour110
0
gravatar for Pierre Lindenbaum
2.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

using vcffilterjdk http://lindenb.github.io/jvarkit/VcfFilterJdk.html

an example with 3 populations

select.code:

final String POP1[]={"S1"};
final String POP2[]={"S2","S3"};
final String POP3[]={"S4","S5"};

final String ALL_POPS[][]={POP1,POP2,POP3};

for(String pop[]:ALL_POPS) {
    if(Arrays.stream(pop).
        map(SAMPLE->variant.getGenotype(SAMPLE)).
        noneMatch(G->G.isHet() || G.isHomVar())) return false;
    }
return true;

usage:

java -jar dist/vcffilterjdk.jar -f select.code input.vcf
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum134k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2743 users visited in the last hour
_