Question: Number of SNPs in VCF
bwczech wrote 7 days ago:

Hey,

I have a multisample VCF containing information about 7 breeds (about 1000 individuals in total). Their IDs are: SIM001, SIM002, ..., SIM034, FLV001, FLV002, FLV034 ...

I need to obtain the total number of SNPs for every breed (SIMXXX is one breed, FLVXXX is a second breed, and so on).

Do you have any idea how I can do that? Thank you.

snp vcftools • 184 views

"... about sum SNPs ... "

It's not clear to me. Are you talking about variants? SNPs? Genotypes? Called-only genotypes? Non-hom-ref genotypes?

written 7 days ago by Pierre Lindenbaum

Sorry, my mistake. I have only SNPs in my file, so I need the total number of variants for each breed (excluding variants that are 100% missing within that breed). Now is it clear?

written 7 days ago by bwczech

Now is it clear?

No. The number of variants (= rows) doesn't change with the number of samples/genotypes (= columns), so the "number of variants" will not change.

modified 7 days ago • written 7 days ago by Pierre Lindenbaum

Hmm, how to explain it... I just need the total number of variants for every breed.

written 7 days ago by bwczech

Do you mean a histogram of the SNPs in the VCF?

written 6 days ago by Genebow
dyollluap (USA, California, Bay Area) wrote 5 days ago:

I think the easiest option is to grep for each breed and output the count:

grep 'SIM001' mymultisample.vcf | wc -l

If you have a list of the breeds in your multisample VCF (one per line in a file), use a for loop:

for breed in $(<listofbreeds.txt); do grep "$breed" mymultisample.vcf | wc -l; done

I don't understand your command. The word SIM001 will only appear in the #CHROM header line, as a sample name, won't it?

written 5 days ago by Pierre Lindenbaum

Based on the previous Q&A in the comments, I'm assuming the file doesn't strictly comply with the VCF specification. My guess, without seeing any example lines, is that the breed is labeled in each row and the OP wants a count (sub-total) of variants for each breed.

written 5 days ago by dyollluap
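
One way to settle which reading is right would be to look at the file itself. The sketch below is only an illustration: it builds a tiny made-up VCF (the file and its contents are hypothetical) and assumes the breed is the sample ID with its trailing digits removed. It counts samples per breed from the #CHROM line, then counts data rows that mention a sample ID, which should be 0 if the file follows the VCF specification.

```shell
# Hypothetical toy VCF so the example runs on its own; point $vcf at your file instead.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSIM001\tSIM002\tFLV001\n'
  printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t./.\t./.\n'
} > "$vcf"

# Sample IDs are columns 10 onwards of the #CHROM line.
# Assumption: breed = sample ID with trailing digits stripped (SIM001 -> SIM).
samples_per_breed=$(awk -F'\t' '
    /^#CHROM/ {
        for (i = 10; i <= NF; i++) { b = $i; sub(/[0-9]+$/, "", b); n[b]++ }
        for (b in n) print b, n[b]
        exit
    }' "$vcf" | sort)
echo "$samples_per_breed"

# Count data rows (non-header lines) that contain the sample ID SIM001.
rows_with_id=$(awk '!/^#/ && /SIM001/ { n++ } END { print n + 0 }' "$vcf")
echo "$rows_with_id"
```

If the second number is 0, grepping the data rows for a sample ID cannot give per-breed counts, and the genotype columns have to be inspected instead.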
Russ (Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada) wrote 5 days ago:

So if I understand correctly, you have a VCF file with about 1000 samples that represent 7 different breeds of animal. Each sample is named in a consistent fashion, with a prefix (e.g. "FLV" or "SIM") and then a number.

You could use GATK's SelectVariants tool to subset your VCF into 7 files, one for each breed, by using either the -sample_expressions (-se) or the -sample_file (-sf) parameter. I'm not super awesome with regex, but perhaps the following would work:

java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -se 'SIM.+' -o sim.vcf

Alternatively, it may just be easier to use the -sample_file flag: make a file for each breed that lists its sample names, one per line, and then pass that to GATK:

java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -sf <list_of_all_SIM_samples.txt> -o sim.vcf
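
The per-breed sample lists don't have to be typed by hand; they could be pulled from the #CHROM header line. This is a rough sketch, not a tested pipeline: the toy VCF, the output directory and the `<breed>.samples.txt` file names are all made up for illustration, and it assumes the breed is the sample ID minus its trailing digits.

```shell
# Hypothetical toy input and output directory; replace with your own paths.
vcf=$(mktemp)
outdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSIM001\tSIM002\tFLV001\n'
} > "$vcf"

# Write each sample ID (column 10 onwards) to a file named after its breed prefix.
# Assumption: breed = sample ID with trailing digits stripped.
awk -F'\t' -v dir="$outdir" '
    /^#CHROM/ {
        for (i = 10; i <= NF; i++) {
            b = $i
            sub(/[0-9]+$/, "", b)
            print $i > (dir "/" b ".samples.txt")
        }
        exit
    }' "$vcf"

ls "$outdir"
```

Each resulting file holds one sample name per line and can then be passed to -sf.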

And then just repeat for each breed.
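
If all that is needed is the count per breed, excluding sites where every sample of that breed is missing, a single awk pass might be enough without splitting the file. Treat this as a sketch under assumptions: the toy VCF is made up, the breed is taken to be the sample ID minus its trailing digits, GT is assumed to be the first FORMAT field, and a genotype counts as "called" if it contains at least one allele digit.

```shell
# Hypothetical toy VCF: 3 sites, 2 SIM samples, 2 FLV samples.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSIM001\tSIM002\tFLV001\tFLV002\n'
  printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t./.\t./.\t./.\n'
  printf '1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/0\t1/1\t0/1\t./.\n'
  printf '1\t300\t.\tG\tA\t.\t.\t.\tGT\t0/0\t./.\t1/1\t0/0\n'
} > "$vcf"

counts=$(awk -F'\t' '
    /^##/ { next }                          # skip meta-information lines
    /^#CHROM/ {                             # map each sample column to its breed
        for (i = 10; i <= NF; i++) { b = $i; sub(/[0-9]+$/, "", b); breed[i] = b }
        ncol = NF
        next
    }
    {
        split("", seen)                     # clear the per-site flags
        for (i = 10; i <= ncol; i++) {
            split($i, f, ":")               # f[1] is the GT subfield
            if (f[1] ~ /[0-9]/) seen[breed[i]] = 1   # at least one allele called
        }
        for (b in seen) total[b]++          # site counts once per breed with a call
    }
    END { for (b in total) print b, total[b] }' "$vcf" | sort)
echo "$counts"
```

Note that this counts every site with at least one called genotype in the breed, including sites where the breed is entirely hom-ref; counting only sites that are polymorphic within a breed would need a different test on the alleles.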
