Question: Number of SNPs in VCF
bwczech wrote 11 weeks ago:

Hey,

I have a multisample VCF containing information about 7 breeds (about 1000 individuals in total). The sample IDs are: SIM001, SIM002, ..., SIM034, FLV001, FLV002, ..., FLV034, ...

I need the number of SNPs for every breed (SIMXXX is one breed, FLVXXX is a second breed, and so on).

Do you know how I can do that? Thank you.

snp vcftools • 254 views
modified 11 weeks ago by Russ • written 11 weeks ago by bwczech

"... about sum SNPs ... "

It's not clear to me. Are you talking about variants? SNPs? Genotypes? Called-only genotypes? Non hom-ref genotypes?

written 11 weeks ago by Pierre Lindenbaum

Sorry, my mistake. I have only SNPs in my file, so I need the total number of variants for each breed (excluding variants that are 100% missing in that breed). Now is it clear?

written 11 weeks ago by bwczech

"Now is it clear?"

no. The number of variants (=rows) doesn't change with the number of samples/genotypes (=columns). So the "number of variants" will not change.

modified 11 weeks ago • written 11 weeks ago by Pierre Lindenbaum

Hmm, how to explain it... I just need the total number of variants for every breed.

written 11 weeks ago by bwczech

Do you mean the histogram of SNPs in VCF?

written 11 weeks ago by genebow120
dyollluap (USA, California, Bay Area) wrote 11 weeks ago:

I think the easiest option is to grep each breed and output the line count:

grep 'SIM001' mymultisample.vcf | wc -l

If you have a list of the breeds in your multisample VCF, use a for loop (the variable must be in double quotes so the shell expands it):

for breed in $(<listofbreeds.txt); do grep "$breed" mymultisample.vcf | wc -l; done
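
If the file is a standard VCF, the breed list itself could be derived from the sample IDs on the #CHROM header line. The following is only a sketch: the function name list_breeds is made up for illustration, and it assumes every sample ID is a breed prefix followed by digits (SIM001, FLV034, ...), as described in the question.

```shell
# list_breeds VCF
# Prints one breed prefix per line, taken from the sample columns (10 onward)
# of the #CHROM header line. Assumes IDs look like <prefix><digits>.
list_breeds() {
    awk -F'\t' '
        /^#CHROM/ {
            for (i = 10; i <= NF; i++) {
                id = $i
                sub(/[0-9]+$/, "", id)   # drop the trailing sample number
                print id
            }
            exit                          # nothing else is needed from the file
        }
    ' "$1" | sort -u
}
```

For example, list_breeds mymultisample.vcf > listofbreeds.txt would produce the list used by the loop.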
written 11 weeks ago by dyollluap

I don't understand your command: the word SIM001 will only appear in the #CHROM header line, as a sample name, won't it?

written 11 weeks ago by Pierre Lindenbaum

Based on the previous Q&A in the comments, I'm assuming the file does not fully comply with the VCF specification. My guess, without seeing any example lines, is that the breed is labeled in each row and the OP wants a count/sub-total of variants for each breed.

written 11 weeks ago by dyollluap
Russ (Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada) wrote 11 weeks ago:

So if I understand correctly, you have a VCF file with about 1000 samples that represent 7 different breeds of animal. Each sample is named in a consistent fashion, with a prefix (e.g. "FLV" or "SIM") and then a number.

You could use GATK's SelectVariants tool to subset your VCF into 7 files, one for each breed, by using either the --sample_expressions (-se) or the --sample_file (-sf) parameter. I'm not super awesome with regex, but perhaps the following would work:

java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -se 'SIM.+' -o sim.vcf

Alternatively, it may just be easier to use the sample_file flag - make a file for each breed that has a list of the sample names, and then pass that to GATK:

 java -jar GenomeAnalysisTK.jar -R <reference.fa> -T SelectVariants -V <vcf_file> -sf <list_of_all_SIM_samples.txt> -o sim.vcf

And then just repeat for each breed.
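
Subsetting by sample keeps every row by default, so a plain record count would be the same for each breed. To get the number the OP described (variants that are not 100% missing within a breed), something like the awk sketch below might work, either on a per-breed file or directly on the multisample VCF. The function name count_breed_sites is hypothetical, and the sketch assumes a tab-delimited VCF in which GT is the first FORMAT field and a breed's samples share an ID prefix.

```shell
# count_breed_sites PREFIX VCF
# Prints the number of variant rows in which at least one sample whose ID
# starts with PREFIX has a non-missing genotype (hom-ref calls count too).
count_breed_sites() {
    awk -F'\t' -v prefix="$1" '
        /^##/ { next }                          # skip meta-information lines
        /^#CHROM/ {                             # header: remember this breed'"'"'s columns
            for (i = 10; i <= NF; i++)
                if (index($i, prefix) == 1) cols[++n] = i
            next
        }
        {
            for (j = 1; j <= n; j++) {
                split($(cols[j]), f, ":")       # assumed: GT is the first subfield
                gt = f[1]
                gsub("[./|]", "", gt)           # remove missing alleles and separators
                if (gt != "") { count++; break }  # one called sample is enough
            }
        }
        END { print count + 0 }
    ' "$2"
}
```

A loop such as: for breed in SIM FLV; do printf '%s\t' "$breed"; count_breed_sites "$breed" mymultisample.vcf; done would then print one count per breed.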

modified 11 weeks ago • written 11 weeks ago by Russ