I have multisample VCF contains a information about 7 breed (in sum about 1000 individuals). Their ID are: SIM001, SIM002, ..., SIM034, FLV001, FLV002, FLV034 ...
I need to obtain a information about sum SNPs for every breed (SIMXXX is a one breed, FLV is a second breed and so on...).
Have you got a information how can do that?
Thank you.
Sorry, my mistakes. I have only SNPs in my file. So I need. a total number of variants for each breeds (excluding 100% missing for breed). My mistake. Now is it clear?
Based on previous q&a on the comments, there's an assumption of non-perfect compliance to vcf specification. My guess/interpretation without seeing any example lines, the breed is labeled in each row and the OP wants a count / sub-total of variants for each breed.
So if I understand correctly, you have a VCF file with about 1000 samples that represent 7 different breeds of animal. Each sample is named in a consistent fashion, with a prefix (e.g. "FLV" or "SIM") and then a number.
You could use GATKs SelectVariant tool to subset your VCF into 7 files, one for each breed, by using either the -select_expressions or the -sample_file parameters. I'm not super awesome with regex, but perhaps the following would work:
Alternatively, it may just be easier to use the sample_file flag - make a file for each breed that has a list of the sample names, and then pass that to GATK:
it's not clear to me. Are you taking about variants ? SNPs ? genotypes, called-only genotypes ? non hom-ref genotypes ?
Sorry, my mistakes. I have only SNPs in my file. So I need. a total number of variants for each breeds (excluding 100% missing for breed). My mistake. Now is it clear?
no. The number of variants (=rows) doesn't change with the number of samples/genotypes (=columns). So the "number of variants" will not change.
hmm, how to explain it... I just need to get a information about total number of variants for every breed.
Do you mean the histogram of SNPs in VCF?