Hi, I am learning about vcf file formatting and sorry for asking a dumb question.
I have a multi-sample (>300) vcf file and I want to group (take) samples with shared mutations/GT.
Can someone suggest a tool/command/script solve this problem?
Thanks
Hi, I am learning about vcf file formatting and sorry for asking a dumb question.
I have a multi-sample (>300) vcf file and I want to group (take) samples with shared mutations/GT.
Can someone suggest a tool/command/script solve this problem?
Thanks
I dont't know if this is exactly what you want.
For each genotype you are interested run this:
$ bcftools query -i 'GT="0/1"' -f '%CHROM\t%POS\t0/1\t%REF\t%ALT\t[%SAMPLE,]\n' input.vcf|sed 's/,$//'|awk -v OFS="\t" '{counts = split($6, samples, ","); print $1, $2, $3, counts, $4, $5, $6}' > stats_01.csv
In this example bcftools query looks for samples with genotype 0/1 and print out CHROM, POS, 0/1, REF, ALT and a comma seperated list of sample names.
sed removes the additional , at the end. awk counts the number of sample names in each row and add this column.
fin swimmer
using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html
java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{V.getGenotypes().stream().filter(G->G.isCalled() && !G.isHomRef()).collect(Collectors.groupingBy(G->G.getType(),Collectors.mapping(G->G.getSampleName(),Collectors.toSet()))).forEach((X,Y)->println(V.getContig()+" "+V.getStart()+" "+X+" "+String.join(";",Y)));});' input.vcf
RF01 970 HOM_VAR S5
RF02 251 HET S3;S2
RF02 578 HOM_VAR S4
RF02 877 HET S1
RF02 1726 HET S3;S2
RF02 1962 HET S1
RF03 1221 HOM_VAR S3;S2
RF03 1242 HOM_VAR S4
RF03 1688 HOM_VAR S5
RF03 1708 HOM_VAR S5
RF03 2150 HET S1
RF03 2201 HET S3;S2
RF03 2315 HET S3;S4;S2
RF03 2573 HOM_VAR S3;S2
RF04 887 HET S1
RF04 991 HET S4
RF04 1241 HOM_VAR S4
RF04 1259 HET S4
RF04 1857 HET S1
RF04 1900 HOM_VAR S5
RF04 1920 HOM_VAR S1
RF05 41 HOM_VAR S4
RF05 499 HOM_VAR S3;S2
RF05 795 HET S4
RF05 879 HOM_VAR S3;S2
RF05 1297 HOM_VAR S3;S2
RF05 1339 HET S3;S2
RF06 517 HET S1
RF06 543 HOM_VAR S3;S2
RF06 668 HOM_VAR S5
RF06 695 HET S4
RF06 1129 HOM_VAR S4
RF07 98 HET S3;S2
RF07 225 HOM_VAR S3;S2
(...)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
FYI: Cross posted on stackexchange.
FYI: I asked for help because I am learning and got myself stuck.
Hello ahmedakhokhar ,
I'm sorry if you find my comment harsh. But it was just the information to other people, who might want to help you, that you asked the exact question somewhere else.
It is frustrating to write a solution and see afterwards that your problem is already solved. This is why it is necessary to tell people where you asked your question as well.
No worries, I understand.
what is group(take) ? what is "shared mutations/GT." ?
I don't understand. Please provide a minimal example of input+output.
Explanation
shared mutations = the mutation positions present (shared) among different samples since not all mutations are present in all the samples. And, get an output file with a list of samples (name and count) with POS, COUNT and GT info.
DATA
"input.vcf" file is a normal vcf file with >300 samples as follow:
Expected "output.txt" of a list of samples (count) with POS and GT info as follow:
Thanks, please let me know if it's still not clear.
what if some samples are 0/1 and 1/1 on the same variant ?
it sounds like a xyz problem... what's your final aim ?
samples with 0/1 and 1/1 for the same positions with go as a separate entry but if I can get info of 1/1 only, that would be sufficient.
The final aim is to know which samples share which mutation (position) and to separate them into groups.