Question: group samples based on shared mutations in a single multi samples vcf file
2
gravatar for ahmedakhokhar
5 months ago by
ahmedakhokhar110
Belgium
ahmedakhokhar110 wrote:

Hi, I am learning about vcf file formatting and sorry for asking a dumb question.

I have a multi-sample (>300) vcf file and I want to group (take) samples with shared mutations/GT.

Can someone suggest a tool/command/script solve this problem?

Thanks

python shell R vcf • 319 views
ADD COMMENTlink modified 5 months ago by Pierre Lindenbaum120k • written 5 months ago by ahmedakhokhar110

FYI: Cross posted on stackexchange.

ADD REPLYlink written 5 months ago by finswimmer11k

FYI: I asked for help because I am learning and got myself stuck.

ADD REPLYlink modified 5 months ago • written 5 months ago by ahmedakhokhar110

Hello ahmedakhokhar ,

I'm sorry if you find my comment harsh. But it was just the information to other people, who might want to help you, that you asked the exact question somewhere else.

It is frustrating to write a solution and see afterwards that your problem is already solved. This is why it is necessary to tell people where you asked your question as well.

ADD REPLYlink written 5 months ago by finswimmer11k

No worries, I understand.

ADD REPLYlink written 5 months ago by ahmedakhokhar110

what is group(take) ? what is "shared mutations/GT." ?

I don't understand. Please provide a minimal example of input+output.

ADD REPLYlink written 5 months ago by Pierre Lindenbaum120k

Explanation

shared mutations = the mutation positions present (shared) among different samples since not all mutations are present in all the samples. And, get an output file with a list of samples (name and count) with POS, COUNT and GT info.

DATA

"input.vcf" file is a normal vcf file with >300 samples as follow:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 ........SAMPLE300

Expected "output.txt" of a list of samples (count) with POS and GT info as follow:

POS GT COUNT SAMPLE

41230 0/1  6 SAMPLE3 SAMPLE4 SAMPLE57 SAMPLE90 SAMPLE17 SAMPLE49...

88099 1/0  4 SAMPLE13 SAMPLE40 SAMPLE5 SAMPLE101...

56737 1/1  5 SAMPLE11 SAMPLE414 SAMPLE7 SAMPLE90 SAMPLE78...

Thanks, please let me know if it's still not clear.

ADD REPLYlink modified 5 months ago by Pierre Lindenbaum120k • written 5 months ago by ahmedakhokhar110

what if some samples are 0/1 and 1/1 on the same variant ?

it sounds like a xyz problem... what's your final aim ?

ADD REPLYlink written 5 months ago by Pierre Lindenbaum120k

samples with 0/1 and 1/1 for the same positions with go as a separate entry but if I can get info of 1/1 only, that would be sufficient.

The final aim is to know which samples share which mutation (position) and to separate them into groups.

ADD REPLYlink written 5 months ago by ahmedakhokhar110
1
gravatar for finswimmer
5 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

I dont't know if this is exactly what you want.

For each genotype you are interested run this:

$ bcftools query -i 'GT="0/1"' -f '%CHROM\t%POS\t0/1\t%REF\t%ALT\t[%SAMPLE,]\n' input.vcf|sed 's/,$//'|awk -v OFS="\t" '{counts = split($6, samples, ","); print $1, $2, $3, counts, $4, $5, $6}' > stats_01.csv

In this example bcftools query looks for samples with genotype 0/1 and print out CHROM, POS, 0/1, REF, ALT and a comma seperated list of sample names.

sed removes the additional , at the end. awk counts the number of sample names in each row and add this column.

fin swimmer

ADD COMMENTlink modified 5 months ago • written 5 months ago by finswimmer11k

Perfect, that can work nicely, one more piece of help. The sample names start with either "A_" or "B_" (followed by individual names). How can I count the sub-population of samples? Thank you very much.

ADD REPLYlink written 5 months ago by ahmedakhokhar110
1
gravatar for Pierre Lindenbaum
5 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

 java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{V.getGenotypes().stream().filter(G->G.isCalled() && !G.isHomRef()).collect(Collectors.groupingBy(G->G.getType(),Collectors.mapping(G->G.getSampleName(),Collectors.toSet()))).forEach((X,Y)->println(V.getContig()+" "+V.getStart()+" "+X+" "+String.join(";",Y)));});' input.vcf

RF01 970 HOM_VAR S5
RF02 251 HET S3;S2
RF02 578 HOM_VAR S4
RF02 877 HET S1
RF02 1726 HET S3;S2
RF02 1962 HET S1
RF03 1221 HOM_VAR S3;S2
RF03 1242 HOM_VAR S4
RF03 1688 HOM_VAR S5
RF03 1708 HOM_VAR S5
RF03 2150 HET S1
RF03 2201 HET S3;S2
RF03 2315 HET S3;S4;S2
RF03 2573 HOM_VAR S3;S2
RF04 887 HET S1
RF04 991 HET S4
RF04 1241 HOM_VAR S4
RF04 1259 HET S4
RF04 1857 HET S1
RF04 1900 HOM_VAR S5
RF04 1920 HOM_VAR S1
RF05 41 HOM_VAR S4
RF05 499 HOM_VAR S3;S2
RF05 795 HET S4
RF05 879 HOM_VAR S3;S2
RF05 1297 HOM_VAR S3;S2
RF05 1339 HET S3;S2
RF06 517 HET S1
RF06 543 HOM_VAR S3;S2
RF06 668 HOM_VAR S5
RF06 695 HET S4
RF06 1129 HOM_VAR S4
RF07 98 HET S3;S2
RF07 225 HOM_VAR S3;S2
(...)
ADD COMMENTlink modified 5 months ago • written 5 months ago by Pierre Lindenbaum120k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1633 users visited in the last hour