Question: How to get sample names based on genotype from multi-sample vcf file
0
gravatar for hellbio
3.9 years ago by
hellbio370
hellbio370 wrote:

Hi,

I have multi-sample vcf file and an example variant is shown below:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    03-071    04-051    04-071    06-044    07-085    10-009

chr1    6526093    .    T    C    197.77    .    AC1=1;AC=1;AF1=0.5    GT:GQ:DP:PL:AD    0/1    1/1    0/0    1/1    1/1    0/1 

 

For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1"  it should output the first 5 columns and the sample names in the 6th column.

chr1    6526093    .    T    C    03-071,10-009    

If the genotype is "1/1" it should output:

chr1    6526093    .    T    C    04-051,06-044,07-085

 

The original file has >500 sample and i would need to get the output in the above format. Are there any tools which can do this to some extent and further tweaking to get the desire output format?

vcf • 1.7k views
ADD COMMENTlink modified 3.9 years ago by Pierre Lindenbaum123k • written 3.9 years ago by hellbio370
0
gravatar for Pierre Lindenbaum
3.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

using my tool BioAlcidae https://github.com/lindenb/jvarkit/wiki/BioAlcidae

 

the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    for(var i = 0;i< ctx.getNSamples();++i)
        {
        var g = ctx.getGenotype(i);        
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t"+g.getSampleName());
        var alleles = g.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }
    }

 

the vcf:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=249250621,assembly=hg19>
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    S1    S2    S3    S4    S5    S6    S7    S8
chr1    1    .    T    A,C    .    .    .    GT    0/0    0/1    1/1    1/2    2/2    0/2    ./.    2/1

 

run

java -jar dist-1.139/bioalcidae.jar -f script.js input.vcf

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7        
chr1    1    .    S8    C    A

 

 

ADD COMMENTlink written 3.9 years ago by Pierre Lindenbaum123k

Thanks. This looks like printing each variant for each sample which would be very large output especially for WGS vcf file with 500 samples as it will print 500 rows for 1 variant. Instead, it would be helpful if it could print only the variant in single row and add additional columns fwith sample names to represent  hom/het variants.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by hellbio370

so the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    var samples={};
    for(var i = 0;i< ctx.getNSamples();++i)
        {
        
        var gi = ctx.getGenotype(i);
        if( gi.getSampleName() in samples ) continue;
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t");
        out.print(gi.getSampleName());
        samples[gi.getSampleName()]=1;
        for(var j=i+1;j<ctx.getNSamples();++j)
            {
            var gj = ctx.getGenotype(j);
            if( !(gi.sameGenotype(gj))) continue;
            samples[gj.getSampleName()]=1;
            out.print(";"+gj.getSampleName());
            }
        var alleles = gi.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }
 
    }

output:

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4;S8    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7        

 

 

 

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum123k

Thanks for you time. Could we achieve it this way.

chr1    1    .    S2,S4,S6,S8   S3,S5 

The script need not bother about the multiple alternate alleles. It only has to consider 0/1 or 1/1 or 1/2 or 2/2. The above output has heterozygote sample names in 4th column and homozygote sample names in 5th column.

ADD REPLYlink written 3.9 years ago by hellbio370

Could we just save the updated script and run or should we download the tool and install?

ADD REPLYlink written 3.9 years ago by hellbio370

uh ??             .

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum123k

this is just a simple loop. I let this as an exercise...

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum123k

I tried to download and install but unsuccessful.

$ git clone "https://github.com/lindenb/jvarkit.git"

cd jvarkit/

make bioalcidae

Terminated with the below error:

(cd /Users/aru/Downloads/jvarkit/htsjdk-1.139 && ant )
/bin/bash: ant: command not found
make: *** [/Users/aru/Downloads/jvarkit/htsjdk-1.139/dist/htsjdk-1.139.jar] Error 127

Could you help to work through the tool.

ADD REPLYlink written 3.9 years ago by hellbio370

ant is required https://github.com/lindenb/jvarkit/wiki/Compilation#requirements--dependencies

ADD REPLYlink written 3.9 years ago by Pierre Lindenbaum123k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1249 users visited in the last hour