How to get sample names based on genotype from multi-sample vcf file
1
0
Entering edit mode
5.4 years ago
hellbio ▴ 440

Hi,

I have multi-sample vcf file and an example variant is shown below:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    03-071    04-051    04-071    06-044    07-085    10-009

chr1    6526093    .    T    C    197.77    .    AC1=1;AC=1;AF1=0.5    GT:GQ:DP:PL:AD    0/1    1/1    0/0    1/1    1/1    0/1 

 

For each variant i would need to retrieve the sample names based on genotype. If the genotype is "0/1"  it should output the first 5 columns and the sample names in the 6th column.

chr1    6526093    .    T    C    03-071,10-009    

If the genotype is "1/1" it should output:

chr1    6526093    .    T    C    04-051,06-044,07-085

 

The original file has >500 sample and i would need to get the output in the above format. Are there any tools which can do this to some extent and further tweaking to get the desire output format?

vcf • 2.3k views
ADD COMMENT
0
Entering edit mode
5.4 years ago

using my tool BioAlcidae

the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    for(var i = 0;i< ctx.getNSamples();++i)
        {
        var g = ctx.getGenotype(i);        
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t"+g.getSampleName());
        var alleles = g.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }
    }

the vcf:

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=249250621,assembly=hg19>
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    S1    S2    S3    S4    S5    S6    S7    S8
chr1    1    .    T    A,C    .    .    .    GT    0/0    0/1    1/1    1/2    2/2    0/2    ./.    2/1

run

java -jar dist-1.139/bioalcidae.jar -f script.js input.vcf

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7        
chr1    1    .    S8    C    A
ADD COMMENT
0
Entering edit mode

Thanks. This looks like printing each variant for each sample which would be very large output especially for WGS vcf file with 500 samples as it will print 500 rows for 1 variant. Instead, it would be helpful if it could print only the variant in single row and add additional columns fwith sample names to represent hom/het variants.

ADD REPLY
0
Entering edit mode

so the script:

while(iter.hasNext())
    {
    var ctx = iter.next();
    var samples={};
    for(var i = 0;i< ctx.getNSamples();++i)
        {

        var gi = ctx.getGenotype(i);
        if( gi.getSampleName() in samples ) continue;
        out.print(ctx.contig);
        out.print("\t");
        out.print(ctx.start);
         out.print("\t");
        out.print(ctx.getID());
        out.print("\t");
        out.print(gi.getSampleName());
        samples[gi.getSampleName()]=1;
        for(var j=i+1;j<ctx.getNSamples();++j)
            {
            var gj = ctx.getGenotype(j);
            if( !(gi.sameGenotype(gj))) continue;
            samples[gj.getSampleName()]=1;
            out.print(";"+gj.getSampleName());
            }
        var alleles = gi.getAlleles();
        for(var j=0;j< alleles.size();++j)
            {
            out.print("\t"+alleles.get(j).getDisplayString());
            }
         out.println();
        }

    }

output:

chr1    1    .    S1    T    T
chr1    1    .    S2    T    A
chr1    1    .    S3    A    A
chr1    1    .    S4;S8    A    C
chr1    1    .    S5    C    C
chr1    1    .    S6    T    C
chr1    1    .    S7
ADD REPLY
0
Entering edit mode

Thanks for you time. Could we achieve it this way.

chr1    1    .    S2,S4,S6,S8   S3,S5

The script need not bother about the multiple alternate alleles. It only has to consider 0/1 or 1/1 or 1/2 or 2/2. The above output has heterozygote sample names in 4th column and homozygote sample names in 5th column.

ADD REPLY
0
Entering edit mode

Could we just save the updated script and run or should we download the tool and install?

ADD REPLY
0
Entering edit mode

uh ??

ADD REPLY
0
Entering edit mode

this is just a simple loop. I let this as an exercise...

ADD REPLY
0
Entering edit mode

I tried to download and install but unsuccessful.

$ git clone "https://github.com/lindenb/jvarkit.git"
cd jvarkit/
make bioalcidae

Terminated with the below error:

(cd /Users/aru/Downloads/jvarkit/htsjdk-1.139 && ant )
/bin/bash: ant: command not found
make: *** [/Users/aru/Downloads/jvarkit/htsjdk-1.139/dist/htsjdk-1.139.jar] Error 127

Could you help to work through the tool.

ADD REPLY

Login before adding your answer.

Traffic: 2403 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6