Question: how to get the genotypes from vcf file.
2
gravatar for fufuyou
3.8 years ago by
fufuyou110
United States
fufuyou110 wrote:

Hi, 

I have a vcf files with mutiblesamples. I want get the genotype data from the vcf file. I do not know how to do it. Could you tell me how I can get it.

Thanks,

Fuyou

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is 

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3

Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0 

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3

Chr01        1353    .            T       C         CT              CC              CC

Thanks,

Fuyou

myposts snp • 4.7k views
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by fufuyou110
1

You're not giving us much to work with.  What genotypes? Why do you want the genotypes?

ADD REPLYlink written 3.8 years ago by Zev.Kronenberg11k

Thanks Zev,

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is 

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3

Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0 

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3

Chr01        1353    .            T       C         CT              CC              CC

Thanks,

Fuyou

ADD REPLYlink written 3.8 years ago by fufuyou110
1
gravatar for MAPK
3.8 years ago by
MAPK1.5k
United States
MAPK1.5k wrote:

I think you just want the GT column along with the AD column. I would also get all the columns including REF and ALT and make a nicely formatted tab separated table structure using R scripts. 

Take a look at this :

http://samtools.github.io/hts-specs/VCFv4.1.pdf

ADD COMMENTlink written 3.8 years ago by MAPK1.5k
1
gravatar for Manuel Landesfeind
3.8 years ago by
Göttingen, Germany
Manuel Landesfeind1.2k wrote:

See Query EXAMPLES at https://samtools.github.io/bcftools/bcftools.html#query

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Manuel Landesfeind1.2k
1
gravatar for Dan Gaston
3.8 years ago by
Dan Gaston7.1k
Canada
Dan Gaston7.1k wrote:

There are probably about a 1001 ways to do this, so which approach works best for you depends on what you want to do with the genotypes. You can use bcftools as suggested by Manuel (or vcftools, but bcftools is much faster. However output format can differ so sometimes one is preferred over the other depending on the desired format). If you want to do complex analyses with the genotypes you may want to write a script in python for instance, in which case PyVCF or the much faster cython implementation cyvcf would be really useful. There is also cyvcf2, which is even faster and written around htslib but has a very specific use-case it was designed for so it might not suit your purposes. And of course, because the data is columnar, you can always parse out the genotype columns using standard linux command line tools, parse the fields, and do something. But I typically recommend some of the specialized tools and coding modules that have been written, as they are much easier typically.

 

Update:

Using PyVCF or cyVCF in python you could do something like this:

 

import vcf
with open('output.txt', 'w') as output:
    vcf_reader = vcf.Reader(open('input.vcf', 'r'))
    for record in vcf_reader:
        output.write("{}\t{}\t{}\t{}\t{}\t{}".format(record.CHROM, record.POS. record.ID,                                                        record.REF, record.ALT))
        for sample in record.samples:
             output.write("\t{}".format(sample['GT']))
         output.write("\n")

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Dan Gaston7.1k

@Dan Gaston:

This answer works well for one of the problem I have.

But, if say there is a PG field in FORMAT and it corresponding values in SAMPLE field. And, this PG isn't represented in everyline of the vcf how do I capture format(sample['PG']

I am getting following error:

AttributeError: 'CallData' object has no attribute 'PI'

How, do I invoke an exception and force the PI value to be period '.' when PI tag is missing?

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by kirannbishwa011.1k

This is a standard thing to do in Python. You can wrap your statement in a Try

try:
   format(sample['PG'])
except AttributeError:
   #do something
ADD REPLYlink written 2.9 years ago by Dan Gaston7.1k

I was able to use except AttributeError: but the problem is when there is another tag which gives IndexError for some lines too. That way I have to make a combinations of except almost 4 times with different condition. And, if there are more tags that are missing in some parts of the vcf it would be unmanageable.

I was hoping there was a way something like this but it didn't work:

format(sample['PG'] else '.', sample ['GT'] else '.', sample['PI'] else '.') This way whereever a required tag/field is missing in FORMAT and SAMPLE field it would add it as default period (.). I tried to implement this in every possible way but I would always get no attribute 'PG' in _Call something something...

Problem 2: Another problem is getting the sample name and gt_bases in a for-loop. For sample in record.samples: called_name = record.genotype('sample_name') allele_base = called_name.gt_bases

Here sample_name should be given exclusively. Using sample['sample'] wouldn't work, and I would receive an errror.

So, I used:

    For sample in record.samples:
            called_name = (str(sample).split(' ')[0]).split('=')[1].strip(',')

This method would call the name in the for loop and let me print/write the value of allele-bases.

            allele_base = called_name.gt_bases

Is there a method more efficient for the above purpose in pyVCF module. I read through the documentation but couldn't fine one.

ADD REPLYlink written 2.9 years ago by kirannbishwa011.1k
1

You can catch multiple exception types in a single exception. You can also catch ANY exception with just a bare except clause, but that is not considered to be good programming practice as any error will be caught by that exception block. But as you said, you are then still left with the problem of using a bunch of logic to figure out what needs to be replaced. Have you tried the or operator? I often do things like:

var = ",".join(list_of_vars) or None

to deal with things like missing lists or objects with nothing in a particular field. I suspect it will work here in your format strong and with using periods. Worth trying.

For your second problem record.samples is a list. You can also use the genotype() method to get a genotype for a specific sample using a name.

ADD REPLYlink written 2.9 years ago by Dan Gaston7.1k

I will try the proposed solution for 1st problem.

Regarding 2nd problem, I had already tried what you suggested, but there's is a problem with the method. See this link: How to call genotype bases for all the samples?

ADD REPLYlink written 2.9 years ago by kirannbishwa011.1k

Adding an answer in that question

ADD REPLYlink written 2.9 years ago by Dan Gaston7.1k
0
gravatar for fufuyou
3.8 years ago by
fufuyou110
United States
fufuyou110 wrote:

Thank you everybody.

It is my fault I did not say clearly what I want to do.

For example, I have get my vcf file is 

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1   sample 2      sample3

Chr01   1353    .       T       C       2385.77 PASS    FS=0;MQ=51.64;MQ0=0;QD=29.02;SOR=0.818;BaseQRankSum=1.468;ClippingRankSum=-0.886;MQRankSum=-2.172;ReadPosRankSum=-1.03;DP=666;AF=1;MLEAC=2;MLEAF=1;AN=70;AC=67  GT:AD:DP:GQ:PL  1/1:0,15:15:48:707,48,0 0/1:36,28:64:99:1147,0,11882   1/1:0,14:14:42:630,42,0 1/1:0,29:29:87:1299,87,0 

I want to get a file like as:

#CHROM  POS     ID      REF     ALT    sample1   sample 2      sample3

Chr01        1353    .            T       C         CT              CC              CC

Thanks,

Fuyou

 

ADD COMMENTlink written 3.8 years ago by fufuyou110
1

It is better if you include this as an Update to your original post and not add it as an answer (you can hit the edit button to edit your original post and add to the bottom, typically after putting in that it is an Update that follows). All you are doing is stripping out some fields. You can use linux tools to cut out columns from tabular data. Any of the programming options I mentioned in my post will also let you do this by iterating over the VCF records and only outputting the desired fields to a new output file.

ADD REPLYlink written 3.8 years ago by Dan Gaston7.1k

Thank you very much!

ADD REPLYlink written 3.8 years ago by fufuyou110

If my answer solves your problem please upvote and accept as the answer. Much appreciated

ADD REPLYlink written 3.8 years ago by Dan Gaston7.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 918 users visited in the last hour