Question: How to extract FORMAT information (GT, PG, PI, etc) from each SAMPLE in a vcf file using pyvcf in python?
0
gravatar for kirannbishwa01
23 months ago by
United States
kirannbishwa01890 wrote:

I want to extract the GT, PG and PI information from all the samples of a vcf. I have python script and reading throught pyVCF module I can extract information from one sample but not from all. And, I am looking for the solution scrictly in python using pyvcf. I can probably read the file (vcf) as txt and parse through each line but I am hoping that pyVCF can be used (which I prefer).

Below is the structure of my vcf file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Below is the python script I wrote:

vcf_file = vcf.Reader(open('RNAseq.phased_variants.Final_sub.vcf', 'r'))
    for record in vcf_file:
    contig = record.CHROM
    pos = record.POS
    ref_allele = record.REF
    alt_alleles = ",".join(map(str, (record.ALT[::])))
    genotype_2ms02g = record.genotype('2ms02g')['GT']
    phased_2ms02g = record.genotype('2ms02g')['PG']
    PI_2ms02g = record.genotype('2ms02g')['PI']
    call_2ms02g = record.genotype('2ms02g') # calling the record values of the sample 2ms02g
    gt_bases_2ms02g = call_2ms02g.gt_bases   #This reports the genotype as letters (nucleotide code)

Problem #1:

I can repeat the procedure for other samples as:

    genotype_2ms01e = record.genotype('2ms01e')['GT']
    phased_2ms01e = record.genotype('2ms01e')['PG']
    PI_2ms01e = record.genotype('2ms01e')['PI']
    call_2ms01e = record.genotype('2ms01e') 
    gt_bases_2ms01e = call_2ms01e.gt_bases

But, I want to call these values from all the samples and write it as a table. vcf file may have many samples and I want to call the values from all the samples without doing so for each sample separately.

Finally I want to write the variables as tables

    output = open("2ms02gVcf_to_table-Markov.txt", "a")
    output.write(contig + '\t' + str(pos) + '\t' + ref_allele + '\t' + str(alt_alleles) + '\t' + genotype_2ms01e + '\t' + phased_2ms01e +'\t' + str(PI_2ms01e) + '\t'+ gt_bases_2ms01e + '\t' + same for other samples........ +'\n')
    output.close()

Problem #2:

Not, all sites (or lines) have PG and PI values. I wanted to throw an exception error using (by adding below script) but its not working.

    try:
        phased_genotype = record.genotype('2ms02g')['PG']
        phase_block_index = record.genotype('2ms02g')['PI']
    except IndexError:
        phased_genotype = '.'
        phase_block_index = '.'

Thank you advance !

sample format pyvcf python vcf • 1.3k views
ADD COMMENTlink modified 23 months ago • written 23 months ago by kirannbishwa01890
1
gravatar for Noushin N
23 months ago by
Noushin N530
Baltimore, MD
Noushin N530 wrote:

Problem 1

It can be addressed by using python list comprehensions as follows:

If you are sure that all samples include GT field, you can use:

sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]
print >>output, '\t'.join([contig, pos, ref_allele, ','.join(alt_alleles)] + sample_genotypes)

Problem 2

In the exception clause KeyError is more appropriate here.

ADD COMMENTlink written 23 months ago by Noushin N530

Upvoted, but don't use print >> output. Use output.write()
print >> is considered old, bad synthax and should be avoided.

Alternatively: print("hi there", file=output)

See also http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python

ADD REPLYlink modified 23 months ago • written 23 months ago by WouterDeCoster35k

Thanks for the tip! I wasn't aware that this syntax is discouraged.

ADD REPLYlink written 23 months ago by Noushin N530

@Noushin: I tried using KeyError but that didn't help. Can you look into this problem: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLYlink written 23 months ago by kirannbishwa01890

I can use the above code with PG and PI. So, if these fields are missing in the line how do I handle it. Also, I have unresolved reference (i.e NameError) for sampleIDs.

ADD REPLYlink modified 23 months ago • written 23 months ago by kirannbishwa01890

Have you defined your sampleIDs variable before referencing it?

Regarding PG and PI fields, here is a fix that can handle missing values:

sample_PG = [record.genotype(sampleID).get('PG') if 'PG' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]
sample_PI = [record.genotype(sampleID).get('PI') if 'PI' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]

ADD REPLYlink written 23 months ago by Noushin N530

But, the problem with NameError still exists. NO, I have not defined sampleIDs variable any where. I recently changed the structure of my code which handles the extraction of the fields in a better way but exception error still exists. Can you please look into this: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLYlink written 23 months ago by kirannbishwa01890

The way I refferenced the sampleIDs is this:

sampleIDs = record.samples[::]

Is something wrong with it?

Then I called: sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]

ADD REPLYlink modified 23 months ago • written 23 months ago by kirannbishwa01890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1229 users visited in the last hour