Question: How to extract FORMAT information (GT, PG, PI, etc) from each SAMPLE in a vcf file using pyvcf in python?
0
gravatar for kirannbishwa01
10 months ago by
United States
kirannbishwa01590 wrote:

I want to extract the GT, PG and PI information from all the samples of a vcf. I have python script and reading throught pyVCF module I can extract information from one sample but not from all. And, I am looking for the solution scrictly in python using pyvcf. I can probably read the file (vcf) as txt and parse through each line but I am hoping that pyVCF can be used (which I prefer).

Below is the structure of my vcf file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Below is the python script I wrote:

vcf_file = vcf.Reader(open('RNAseq.phased_variants.Final_sub.vcf', 'r'))
    for record in vcf_file:
    contig = record.CHROM
    pos = record.POS
    ref_allele = record.REF
    alt_alleles = ",".join(map(str, (record.ALT[::])))
    genotype_2ms02g = record.genotype('2ms02g')['GT']
    phased_2ms02g = record.genotype('2ms02g')['PG']
    PI_2ms02g = record.genotype('2ms02g')['PI']
    call_2ms02g = record.genotype('2ms02g') # calling the record values of the sample 2ms02g
    gt_bases_2ms02g = call_2ms02g.gt_bases   #This reports the genotype as letters (nucleotide code)

Problem #1:

I can repeat the procedure for other samples as:

    genotype_2ms01e = record.genotype('2ms01e')['GT']
    phased_2ms01e = record.genotype('2ms01e')['PG']
    PI_2ms01e = record.genotype('2ms01e')['PI']
    call_2ms01e = record.genotype('2ms01e') 
    gt_bases_2ms01e = call_2ms01e.gt_bases

But, I want to call these values from all the samples and write it as a table. vcf file may have many samples and I want to call the values from all the samples without doing so for each sample separately.

Finally I want to write the variables as tables

    output = open("2ms02gVcf_to_table-Markov.txt", "a")
    output.write(contig + '\t' + str(pos) + '\t' + ref_allele + '\t' + str(alt_alleles) + '\t' + genotype_2ms01e + '\t' + phased_2ms01e +'\t' + str(PI_2ms01e) + '\t'+ gt_bases_2ms01e + '\t' + same for other samples........ +'\n')
    output.close()

Problem #2:

Not, all sites (or lines) have PG and PI values. I wanted to throw an exception error using (by adding below script) but its not working.

    try:
        phased_genotype = record.genotype('2ms02g')['PG']
        phase_block_index = record.genotype('2ms02g')['PI']
    except IndexError:
        phased_genotype = '.'
        phase_block_index = '.'

Thank you advance !

sample format pyvcf python vcf • 549 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by kirannbishwa01590
1
gravatar for Noushin N
10 months ago by
Noushin N390
Baltimore, MD
Noushin N390 wrote:

Problem 1

It can be addressed by using python list comprehensions as follows:

If you are sure that all samples include GT field, you can use:

sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]
print >>output, '\t'.join([contig, pos, ref_allele, ','.join(alt_alleles)] + sample_genotypes)

Problem 2

In the exception clause KeyError is more appropriate here.

ADD COMMENTlink written 10 months ago by Noushin N390

Upvoted, but don't use print >> output. Use output.write()
print >> is considered old, bad synthax and should be avoided.

Alternatively: print("hi there", file=output)

See also http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python

ADD REPLYlink modified 10 months ago • written 10 months ago by WouterDeCoster23k

Thanks for the tip! I wasn't aware that this syntax is discouraged.

ADD REPLYlink written 10 months ago by Noushin N390

@Noushin: I tried using KeyError but that didn't help. Can you look into this problem: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLYlink written 10 months ago by kirannbishwa01590

I can use the above code with PG and PI. So, if these fields are missing in the line how do I handle it. Also, I have unresolved reference (i.e NameError) for sampleIDs.

ADD REPLYlink modified 10 months ago • written 10 months ago by kirannbishwa01590

Have you defined your sampleIDs variable before referencing it?

Regarding PG and PI fields, here is a fix that can handle missing values:

sample_PG = [record.genotype(sampleID).get('PG') if 'PG' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]
sample_PI = [record.genotype(sampleID).get('PI') if 'PI' in record.genotype(sampleID) else '.'  for sampleID in sampleIDs]

ADD REPLYlink written 10 months ago by Noushin N390

But, the problem with NameError still exists. NO, I have not defined sampleIDs variable any where. I recently changed the structure of my code which handles the extraction of the fields in a better way but exception error still exists. Can you please look into this: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.

ADD REPLYlink written 10 months ago by kirannbishwa01590

The way I refferenced the sampleIDs is this:

sampleIDs = record.samples[::]

Is something wrong with it?

Then I called: sample_genotypes = [record.genotype(sampleID)['GT'] for sampleID in sampleIDs]

ADD REPLYlink modified 10 months ago • written 10 months ago by kirannbishwa01590
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1104 users visited in the last hour