Question: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.
gravatar for kirannbishwa01
4.2 years ago by
United States
kirannbishwa011.3k wrote:

I have a vcf file with following data structure

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Problem: The number of fields in the FORMAT column (9th column, 8th column python based) isn't the same for all the lines.

I want to read this file and mine values from specific tags like GT, PI and PG. But, all these tags are not present in all the lines; in such cases I just want to the values to be default '.'

So, the file output would have following structure:

contig  pos ref alt_My  freq_My  GT  PI  PG
2   1764    A   C   0.250   1|0   1020  1|0   
2   1921    T   C   0.00    0/0   .   0/0

I am using pyVCF module to read the file to extract these information. Below is my script:

import vcf;
vcf1_data = vcf.Reader(open('MY.phased_variants.Final_sub.vcf', 'r'))
for record in vcf1_data:
    contig1 = record.CHROM
    pos1 = record.POS
    ref_allele1 = record.REF
    alt_alleles1 = ",".join(map(str, (record.ALT[::])))
    alt_freq1 = ",".join(map(str, record.INFO['AF'])))

Now, I write these called values to an output text file as:

    output = open("My_allele_table.txt", "a")
             .format(contig1, pos1, ref_allele1, all_alleles1, all_freq1))

Additionally, I append other values to the output file. But, when doing so I get AttributeError since PI field is not present in all the line.

Incomplete solution: I added exception to the error but then it just skips reading through the end of the line.

    for sample in record.samples:
            output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
        except AttributeError:

Any help appreciated !

rna-seq gt format pyvcf vcf • 1.4k views
ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by kirannbishwa011.3k

Why can't simply if sample['PI'] ... else ... ?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by geek_y11k

@Goutham: Can you please provide more details on your code.

ADD REPLYlink written 4.2 years ago by kirannbishwa011.3k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1981 users visited in the last hour