Question: PyVCF raises 'AttributeError' when extracting values from the FORMAT and SAMPLE columns.
9 months ago by
United States
kirannbishwa01570 wrote:

I have a VCF file with the following data structure:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Problem: The number of fields in the FORMAT column (the 9th column, i.e. index 8 with Python's 0-based indexing) isn't the same for all lines.

I want to read this file and extract the values of specific tags such as GT, PI and PG. However, not all of these tags are present on every line; when a tag is missing, I just want its value to default to '.'.

So, the output file would have the following structure:

contig  pos ref alt_My  freq_My  GT  PI  PG
2   1764    A   C   0.625   1|0   1020  1|0
2   1921    T   C   0.00    0/0   .   0/0
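
To make the desired defaulting concrete, here is a minimal sketch that does not use PyVCF at all: it splits one raw VCF data line, pairs the FORMAT keys with each sample's values, and falls back to '.' for any tag that is absent on that line. The function name `extract_tags` and its parameters are hypothetical, chosen only for illustration, and the sketch assumes tab-separated columns as in a standard VCF.

```python
# Sketch: pull selected FORMAT tags from one raw VCF data line,
# falling back to '.' when a tag is absent on that line.

def extract_tags(vcf_line, tags=("GT", "PI", "PG"), default="."):
    """Return one list of tag values per sample in a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")      # FORMAT is the 9th column (index 8)
    rows = []
    for sample_field in fields[9:]:         # sample columns start at index 9
        # Pair each FORMAT key with the sample value in the same position
        values = dict(zip(format_keys, sample_field.split(":")))
        rows.append([values.get(tag, default) for tag in tags])
    return rows


# The line at position 1921 has no PI tag in its FORMAT column
line = ("2\t1921\t.\tT\tC\t4465.03\tPASS\tAC=0;AF=0.00;AN=6;set=Intersection\t"
        "GT:AD:DP:GQ:PG:PL:PW\t0/0:1,0:1:3:0/0:0,3,35:0/0\t./.:0,0:0:.:./.:0,0,0:./.")
first_sample = extract_tags(line)[0]
print("\t".join(first_sample))
```

For the first sample this yields GT and PG as found on the line and '.' in place of the missing PI.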

I am using the PyVCF module to read the file and extract this information. Below is my script:

import vcf

# Open the VCF file and iterate over its records
vcf1_data = vcf.Reader(open('MY.phased_variants.Final_sub.vcf', 'r'))
for record in vcf1_data:
    contig1 = record.CHROM
    pos1 = record.POS
    ref_allele1 = record.REF
    # Join multiple ALT alleles and allele frequencies with commas
    alt_alleles1 = ",".join(map(str, record.ALT))
    alt_freq1 = ",".join(map(str, record.INFO['AF']))

Now, I write these called values to an output text file as:

    output = open("My_allele_table.txt", "a")
    output.write("{}\t{}\t{}\t{}\t{}"
                 .format(contig1, pos1, ref_allele1, alt_alleles1, alt_freq1))

Additionally, I append other values to the output file. But when doing so, I get an AttributeError because the PI field is not present on every line.

Incomplete solution: I added an exception handler for the error, but then it simply skips the rest of the line instead of writing the values that are present.

    for sample in record.samples:
        try:
            output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
        except AttributeError:
            pass  # nothing is written for this sample when a tag is missing
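
The reason everything is skipped is that a single try block wraps all three lookups, so one missing tag discards GT and PG along with PI. A possible way around this is to look up each tag on its own and substitute '.' when it is absent. The sketch below assumes that PyVCF keeps each sample's FORMAT values in a namedtuple at `sample.data` and reports missing values as None; I have not checked this against every PyVCF version. The helper `tag_values` is a hypothetical name, and the `CallData` and `Sample` tuples are stand-ins so the example runs without a VCF file.

```python
from collections import namedtuple


def tag_values(sample, tags=("GT", "PI", "PG"), default="."):
    """Look up each tag separately so one missing tag cannot hide the others."""
    values = []
    for tag in tags:
        # Assumption: sample.data is a namedtuple of this sample's FORMAT values
        value = getattr(sample.data, tag, default)
        # Treat a None value (tag present but empty) like a missing tag
        values.append(default if value is None else str(value))
    return values


# Hypothetical stand-ins for a PyVCF sample on a line without the PI tag
CallData = namedtuple("CallData", ["GT", "AD", "DP", "GQ", "PG", "PL", "PW"])
Sample = namedtuple("Sample", ["data"])
sample = Sample(data=CallData("0/0", [1, 0], 1, 3, "0/0", [0, 3, 35], "0/0"))

print("\t{}\t{}\t{}".format(*tag_values(sample)))
```

In the real loop, the formatted string would be passed to output.write() once per sample in record.samples.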

Any help is appreciated!

rna-seq gt format pyvcf vcf • 374 views

Why not simply use if sample['PI'] ... else ... ?

modified 9 months ago • written 9 months ago by geek_y8.0k

@Goutham: Could you please provide more details on the code you have in mind?

written 9 months ago by kirannbishwa01570
