Question: PyVCF is giving 'AttributeError' when extracting values from FORMAT and SAMPLE column.
9 months ago by kirannbishwa01 (United States) wrote:

I have a VCF file with the following data structure:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Problem: The number of fields in the FORMAT column (the 9th column, index 8 in Python's zero-based indexing) isn't the same on every line.

I want to read this file and extract the values of specific tags such as GT, PI and PG. However, not all of these tags are present on every line; where a tag is missing I just want its value to default to '.'

So the output file would have the following structure:

contig  pos ref alt_My  freq_My  GT  PI  PG
2   1764    A   C   0.625   1|0   1020  1|0
2   1921    T   C   0.00    0/0   .   0/0
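
A minimal sketch of the defaulting behaviour described above, written against the raw VCF text rather than PyVCF so that it runs with the standard library alone. The helper names `parse_sample` and `row_for_first_sample` are hypothetical, the two data lines are copied from the example (only the first sample column is kept), and tab-separated columns are assumed.

```python
TAGS = ("GT", "PI", "PG")

def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def row_for_first_sample(line, tags=TAGS, default="."):
    """Build one output row: contig, pos, ref, alt, AF, then the requested tags."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt, info, fmt = cols[0], cols[1], cols[3], cols[4], cols[7], cols[8]
    # INFO is a ';'-separated list of KEY=VALUE pairs (flags carry no '=').
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    sample = parse_sample(fmt, cols[9])
    # dict.get supplies the default for tags that this line's FORMAT lacks.
    values = [sample.get(tag, default) for tag in tags]
    return "\t".join([chrom, pos, ref, alt, info_map.get("AF", default)] + values)

line_1764 = "\t".join([
    "2", "1764", ".", "A", "C", "51892.85", "PASS",
    "AC=5;AF=0.625;AN=8;set=Intersection",
    "GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW",
    "1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0",
])
line_1921 = "\t".join([
    "2", "1921", ".", "T", "C", "4465.03", "PASS",
    "AC=0;AF=0.00;AN=6;set=Intersection",
    "GT:AD:DP:GQ:PG:PL:PW",
    "0/0:1,0:1:3:0/0:0,3,35:0/0",
])

print(row_for_first_sample(line_1764))
print(row_for_first_sample(line_1921))
```

Because the FORMAT keys are re-read on every line, a line that lacks PI simply falls through to the default instead of raising.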

I am using the PyVCF module to read the file and extract this information. Below is my script:

import vcf

# Read the VCF with PyVCF and pull out the per-record fields.
vcf1_data = vcf.Reader(open('MY.phased_variants.Final_sub.vcf', 'r'))
for record in vcf1_data:
    contig1 = record.CHROM
    pos1 = record.POS
    ref_allele1 = record.REF
    # ALT and INFO['AF'] are lists, so join them into comma-separated strings.
    alt_alleles1 = ",".join(map(str, record.ALT))
    alt_freq1 = ",".join(map(str, record.INFO['AF']))

Now, I write these called values to an output text file as:

    output = open("My_allele_table.txt", "a")
    # Use the variable names defined above (alt_alleles1, alt_freq1).
    output.write("{}\t{}\t{}\t{}\t{}"
             .format(contig1, pos1, ref_allele1, alt_alleles1, alt_freq1))

Additionally, I append other values to the output file. But when doing so I get an AttributeError, since the PI field is not present on every line.

Incomplete solution: I added exception handling for the error, but then it just skips the rest of the line instead of writing the remaining values.

    for sample in record.samples:
        try:
            output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
        except AttributeError:
            continue
    output.write('\n')
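
One likely reason the `try`/`except` above loses data: all three tags are formatted in a single call, so one missing tag discards the whole sample, and `continue` then writes nothing for it. Below is a sketch of looking each tag up separately with a fallback. It assumes, consistent with the AttributeError reported here, that the per-sample values live on a namedtuple-like object; `CallData` is a stand-in built for illustration, not PyVCF's own class, and `tag_value` is a hypothetical helper.

```python
from collections import namedtuple

def tag_value(call_data, tag, default="."):
    """Return call_data.<tag>, or the default when the tag is absent or None."""
    value = getattr(call_data, tag, None)
    return default if value is None else value

# Stand-in for the first sample at position 1921, whose FORMAT has no PI field.
CallData = namedtuple("CallData", ["GT", "AD", "DP", "GQ", "PG", "PL", "PW"])
sample_1921 = CallData("0/0", [1, 0], 1, 3, "0/0", [0, 3, 35], "0/0")

# Each tag is resolved on its own, so a missing PI no longer drops GT and PG.
fields = [tag_value(sample_1921, tag) for tag in ("GT", "PI", "PG")]
print("\t".join(str(f) for f in fields))
```

With PyVCF itself the same idea would presumably be applied to the object behind each `sample`, but the exact attribute to pass should be checked against the installed version.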

Any help appreciated!

rna-seq gt format pyvcf vcf

Why not simply use if sample['PI'] ... else ...?

written 9 months ago by geek_y

@Goutham: Can you please provide more details on your code?

written 9 months ago by kirannbishwa01
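
A sketch of the `if ... else ...` idea from the comment above. Since the question reports that the lookup itself raises when the tag is absent, the condition here tests whether the tag is listed in the record's FORMAT string rather than testing the value. `value_if_present` is a hypothetical name, and the strings are copied from the first sample at position 1921 in the example.

```python
def value_if_present(format_field, sample_field, tag, default="."):
    """Return the sample's value for tag if FORMAT lists it, else the default."""
    keys = format_field.split(":")
    if tag in keys:
        # The value sits at the same position in the sample column.
        return sample_field.split(":")[keys.index(tag)]
    else:
        return default

fmt = "GT:AD:DP:GQ:PG:PL:PW"
sample = "0/0:1,0:1:3:0/0:0,3,35:0/0"

print(value_if_present(fmt, sample, "PI"))
print(value_if_present(fmt, sample, "GT"))
```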