Question: How to call the sample level genotype INFO when certain `FORMAT` fields (or attributes) are missing in some lines of the vcf file?
1
gravatar for kirannbishwa01
2.2 years ago by
United States
kirannbishwa01950 wrote:

I have a vcf file with following data structure

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  2ms02g  2ms03g  2ms04h
2   1738    .   A   G   4693.24 PASS    AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1       0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0       0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1     0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2   1764    .   A   C   51892.85    PASS    AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW      1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0      1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1     1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0    0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2   1921    .   T   C   4465.03 PASS    AC=0;AF=0.00;AN=6;set=Intersection  GT:AD:DP:GQ:PG:PL:PW    0/0:1,0:1:3:0/0:0,3,35:0/0  ./.:0,0:0:.:./.:0,0,0:./.   0/0:1,0:1:3:0/0:0,3,39:0/0  0/0:2,0:2:6:0/0:0,6,80:0/0

Problem: Unfortunately the number of fields in the FORMAT column (9th column, 8th column python based) isn't the same for all the lines and pyVCF is giving me error when trying to call certain values from the vcf file.

I want to read this file and mine values from specific tags like GT, PI and PG. But, all these tags (mainly the PI tag) are not present in all the lines; in such cases I just want to the value output to be default period i.e '.'

So, the file output would have following structure:

contig  pos ref alt_My  freq_My  GT  PI  PG
2   1764    A   C   0.250   1|0   1020  1|0   
2   1921    T   C   0.00    0/0   .   0/0

Below is my script:

import vcf;
vcf1_data = vcf.Reader(open('MY.phased_variants.Final_sub.vcf', 'r'))
for record in vcf1_data:
    contig1 = record.CHROM
    pos1 = record.POS
    ref_allele1 = record.REF
    alt_alleles1 = ",".join(map(str, (record.ALT[::])))
    alt_freq1 = ",".join(map(str, record.INFO['AF'])))

Now, I write these called values to an output text file as:

    output = open("My_allele_table.txt", "a")
    output.write("{}\t{}\t{}\t{}\t{}"
             .format(contig1, pos1, ref_allele1, all_alleles1, all_freq1))

Additionally, I append other values to the output file. But, when doing so I get AttributeError since PI field is not present in all the line.

Incomplete solution: I added exception to the error but then it just skips reading through the end of the line.

    for sample in record.samples:
        try:
            output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
        except AttributeError:
            continue
    output.write('\n')

Additionally, I tried:

    if sample['PI'] not in sample:
        PI = '.'
    else: PI = sample['PI']

Still not working.

Any help appreciated !

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by kirannbishwa01950

Why do you keep deleting old threads and creating new? It's not likely that we suddenly now do know the answer.

ADD REPLYlink written 2.2 years ago by WouterDeCoster37k

It's because when there is no answer in first 18 hours, there will never be an answer. Either there are very few resources personnel volunteering for answering, or I am not asking questions that are interesting enough. Fact is most questions are not interesting !

ADD REPLYlink written 2.2 years ago by kirannbishwa01950
1

Okay, let's try to troubleshoot this. But as you know, we are volunteers so you'll have to be patient. It's Friday night and I just poured a glass of red wine, so let's talk python.

What's not working in this case?

if sample['PI'] not in sample:
    PI = '.'
else: 
    PI = sample['PI']

Have you tried the following?

for sample in record.samples:
    try:
        output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
    except AttributeError:
        output.write("\t{}\t{}\t{}".format(sample['GT'], ".", sample['PG']))
ADD REPLYlink written 2.2 years ago by WouterDeCoster37k

HI @WouterDeCoster: Thanks much. The given code worked. I was more than frustrated and had been pulling my hair for 2 full days. I read through several examples and tutorials to mine the data, but this led me no-where. I reposted this question almost 3 times hoping that I had made the problem-explanation more simpler each time, and hoping someone would read it eventually.

I also couldn't move past this problem because the downstream analyses all depends upon this. Sometimes, I think why I even drown myself into this project/problem. And, there is a lots of assumption that the problem is not big at all - but for biologist it is. I can have a full fledged debate and analyses on evolutionary theories (which is my expertise) at quite decent level but the coding problem just drives me nuts sometimes - we all have our weaknesses, but it turns into frustration when the weakness becomes unmanageable.

The solution you proposed was a simple and elegant one which I now realize I was always around it, but I really didn't find it in the dark.

Thank again for being patient on my comments !

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by kirannbishwa01950

Happy to help. Good luck with the rest of the project, and feel free to ask more questions when necessary.

ADD REPLYlink written 2.2 years ago by WouterDeCoster37k

Another reason is I am working in a lab (and a department) where there is not IT support at all and I am all from a Bio background and my professor doesn't have a network to help. Any problem that I am facing is solely mine and not shared. Am I desperate for an answer - yes I am. Have I researched the problem - yes I have. Am I pyton or other IT expert - no I am not, but I have managed to learn to the extent I can.

ADD REPLYlink written 2.2 years ago by kirannbishwa01950
1

Based on your code you already learned a lot. Don't lose hope, learning to code takes time. I also learned Python on my own, mostly.

ADD REPLYlink written 2.2 years ago by WouterDeCoster37k

How is your VCF generated? It seems strange that you would need to do this.

Just a simple tip, initialize a string to some value, do the try/except loop to check values and set the new value, and then write out the line after you build the string you hope to write out (or something similar).

ADD REPLYlink written 2.2 years ago by mc20

This vcf was generate using GATK first and then later updated by another software called pHASER. All, was well, but the way the header info was parsed by this later software was a bit different - reason being that the author didn't use pyVCF module to parse it, rather the author read the the vcf files as text file and then parsed through it line-by-line. I confirm this because I have his python software and reading through it I saw pyVCF was never used.

All, was well until I took the vcf file generated by this pHASER and then ran it again through GATK to combine the variants. This is when the values in the FORMAT and SAMPLE fields got updated and rearranged. Though this process didn't affect the values in the vcf and corresponding fields, but GATK removed certain fields that were not present in all the samples - which led to unequal number of fields in FORMAT and SAMPLE column.

My problem: Now, pyVCF doesn't handle the call properly if the field is missing in even one line of the vcf.

I think we need to put the programmers together in one room, so their final software is clean; and doesn't trouble the front end user to a greater extent. Lol !

I hope I am making sense with what I have explained. :)

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by kirannbishwa01950

Thanks for the explanation. You may want to participate on the GATK forums to see why this is happening. if it is a side-effect you may want to point it out to the authors of the software so they can fix it, or explain why it is behaving the way it is. Sorry, I don't have experience with GATK. I think most maintainers of open source software are happy to hear from users.

ADD REPLYlink written 2.2 years ago by mc20

Sounds to me the problem was in pHASER, if I understood correctly.

ADD REPLYlink written 2.2 years ago by WouterDeCoster37k

I think it's partly both. I will need to collect more details and report to both the author.

ADD REPLYlink written 2.2 years ago by kirannbishwa01950
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1913 users visited in the last hour