Select genotype from VCF file in Python
1
0
Entering edit mode
2.0 years ago
tea.vuki ▴ 10

Hello,

How do I select only genotype (1/1, 0/1 etc..) data from VCF file using PySam?

vcf python pysam • 2.1k views
ADD COMMENT
2
Entering edit mode
2.0 years ago

To process VCF file in Python I would recommend cyvcf2

makes parsing a VCF a joyous event:

from cyvcf2 import VCF

for variant in VCF('some.vcf.gz'): # or VCF('some.bcf')
    variant.REF, variant.ALT # e.g. REF='A', ALT=['C', 'T']

    variant.CHROM, variant.start, variant.end, variant.ID, \
                variant.FILTER, variant.QUAL

    # numpy arrays of specific things we pull from the sample fields.
    # gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
    variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths # numpy arrays
    variant.gt_phases, variant.gt_quals, variant.gt_bases # numpy array


    ## INFO Field.
    ## extract from the info field by it's name:
    variant.INFO.get('DP') # int
    variant.INFO.get('FS') # float
    variant.INFO.get('AC') # float
ADD COMMENT
0
Entering edit mode

Thanks! I will try cyvcf2 to learn it, but now I kind of have to go through this with pysam.. and this code:

for rec in bcf_in.fetch():
    print(rec.info)
    print(rec.info.keys())
    print(rec.info["DP"])

doesn't work for me

ADD REPLY
0
Entering edit mode

doesn't work is not a good way to state a problem,

there could be a myriad of reasons why something does not work no one could possibly guess what does not work

You should always state the exact error message you get

ADD REPLY
0
Entering edit mode

Sorry.

The error message is: 'Unknown INFO field: DP'

When I try "GT" instead of DP I get the same.

And this is the code I found in the pysam documentation under "but also to complex attributes such as the contents to the info, format and genotype columns. These complex attributes are views on the underlying htslib data structures and provide dictionary-like access to the data:"

ADD REPLY
0
Entering edit mode

It means that your VCF file does not have that field. Check the VCF file, how is the DP tag reported?

Sometimes you get it reported as FORMAT/DP

ADD REPLY
0
Entering edit mode

Oh, now I see. I don't know why documentation put it under "info" when this ID is actually under format.

So when I do:

for rec in object_ref.fetch():
    print(rec.info)
    print(rec.info.keys())
    print(rec.format["GT"])

I get the output. However, I get it like this:

<pysam.libcbcf.VariantRecordInfo object at 0x106adeec0>
[]
<pysam.libcbcf.VariantMetadata object at 0x105b3b310>
<pysam.libcbcf.VariantRecordInfo object at 0x105c188e0>
[]

And what I want is to grab exact values

ADD REPLY
0
Entering edit mode

your output does not match the code, your code prints three lines, the output has five lines.

regardless, the solution here is to learn the ins and outs of PySam and the way the variants are represented, look at the API and docs for VariantRecordInfo

but the real solution is to drop PySam and use cyvcf2 ... there is a reason that cfvcf2 exists, and the main reason is PySam is confusing has a counterintuitive interface that is best avoided

ADD REPLY
0
Entering edit mode

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 1507 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6