How to write a Python script to edit a .vcf file?
7 weeks ago
tidalArms • 0

I am trying to update an older .vcf file with some new information so that it reflects the changes that have been implemented in the . The sample IDs have changed, as has some of the formatting of the GT, DS, and GP information (e.g. forward slashes changed to pipe symbols). I have been researching how best to do this with the Python package PyVCF, but it is not entirely clear from its docs (https://pyvcf.readthedocs.io/en/latest/INTRO.html) how to go about it. I tried to create a PyVCF Writer object that copies the template of the new .vcf file (i.e. its metadata and format). I then wanted a for loop that iterates over each record in the old .vcf, changes each sample name (based on a pre-existing dict), and modifies the content of the INFO, FORMAT, and sample result sections.

However, PyVCF does not seem to have any tools to do this easily. So I found another library called VCFPy (https://vcfpy.readthedocs.io/en/stable), but it does not seem to have any clear-cut tools for this either.

With both packages, I wanted to iterate over the old .vcf file (as a reader object), copy each sample and variant, and modify each respectively. So my code would look roughly like this:

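import vcf
old_vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
# vcf_writer is the vcf.Writer created earlier from the new .vcf template
for record in old_vcf_reader:
    # update sample names and modify GT formats
    vcf_writer.write_record(record)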


But does anybody know how I can easily update or modify the content of each record in the above for loop?

vcfpy vcf python pyvcf • 564 views

Have you looked at the cyvcf2 documentation? That's my preferred module for working with VCFs.


I have not tried it yet. I will look into it now.

6 weeks ago
sbstevenlee ▴ 270

You may want to check out the pyvcf submodule I wrote, which is part of the fuc package (it's not the same as PyVCF or VCFPy).

For updating sample names, take a look at the pyvcf.VcfFrame.rename() method:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1'],
...     'B': ['0/1', '0/1'],
...     'C': ['0/1', '0/1'],
...     'D': ['0/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B    C    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['1', '2', '3', '4']).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    1    2    3    4
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename({'B': '2', 'C': '3'}).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    3    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['2', '4'], indicies=[1, 3]).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    C    4
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1
>>> vf.rename(['2', '3'], indicies=(1, 3)).df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    2    3    D
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1  0/1  0/1


For updating the GT format, the exact code would depend on the type of operation you want, but you can generally achieve almost anything by applying a custom function to the pandas.DataFrame of a pyvcf.VcfFrame. For example, let's assume your goal is to replace 0/0 with 0/1, which is a stupid thing to do:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr2'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/0', '0/1'],
...     'B': ['0/0', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/0  0/0
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1
>>> def one_row(row):
...     row[9:] = row[9:].apply(lambda x: '0/1' if x == '0/0' else x)
...     return row
...
>>> vf.df = vf.df.apply(one_row, axis=1)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   G   A    .      .    .     GT  0/1  0/1
1  chr2  101  .   T   C    .      .    .     GT  0/1  0/1

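For the change mentioned in the question (forward slashes to pipe symbols), a helper along the following lines could be called from inside one_row. This is only a minimal sketch in plain Python and not part of fuc: the name slash_to_pipe is hypothetical, and it assumes each sample value is colon-separated in the same order as the FORMAT column, so that only the GT subfield is touched and DS or GP values are left alone. Keep in mind that a pipe marks a genotype as phased, so the swap only makes sense if that is true of your data.

```python
def slash_to_pipe(format_field, sample_field):
    """Replace '/' with '|' in the GT subfield only (hypothetical helper)."""
    keys = format_field.split(':')      # e.g. ['GT', 'DS', 'GP']
    values = sample_field.split(':')    # e.g. ['0/1', '1.0', '0,1,0']
    if 'GT' not in keys:
        return sample_field             # nothing to change without a GT key
    i = keys.index('GT')
    if i < len(values):                 # guard against shortened sample fields
        values[i] = values[i].replace('/', '|')
    return ':'.join(values)


print(slash_to_pipe('GT:DS:GP', '0/1:1.0:0,1,0'))
print(slash_to_pipe('DS', '1.0'))
```

Inside one_row, the lambda could then become something like lambda x: slash_to_pipe(row['FORMAT'], x), assuming the FORMAT column is present as in the toy data above.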

If you are not confident about writing your own custom function (i.e. one_row), please let me know in the comments. Also, there are MANY pre-defined methods, so please take a look at the API first should you choose to go this route. By the way, you can easily read and write VCF files with pyvcf:

from fuc import pyvcf
vf = pyvcf.VcfFrame.from_file('in.vcf')
vf.to_file('out.vcf')
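
If you would rather avoid extra dependencies, or the file is too large to load in one go, the sample renaming can also be sketched as a line-by-line rewrite using only the standard library. Treat this as a sketch under assumptions rather than how fuc does it: rename_header, rewrite, and the mapping passed in are hypothetical names, the input is assumed to be an uncompressed, tab-separated VCF, and only the #CHROM header line is modified.

```python
import io


def rename_header(line, rename_map):
    """Rename sample columns (index 9 onward) on the '#CHROM' header line."""
    cols = line.rstrip('\n').split('\t')
    # Samples missing from rename_map keep their original name.
    cols[9:] = [rename_map.get(name, name) for name in cols[9:]]
    return '\t'.join(cols) + '\n'


def rewrite(src, dst, rename_map):
    """Copy a VCF from src to dst, renaming samples along the way."""
    for line in src:
        if line.startswith('#CHROM'):
            line = rename_header(line, rename_map)
        dst.write(line)


# Tiny in-memory example; with real files, pass open('in.vcf') and
# open('out.vcf', 'w') instead of the StringIO objects.
old = io.StringIO(
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA\tB\n'
    'chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\t0/1\n'
)
new = io.StringIO()
rewrite(old, new, {'A': 'sample_1'})
print(new.getvalue())
```

Metadata lines and variant records pass through untouched here, so any per-record edits would need to be added in the loop inside rewrite.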


Thanks so much for this contribution! But out of curiosity, for the first step, how did you first get your VCF data into a dict?


@tidalArms, the dict is just a toy example I created from scratch to show you how to create a pyvcf.VcfFrame object; therefore, I didn't convert a VCF file into a dict. As I showed at the bottom, you will want to use the pyvcf.VcfFrame.from_file method to construct a pyvcf.VcfFrame object from a VCF file.