Applying Machine Learning on vcf file
2
0
Entering edit mode
11 months ago

I am learning machine learning for genomics data. I have a vcf file and I want to split my data into X and Y labels. I just want to know how can I do that ?? should I mark GT columns as 0/0 , 0/1, 1/1 into 1,2,3 and mark them as and rs ids as my predictors ?? I want to apply logistic regression on this.!

machine vcf learning • 1.5k views
1
Entering edit mode
1
Entering edit mode
11 months ago
Anh ▴ 20

From this paper The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines, they used categorical encoding for binary results ./., 0/0, 0/1, 1/1. You should also read some papers on machine learning approaches to the variant calling problem, especially on how they defined TP, TN, FP, FN based on their truth datasets or synthetic datasets. I will include two papers here for reference.

0
Entering edit mode
11 months ago
sbstevenlee ▴ 480

Hi, I just wanted to introduce you to one of my packages called fuc (GitHub). It has a submodule called pyvcf (API) which is designed for working with VCF files. It implements pyvcf.VcfFrame which stores VCF data as pandas.DataFrame to allow fast computation and easy manipulation. One of my main goals for writing pyvcf was to create a tool that can make VCF more friendly to ML analyses and data mining. So please check it out if you're interested!

Below is an example usage of pyvcf to detect/visualize structural variation from VCF.

from fuc import pyvcf, common
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

cyp2d6_starts = [42522500,42522852,42523448,42523843,42524175,42524785,42525034,42525739,42526613]
cyp2d6_ends = [42522754,42522994,42523636,42523985,42524352,42524946,42525187,42525911,42526883]
cyp2d7_starts = [42536213,42536565,42537161,42537543,42537877,42538479,42538728,42539410,42540284]
cyp2d7_ends = [42536467,42536707,42537349,42537685,42538054,42538640,42538881,42539582,42540576]

vcf_file = '~/fuc-data/pyvcf/getrm-cyp2d6-vdr.vcf'
vf = pyvcf.VcfFrame.from_file(vcf_file)

fig, axes = plt.subplots(2, 3, figsize=(18, 12))

[[ax1, ax2, ax3], [ax4, ax5, ax6]] = axes

vf.plot_region('NA18973', ax=ax1, color='tab:green')
vf.plot_region('HG00276', ax=ax2, color='tab:green')
vf.plot_region('NA19109', ax=ax3, color='tab:green')

ax1.set_title('NA18973 (no structural variation)', fontsize=25)
ax2.set_title('NA10831 (CYP2D6 deletion)', fontsize=25)
ax3.set_title('NA19109 (CYP2D6 duplication)', fontsize=25)

ax4.set_ylabel('Allele fraction')
ax4.legend(loc='upper left', fontsize=20, markerscale=2)

for ax in [ax1, ax2, ax3]:
ax.set_xlabel('')
ax.set_xticklabels([])
ax.set_ylim([-15, 120])
common.plot_exons(cyp2d6_starts, cyp2d6_ends, name='CYP2D6', offset=8, y=-5, height=5, ax=ax)
common.plot_exons(cyp2d7_starts, cyp2d7_ends, name='CYP2D7', offset=8, y=-5, height=5, ax=ax)

for ax in [ax2, ax3, ax5, ax6]:
ax.set_ylabel('')
ax.set_yticklabels([])

for ax in [ax1, ax4, ax5, ax6]:
ax.xaxis.label.set_size(20)
ax.yaxis.label.set_size(20)
ax.tick_params(axis='both', which='major', labelsize=15)

for ax in [ax4, ax5, ax6]:
ax.set_ylim([-0.15, 1.1])
common.plot_exons(cyp2d6_starts, cyp2d6_ends, name='CYP2D6', offset=0.075, y=-0.05, height=0.05, ax=ax)
common.plot_exons(cyp2d7_starts, cyp2d7_ends, name='CYP2D7', offset=0.075, y=-0.05, height=0.05, ax=ax)

plt.tight_layout()
plt.savefig('vcf_sv.png')

0
Entering edit mode

Nice tool. I'm guessing you looked at ALOT of those CYP2D6 deletions .

How long are those deletions typically (or does it vary quite a bit)?

Are the deletions typically positioned in the same spot (similar start and stop position), do they have start and stop "hotspots", or do they start and stop kind of all over the place in the window you are looking at there?

I would guess because of the observation of hybrids that CYP2D6 deletions have a lot of variety in position, but I am no expert and would love hear what you have seen.

0
Entering edit mode

Great questions! There is a plethora of structural variants in the CYP2D6 gene including gene deletions, duplications, and hybrids (with a psuedogene called CYP2D7). Each of those structural variants varies in their breakpoints and copy number status.

As for gene deletions, there is one particular variant that is predominant in the population (it's called CYP2D6*5) and it is almost 15 kb in size. It's the one that's shown in the middle panel by the way.

If you are interested in reading more on structural variation detection for CYP2D6 and other pharmacogenes using NGS data, please refer to my papers:

Stargazer: a software tool for calling star alleles from next-generation sequencing data using CYP2D6 as a model

Calling star alleles with Stargazer in 28 pharmacogenes with whole genome sequences

0
Entering edit mode

in NA10831 and NA19109 it looks like the deletion spans most the length of cyp2d6 and breaks just before the start of cyp2d7. This is probably a deletion that breaks at rep7 and rep6? This is typically the definition of *5?