Question

Genotype matrix to vcf snp database


8.2 years ago

kirannbishwa01 ★ 1.6k

I have been able to inspect the data structure and extract a portion of the file. Thanks for all the help so far.

A portion of the data is shown below. The file is tab separated; only 9 of the 34 columns are shown, with the remaining 25 removed for ease of viewing.

1    1    T    N    N    N    N    N    N
1    2    C    N    N    N    N    N    N
1    3    A    N    N    N    N    N    N
1    5    T    N    N    N    N    N    N

contd.....

1    99     C    C    C    C    C    N    C
1    100    A    A    A    A    A    N    A
1    101    T    T    T    T    T    T    T

and so on.

"contd....." is added only to show that the data continues between the two excerpts.

The columns contain the following information:

Column #1 - chromosome number (there are 8 chromosomes in total)
Column #2 - read position
Column #3 - nucleotide base at that position in the reference genome
Columns #4 to #34 - nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases shown here match the reference, but elsewhere they vary. N indicates a missing base.
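
As a rough illustration of this layout, one row could be split into its fields as sketched below. The `Site` type and the `parse_row` function are hypothetical names used only for this example, and the sketch assumes tab-separated input with no header line.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    chrom: str          # column 1: chromosome number
    pos: int            # column 2: read position
    ref: str            # column 3: reference base
    samples: List[str]  # columns 4 onward: one base per sample, "N" = missing


def parse_row(line: str) -> Site:
    """Split one tab-separated matrix row into its fields."""
    fields = line.rstrip("\n").split("\t")
    return Site(fields[0], int(fields[1]), fields[2], fields[3:])


# Row taken from the excerpt above (position 99, 6 sample columns shown).
site = parse_row("1\t99\tC\tC\tC\tC\tC\tN\tC")
print(site.pos, site.ref, site.samples)
```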

Question:

I want to create a VCF-like file in the following format, with the following header (tab separated).
```
CHROM POS  REF    ALT   FREQ
```
So, I need to check whether any allele differs from the reference. If none does, the ALT field will be blank for that read position; otherwise it will hold one or more alternate alleles for that position. Could anyone suggest which tools would be useful for this purpose? Brief guidance on a script, or an example, would also be appreciated.
When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, if there were 34 biological samples with 10 reference, 4 missing (N), 10 alternate type 1, and 10 alternate type 2 alleles, I want the allele frequencies of the reference (10/30 ≈ 0.33), alt type 1 (0.33), and alt type 2 (0.33) to be based solely on the available reads (excluding the missing ones).
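
To make the intended calculation concrete, here is a minimal sketch in plain Python. It is not a finished tool: `allele_frequencies` and `to_record` are hypothetical names, the sample calls are made up, and writing FREQ as a comma-separated list (reference first, then each ALT allele in the same order as the ALT field) is an assumption about the output format.

```python
from collections import Counter
from typing import Dict, List


def allele_frequencies(bases: List[str], missing: str = "N") -> Dict[str, float]:
    """Return the frequency of each observed base, ignoring missing calls."""
    counts = Counter(b for b in bases if b != missing)
    total = sum(counts.values())
    if total == 0:
        return {}  # every sample is missing at this position
    return {base: n / total for base, n in counts.items()}


def to_record(chrom: str, pos: int, ref: str, bases: List[str]) -> str:
    """Build one tab-separated CHROM/POS/REF/ALT/FREQ line."""
    freqs = allele_frequencies(bases)
    # ALT stays blank when no sample differs from the reference.
    alts = sorted(b for b in freqs if b != ref)
    if freqs:
        # Assumed layout: reference frequency first, then one value per ALT.
        freq_field = ",".join(f"{freqs.get(b, 0.0):.2f}" for b in [ref] + alts)
    else:
        freq_field = ""
    return "\t".join([chrom, str(pos), ref, ",".join(alts), freq_field])


# Made-up calls: 10 reference (T), 10 each of two alternates (C, G), some missing.
# Missing calls are dropped, so their number does not change the frequencies.
bases = ["T"] * 10 + ["N"] * 4 + ["C"] * 10 + ["G"] * 10
print(to_record("1", 200, "T", bases))
```

Each frequency here is 10/30, so the FREQ field comes out as 0.33,0.33,0.33.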

Thanks much in advance

SNV vcf SNP genotype variants matrix • 2.0k views

updated 21 months ago by Ram 43k • written 8.2 years ago by kirannbishwa01 ★ 1.6k