Closed: Genotype matrix to VCF/SNP database
8.2 years ago
kirannbishwa01 ★ 1.6k

I have been able to inspect the data structure and extract a portion of the file. Thank you all for the help so far.

A portion of the data is shown below. The file is tab separated; only 9 of the 34 columns are shown, and the remaining 25 are omitted for readability.

1    1    T    N    N    N    N    N    N
1    2    C    N    N    N    N    N    N
1    3    A    N    N    N    N    N    N
1    5    T    N    N    N    N    N    N
..
..
..
1    99    C    C    C    C    C    N    C
1    100    A    A    A    A    A    N    A
1    101    T    T    T    T    T    T    T

and so on.

The columns indicate following information:

  • Column #1 - chromosome number (there are 8 chromosomes in total)
  • Column #2 - read position
  • Column #3 - nucleotide base at that position in the reference genome
  • Columns #4 to #34 - nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases shown above match the reference, but they differ at other positions. N indicates a missing base.
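The column layout described above could be read with something like the following minimal sketch. It assumes one tab-separated site per line and no header row; the names `SiteRow` and `parse_row` are hypothetical and only used for illustration.

```python
from typing import List, NamedTuple


class SiteRow(NamedTuple):
    chrom: str        # column 1: chromosome number
    pos: int          # column 2: position
    ref: str          # column 3: reference base
    calls: List[str]  # columns 4 onward: one base per sample, "N" = missing


def parse_row(line: str) -> SiteRow:
    """Split one tab-separated line of the matrix into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}")
    # Upper-casing is an assumption, so that 'n' and 'N' are treated alike.
    calls = [base.upper() for base in fields[3:]]
    return SiteRow(fields[0], int(fields[1]), fields[2].upper(), calls)


# One of the example lines from the post (9 columns shown).
row = parse_row("1\t99\tC\tC\tC\tC\tC\tN\tC")
print(row.chrom, row.pos, row.ref, row.calls)
```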

Question:

  1. I want to create a vcf file in the following format and with the following header (tab separated).

    CHROM POS  REF    ALT   FREQ
    

    So, I need to check whether any allele differs from the reference. If none does, the ALT field will be blank for that read position; otherwise, there will be one or more alternate alleles at that position. Could anyone suggest which tools would be useful for this purpose? Also, please provide brief guidance on a script or example.

  2. When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, if there were 34 biological samples with 10 reference, 4 missing (N), 10 alternate type 1, and 10 alternate type 2 alleles, I want the allele frequency of the reference (10/30 = 0.33), alt type 1 (0.33), and alt type 2 (0.33) to be based solely on the available reads (excluding the missing ones).
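Both questions can be sketched in plain Python for a single site: collect the alleles that differ from the reference, then divide each allele count by the number of non-missing calls. This is only one possible approach under stated assumptions. The function name `format_site` is hypothetical, `.` is written for an empty ALT or FREQ field (the usual VCF placeholder for a missing value, swap in an empty string if a blank is preferred), and FREQ lists the reference frequency first, followed by the alternates in ALT order. Note that the five-column layout requested here is not a complete VCF record.

```python
from collections import Counter

MISSING = "N"  # symbol used for a missing base in the matrix


def format_site(chrom, pos, ref, calls, ndigits=2):
    """Return one tab-separated CHROM POS REF ALT FREQ line for a site."""
    # Drop missing calls so they do not enter the denominator.
    called = [base for base in calls if base != MISSING]
    counts = Counter(called)

    # Alternate alleles: every observed base that differs from the reference,
    # most frequent first, ties broken alphabetically.
    alts = sorted((a for a in counts if a != ref),
                  key=lambda a: (-counts[a], a))

    if called:
        freqs = [counts[a] / len(called) for a in [ref] + alts]
        freq_field = ",".join(f"{f:.{ndigits}f}" for f in freqs)
    else:
        freq_field = "."  # no sample has a call at this site

    alt_field = ",".join(alts) if alts else "."
    return "\t".join([str(chrom), str(pos), ref, alt_field, freq_field])


# The example from the question: 10 reference, 4 missing, 10 + 10 alternates.
calls = ["A"] * 10 + ["N"] * 4 + ["G"] * 10 + ["T"] * 10
print("CHROM\tPOS\tREF\tALT\tFREQ")
print(format_site("1", 5, "A", calls))
```

For a whole file, the same function could be applied line by line after splitting each row on tabs.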

Thanks very much in advance.

vcf SNP genotype-matrix variants allele-frequency • 63 views