Question: Genotype matrix to VCF SNP database
4.3 years ago, kirannbishwa01 (United States) wrote:

I have been able to inspect the data structure and extract a portion of the file. Thank you all for the help so far.

A portion of the data is shown below (only 9 of the 34 columns are shown; the remaining 25 are omitted for readability). The file is tab-separated.

1    1    T    N    N    N    N    N    N

1    2    C    N    N    N    N    N    N

1    3    A    N    N    N    N    N    N

1    5    T    N    N    N    N    N    N

*contd.....*

1    99    C    C    C    C    C    N    C

1    100    A    A    A    A    A    N    A

1    101    T    T    T    T    T    T    T

 

and so on.

*contd*- added to show continuity in the data.

 

The columns contain the following information:

Column #1 - Chromosome number (there are 8 chromosomes in total)

Column #2 - read position

Column #3 - nucleotide base at that particular position for the reference genome

Columns #4 to #34 - the nucleotide base observed at that position, which may be the same as or different from the reference (column #3). Most of the bases in this excerpt match the reference, but at other positions they vary. N indicates a missing base.
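To make the layout above concrete, here is a minimal Python sketch of how one row could be split into its parts. The names `SiteRow` and `parse_row` are hypothetical, and the sketch assumes every row is tab-separated with at least one sample column after the reference base.

```python
from typing import List, NamedTuple


class SiteRow(NamedTuple):
    """One line of the genotype matrix (hypothetical container)."""
    chrom: str        # column 1: chromosome number
    pos: int          # column 2: position
    ref: str          # column 3: reference base
    calls: List[str]  # columns 4 onward: one base per sample, "N" = missing


def parse_row(line: str) -> SiteRow:
    """Split a tab-separated matrix line into chromosome, position, ref and calls."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated columns, got {len(fields)}")
    chrom, pos, ref = fields[0], int(fields[1]), fields[2].upper()
    calls = [base.upper() for base in fields[3:]]
    return SiteRow(chrom, pos, ref, calls)


# Example using the row at position 99 from the excerpt above.
row = parse_row("1\t99\tC\tC\tC\tC\tC\tN\tC")
print(row.chrom, row.pos, row.ref, row.calls)
```

Reading the file line by line with a function like this avoids loading the whole matrix into memory, which may matter if every position of all 8 chromosomes is present.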

Question:

1) I want to create a VCF file in the following format, with the following tab-separated header.

CHROM POS  REF    ALT   FREQ

 

So, I will need to check whether any allele differs from the reference. If none does, the ALT field will be blank for that position; otherwise it will contain one or more alternate alleles. Could anyone suggest which tools would be useful for this purpose? Also, please provide brief guidance in the form of a script or example.
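The ALT check described here could be sketched in Python roughly as follows. This is only an illustration under assumptions: `alt_alleles` and `site_line` are hypothetical names, missing calls are coded as `N`, and multiple alternate alleles are joined with commas. Note that standard VCF writes `.` rather than an empty string when there is no alternate allele, so the sketch does the same.

```python
from collections import Counter
from typing import List

MISSING = "N"  # assumption: missing calls are coded as "N", as in the excerpt


def alt_alleles(ref: str, calls: List[str]) -> List[str]:
    """Return the distinct non-reference, non-missing bases, most frequent first."""
    counts = Counter(c for c in calls if c != MISSING and c != ref)
    # Sort by descending count, then alphabetically so the order is deterministic.
    return [allele for allele, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def site_line(chrom: str, pos: int, ref: str, calls: List[str]) -> str:
    """Build one tab-separated CHROM/POS/REF/ALT line for a site."""
    alts = alt_alleles(ref, calls)
    alt_field = ",".join(alts) if alts else "."  # "." marks "no alternate allele"
    return "\t".join([chrom, str(pos), ref, alt_field])


print("\t".join(["CHROM", "POS", "REF", "ALT"]))
print(site_line("1", 99, "C", ["C", "C", "C", "C", "N", "C"]))  # all calls match the reference
print(site_line("1", 7, "A", ["A", "G", "G", "T", "N"]))        # made-up site with two alternates
```

The second site is invented purely to show the multi-allelic case; it does not come from the data above.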

2) When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, suppose there were 34 biological samples, of which 10 had the reference allele, 4 were missing (N), 10 had alternate allele type 1, and 10 had alternate allele type 2. I want to calculate the allele frequency of the reference (which would be 0.33), alternate type 1 (0.33), and alternate type 2 (0.33) based solely on the available reads (excluding the missing reads).
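The frequency calculation in this example amounts to dividing each allele count by the number of non-missing calls. A minimal Python sketch, assuming one base per sample and `N` for missing (the function name `allele_frequencies` is hypothetical):

```python
from collections import Counter
from typing import Dict, List


def allele_frequencies(calls: List[str], missing: str = "N") -> Dict[str, float]:
    """Frequency of each allele among the non-missing calls at one site."""
    observed = [c for c in calls if c != missing]
    if not observed:
        return {}  # every sample is missing: no frequency can be computed
    counts = Counter(observed)
    total = len(observed)
    return {allele: n / total for allele, n in counts.items()}


# 10 reference (T), 10 of one alternate (C), 10 of another (G), plus missing calls.
# The number of N calls does not affect the result because they are excluded.
calls = ["T"] * 10 + ["C"] * 10 + ["G"] * 10 + ["N"] * 4
freqs = allele_frequencies(calls)
for allele in sorted(freqs):
    print(f"{allele}\t{freqs[allele]:.2f}")
```

The bases T, C and G are placeholders for the reference and the two alternate types. Sites where every sample is missing return an empty result here; whether to skip them or write a placeholder in the FREQ column is a design choice left open.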

 

Thanks very much in advance.
