I have obtained a genome matrix file for F1 hybrids of the model organism I am working with. My main goal is to identify and pull out the polymorphism data from this file. However, I have not found any tool on the web (via Google search) that can handle this type of data.
Additionally, I have another problem: the file came gzipped (819 MB), expanded to 14.5 GB on extraction (quite a compression ratio), and has a .txt extension. I tried opening it in gvim just to see what kind of information/header it has, but that froze my computer. I am using Ubuntu with 12 GB RAM and 10 GB swap, which does not seem to be enough for this file.
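For anyone who hits the same freeze, here is a minimal sketch of reading only the first few lines straight from the gzipped file, so the 14.5 GB text never has to be opened in an editor. The function name `peek` and the file name in the usage note are placeholders I made up, not part of the data set.

```python
import gzip
from itertools import islice


def peek(path, n=10):
    """Return the first n lines of a gzipped text file.

    gzip.open in text mode decompresses lazily, so only the
    part of the file that is actually read is held in memory.
    """
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Calling `peek("genome_matrix.txt.gz", 20)` (hypothetical file name) would return the first 20 lines as a list of strings, which is enough to see the header and column layout.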
Thanks in advance,
New Question added:
I have been able to inspect the data structure and extract a portion of the file. Thanks to everyone for the help so far.
A portion of the data is shown below. It is tab-separated; only 9 of the 34 columns are shown, and the remaining 25 have been removed for easier viewing.
1 1 T N N N N N N
1 2 C N N N N N N
1 3 A N N N N N N
1 5 T N N N N N N
1 99 C C C C C N C
1 100 A A A A A N A
1 101 T T T T T T T
and so on.
*contd*- added to show continuity in the data.
The columns contain the following information:
Column #1 - chromosome number; there are 8 chromosomes altogether
Column #2 - read position
Column #3 - nucleotide base at that position in the reference genome
Columns #4 to #34 - nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases in this view match the reference, but elsewhere in the file they vary. N indicates a missing base.
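To make the column layout above concrete, here is a small sketch of how one line could be split into its parts. The names `Site` and `parse_site` are only illustrative, and it assumes every line is tab-separated with at least the three leading columns.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    chrom: str        # column 1: chromosome number (kept as text)
    pos: int          # column 2: position
    ref: str          # column 3: reference base
    calls: List[str]  # columns 4 onward: one base per sample, "N" if missing


def parse_site(line):
    """Split one tab-separated line of the matrix into a Site record."""
    fields = line.rstrip("\n").split("\t")
    return Site(fields[0], int(fields[1]), fields[2], fields[3:])
```

For the row at position 99 in the excerpt, `parse_site` would give chromosome "1", position 99, reference "C" and the six sample calls shown.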
1) I want to create a VCF-like file with the following tab-separated header:
CHROM POS REF ALT FREQ
So I need to check whether any sample allele differs from the reference. If none does, the ALT field will be blank for that read position; otherwise it will contain one or more alternate alleles. Could anyone suggest which tools would be useful for this purpose? A brief example script or some guidance would also be appreciated.
2) When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, if there were 34 biological samples with 10 reference, 4 missing (N), 10 alternate type 1 and 10 alternate type 2 alleles, I want the allele frequencies of reference (0.33), alternate type 1 (0.33) and alternate type 2 (0.33) to be based solely on the 30 available reads (excluding the missing ones).
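To pin down what I mean in 1) and 2), here is a rough sketch of the per-position logic, streamed line by line so the whole file never sits in memory. It is not a finished tool. The function names, the comma-separated ALT and FREQ fields, the two-decimal rounding and the REF-then-ALT order inside FREQ are all my own assumptions; it also assumes upper-case bases and N as the only missing code.

```python
import gzip
from collections import Counter

MISSING = "N"  # code used for a missing base in the matrix


def summarise(ref, calls):
    """Return (alt_alleles, freqs) for one position, ignoring missing calls."""
    counts = Counter(c for c in calls if c != MISSING)
    total = sum(counts.values())
    if total == 0:
        return [], {}  # every sample is missing at this position
    freqs = {allele: n / total for allele, n in counts.items()}
    alts = sorted(a for a in counts if a != ref)
    return alts, freqs


def format_row(chrom, pos, ref, calls):
    """Build one tab-separated CHROM/POS/REF/ALT/FREQ output line."""
    alts, freqs = summarise(ref, calls)
    order = [ref] + alts  # assumed order: reference first, then ALT alleles
    freq_field = ",".join(f"{freqs.get(a, 0.0):.2f}" for a in order) if freqs else ""
    return "\t".join([chrom, pos, ref, ",".join(alts), freq_field])


def convert(in_path, out_path):
    """Stream the gzipped matrix and write the summary table."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.write("CHROM\tPOS\tREF\tALT\tFREQ\n")
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            dst.write(format_row(fields[0], fields[1], fields[2], fields[3:]) + "\n")
```

With 10 reference calls, 10 calls of each of two alternate alleles and any number of N calls, each allele comes out at 10/30, i.e. 0.33, because the N calls are dropped before dividing. A position where every sample matches the reference gets a blank ALT field.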
Thanks much in advance,