Handling/Extracting genome matrix file
5.7 years ago
kirannbishwa01 ★ 1.3k

I have obtained a genome matrix file for F1 hybrids of the model organism I work with. My main goal is to extract the polymorphism data from this file. However, I have not found any tool on the web (via Google search) that can handle this type of data. I also have a practical problem: the file came gzipped (819 MB) and expands to 14.5 GB on extraction, with a .txt extension. I tried opening it in gvim just to see what kind of information and header it contains, but that froze my computer. I am using Ubuntu with 12 GB of RAM and 10 GB of swap, which does not seem to be enough for this file.

Thanks in advance,

 

New Question added:

I have been able to look at the data structure and extract a portion of the file. Thanks for the help, everyone.

A portion of the data is shown below. The file is tab separated; only 9 of the 34 columns are shown, and the remaining 25 are omitted for readability.

1    1    T    N    N    N    N    N    N
1    2    C    N    N    N    N    N    N
1    3    A    N    N    N    N    N    N
1    5    T    N    N    N    N    N    N
...
1    99    C    C    C    C    C    N    C
1    100    A    A    A    A    A    N    A
1    101    T    T    T    T    T    T    T

and so on. (The "..." marks rows omitted here; the data continue in between.)

 

The columns contain the following information:

Column #1 - chromosome number (there are 8 chromosomes in total)

Column #2 - read position

Column #3 - nucleotide base of the reference genome at that position

Columns #4 to #34 - the nucleotide base at that position in each sample, which may be the same as or different from the reference (column #3). Most of the bases shown here match the reference, but elsewhere they vary. N indicates a missing base.
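As an illustration of the layout described above, here is a minimal Python sketch that splits one line of the matrix into typed fields. The names `MatrixRow` and `parse_row` are made up for this example, and the sketch assumes the file really is tab separated with no header line.

```python
from typing import List, NamedTuple


class MatrixRow(NamedTuple):
    chrom: str          # column 1: chromosome
    pos: int            # column 2: position
    ref: str            # column 3: reference base
    samples: List[str]  # columns 4 onward: one base call per sample


def parse_row(line: str) -> MatrixRow:
    """Split one tab-separated matrix line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 columns, got {len(fields)}")
    chrom, pos, ref, *samples = fields
    return MatrixRow(chrom, int(pos), ref.upper(), [s.upper() for s in samples])


# One of the example rows shown above, written with real tabs.
row = parse_row("1\t99\tC\tC\tC\tC\tC\tN\tC\n")
print(row.pos, row.ref, row.samples)
```

Because each line is parsed on its own, the same function can be applied while streaming through the 14.5 GB file line by line, without loading it into memory.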

Question:

1) I want to create a VCF-like file in the following format, with the following tab-separated header:

CHROM    POS    REF    ALT    FREQ

So I need to check whether any allele at a position differs from the reference. If none does, the ALT field will be blank for that position; otherwise it will hold one or more alternate alleles. Could anyone suggest which tools would be useful for this purpose? Brief guidance with a script or example would also be appreciated.

2) When calculating the allele frequency, I want to leave the missing bases out of the calculation. For example, if there were 34 biological samples with 10 reference calls, 4 missing (N), 10 of alternate type 1 and 10 of alternate type 2, I want the allele frequencies of the reference (0.33), alt type 1 (0.33) and alt type 2 (0.33) to be based solely on the available calls, excluding the missing ones.
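One possible way to approach both points is sketched below in Python. It is only an illustration under stated assumptions: the function names `summarize_site` and `format_site` are hypothetical, missing calls are assumed to be coded exactly as `N`, and the FREQ field is written as comma-separated values in the order reference, then alternates (a layout chosen for this example, not something the question specifies).

```python
from collections import Counter
from typing import Dict, List, Tuple

MISSING = "N"  # assumed code for a missing base call


def summarize_site(ref: str, calls: List[str]) -> Tuple[List[str], Dict[str, float]]:
    """Return (alt_alleles, frequencies) for one position, ignoring missing calls."""
    observed = Counter(c for c in calls if c != MISSING)
    total = sum(observed.values())
    if total == 0:
        # Every sample is missing at this position: nothing to report.
        return [], {}
    alts = sorted(a for a in observed if a != ref)
    # Frequencies are taken over non-missing calls only; ref comes first.
    freqs = {allele: observed[allele] / total for allele in [ref] + alts}
    return alts, freqs


def format_site(chrom: str, pos: int, ref: str, calls: List[str]) -> str:
    """Build one tab-separated CHROM, POS, REF, ALT, FREQ line."""
    alts, freqs = summarize_site(ref, calls)
    alt_field = ",".join(alts)  # stays empty when no alternate allele is seen
    freq_field = ",".join(f"{value:.2f}" for value in freqs.values())
    return "\t".join([chrom, str(pos), ref, alt_field, freq_field])


# Hypothetical site: 10 reference calls, 4 missing, and 10 each of two alternates.
calls = ["A"] * 10 + ["N"] * 4 + ["G"] * 10 + ["T"] * 10
alts, freqs = summarize_site("A", calls)
print(alts, freqs)
print(format_site("1", 500, "A", calls))
```

Each allele in the example is counted 10 times out of 30 non-missing calls, so all three frequencies come out at one third regardless of how many samples are missing.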

 

Thanks much in advance,

genome matrix snp vcf allele frequency • 1.5k views

Please define "handling this type of data". Most Linux/Unix tools can do the job by streaming the data through a pipeline.


By "handling this type of data" I mean whether there is any tool that can take the genome matrix file as input for analysis. A Google search did not turn up any useful information about tools for analyzing genome matrix files.

Additionally, I cannot view the data structure because the file is so large. Seeing it would help me understand what the file really contains, so I was also asking whether there is a way to extract a small portion of this large file.

Sorry for any confusion I may have caused.


The method I use to get a view of the data's structure is to run head -n 10 <file>, then gradually increase -n (10 -> 50 -> 100) until I understand how the data is structured.


I will try it once I get a chance to access the data. Thanks very much for the code.

5.7 years ago

I don't know what a "genome matrix file" is, but if it's a readable text file that has been gzipped, use

gunzip -c file.gz | head -20

to understand the structure of the file.
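If the inspection needs to happen inside a script rather than at the shell, roughly the same thing can be done in Python. This is a small sketch, with `peek_gz` being a name invented for this example; it relies on the standard `gzip` module, which decompresses as it reads, so only the first lines are ever processed.

```python
import gzip
from itertools import islice
from typing import List


def peek_gz(path: str, n: int = 20) -> List[str]:
    """Return the first n lines of a gzipped text file without reading the rest."""
    # "rt" opens the compressed stream in text mode; islice stops after n lines.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example call (file.gz stands in for the real file name):
#     for line in peek_gz("file.gz", 20):
#         print(line)
```

Since nothing is extracted to disk and only a handful of lines are held in memory, this avoids the freeze that comes from opening the whole 14.5 GB file in an editor.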

Other tools that help when working with large files are

wc -l (to count lines) and du -hs (to check the size on disk)

Vim also has a plugin called LargeFile, which lets you edit such a file in the terminal itself.
