Question

Gen file id reordering

8.9 years ago

maroisf ▴ 10

Hello Biostars,

I recently imputed a genotyped dataset using SHAPEIT and IMPUTE2. The resulting files are in the Oxford gen/sample format. Because I am using the latest 1000 Genomes reference dataset, I end up with massive gen files: chr is ~130 Gb. I would like to reorder the IDs in the resulting gen/sample files. I looked for such an option in software such as GTOOL and QCTOOL, but I could not find one.

Is anyone aware of software that can do this, or do I have to code something myself?

Thank you

François

Imputation genome SNP • 2.5k views

updated 16 months ago by Ram 43k • written 8.9 years ago by maroisf ▴ 10

Ram · Answer 1 · 2015-06-18

8.9 years ago

biocyberman ▴ 860

Assuming the gene IDs are in the first column and the file is plain text, you can do this on Linux:

# extract the header line:
head -n 1 unsorted.file.txt > header.txt

# sort needs the temporary directory to exist:
mkdir -p ./tmp

# test on the first 30 lines and see how it looks (header dropped, sorted on column 1 only)
head -n 30 unsorted.file.txt | sed -e '1d' | sort --human-numeric-sort --temporary-directory ./tmp --parallel 10 --key 1,1 > sorted.file.tsv

# read more about the 'sort' command if necessary to modify the output
man sort

# final sort command (overwrites the test output above):
sed -e '1d' unsorted.file.txt | sort --human-numeric-sort --temporary-directory ./tmp --parallel 10 --key 1,1 > sorted.file.tsv

# prepend the header line with sed (GNU sed syntax):
sed -i -e "1i $(cat header.txt)" sorted.file.tsv

updated 16 months ago by Ram 43k • written 8.9 years ago by biocyberman ▴ 860

Thank you for your answer.

I'm sorry, I think my question was not clear.

Here is an explanation of the Gen file format, taken from the Oxford website:

Suppose you want to create a genotype for 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG

The correct gen file would be

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1

Along with the Gen file there is a SAMPLE file:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0

The order of the IDs (persons) in the sample file corresponds to the column order in the Gen file, and each ID in the sample file is associated with 3 columns in the Gen file. In this example, columns 6, 7 and 8 of the Gen file therefore correspond to ID 1, and columns 9, 10 and 11 correspond to ID 2.
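To make that mapping concrete, here is a small sketch that prints the three Gen file columns belonging to the k-th individual. It assumes the layout of the example above (five leading columns, then one triplet per individual); the function name gen_columns_for_sample is only illustrative.

```shell
# Print the three 1-based Gen file column numbers for the k-th individual.
# k counts the data rows of the sample file (rows after its two header lines).
# Assumes 5 leading columns (SNP id, rs id, position, allele A, allele B).
gen_columns_for_sample() {
    first=$(( 5 + 3 * ($1 - 1) + 1 ))
    echo "$first $((first + 1)) $((first + 2))"
}

gen_columns_for_sample 1   # prints: 6 7 8
gen_columns_for_sample 2   # prints: 9 10 11
```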

My problem is that I would like to change the order of the IDs in the sample file, and hence the column order in the Gen file, keeping in mind that I have over 3,000 IDs in my sample file and over 6,000,000 lines × 10,000 columns in my Gen file.
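One possible way to do this is to stream the Gen file through awk one line at a time, so memory use stays small no matter how many SNPs there are. The sketch below is not an existing tool. It makes several assumptions: the function name reorder_gen and the helper file order.txt (the wanted ID_1 values, one per line) are hypothetical, ID_1 values are unique, no input file is empty, and every line has five leading columns followed by one triplet per individual.

```shell
# Usage (all file names are hypothetical):
#   reorder_gen old.sample order.txt in.gen new.sample new.gen
# order.txt lists the wanted ID_1 values, one per line, in the new order.
reorder_gen() {
    # 1) New sample file: keep the two header lines, then emit the
    #    data rows in the order given by order.txt.
    awk '
        NR == FNR { if (FNR <= 2) print; else row[$1] = $0; next }
        !($1 in row) { print "unknown ID: " $1 > "/dev/stderr"; exit 1 }
        { print row[$1] }
    ' "$1" "$2" > "$4" || return 1

    # 2) New Gen file, streamed one SNP line at a time.
    awk '
        FNR == 1 { file++ }                                  # count input files
        file == 1 { if (FNR > 2) old[$1] = FNR - 2; next }   # ID -> old position
        file == 2 { perm[++n] = old[$1]; next }              # new -> old position
        {
            printf "%s %s %s %s %s", $1, $2, $3, $4, $5
            for (j = 1; j <= n; j++) {
                c = 5 + 3 * (perm[j] - 1)    # last column before this triplet
                printf " %s %s %s", $(c + 1), $(c + 2), $(c + 3)
            }
            print ""
        }
    ' "$1" "$2" "$3" > "$5"
}
```

With order.txt containing 2 then 1, the SNP2 line of the example above would come out as "SNP2 rs2 2000 G T 0 1 0 1 0 0". On a ~130 Gb file this will take a long time but needs no sorting or temporary files. IDs left out of order.txt are simply dropped from both outputs.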

Thank you

updated 16 months ago by Ram 43k • written 8.9 years ago by maroisf ▴ 10