Question: Gen file id reordering
maroisf (Canada) wrote, 5.3 years ago:

Hello Biostars,

I recently imputed a genotyped dataset using SHAPEIT and IMPUTE2. The resulting files are in the Oxford gen/sample format. Because I am using the latest 1000 Genomes reference dataset, I end up with massive gen files: chr is ~130 Gb. I would like to reorder the IDs in the produced gen/sample files. I looked in software such as GTOOL and QCTOOL for such an option, but I cannot find one.

Is anyone aware of software that can do this, or do I have to code something myself?

Thank you

François

snp imputation genome • 1.5k views
biocyberman (Denmark) wrote, 5.3 years ago:

Assuming the gene IDs are in the first column and the file is plain text, you can do this on Linux:

    # extract the header line:

    head -n 1 unsorted.file.txt > header.txt

    # test on the first 30 lines and see how it looks
    # (the temporary directory must exist before sort uses it)

    mkdir -p ./tmp

    head -n 30 unsorted.file.txt | sed -e '1d' | sort --human-numeric-sort --temporary-directory ./tmp --parallel 10 --key 1 > sorted.file.tsv

    # read more about the 'sort' command if necessary to modify the output.

    man sort

    # final sort command:

    sed -e '1d' unsorted.file.txt | sort --human-numeric-sort --temporary-directory ./tmp --parallel 10 --key 1 > sorted.file.tsv

    # prepend the header line with sed:

    sed -i -e "1i $(cat header.txt)" sorted.file.tsv


Thank you for your answer.

I'm sorry, I think my question was not clear.

Here is an explanation of the gen file format, taken from the Oxford website:

############################################

Suppose you want to create a genotype for 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG

The correct gen file would be

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1

##########################################

Along with the gen file comes a sample file:

##########################################

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
##########################################

The order of the IDs (individuals) in the sample file corresponds to the column order in the gen file, and each ID in the sample file is associated with 3 columns in the gen file. In this example, columns 6, 7 and 8 of the gen file correspond to ID 1, and columns 9, 10 and 11 correspond to ID 2.
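
To make the column arithmetic explicit, here is a small sketch. The function name `gen_columns` is made up for illustration; it only assumes the layout described above (5 leading columns, then 3 probability columns per individual, in sample-file order).

```shell
# Print the three gen-file columns that belong to the k-th individual
# of the sample file (k is 1-based, counting rows after the two header lines).
# The gen file starts with 5 leading columns (SNP ID, rs ID, position,
# allele A, allele B), followed by 3 columns per individual.
gen_columns() {
    k=$1
    first=$(( 5 + 3 * (k - 1) + 1 ))
    echo "$first $(( first + 1 )) $(( first + 2 ))"
}

gen_columns 1   # prints: 6 7 8
gen_columns 2   # prints: 9 10 11
```

With over 3000 individuals, the last individual therefore sits somewhere beyond column 9000, which matches the roughly 10 000 columns of the real file.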

My problem is that I would like to change the order of the IDs in the sample file, and hence the column order in the gen file, keeping in mind that I have over 3000 IDs in my sample file and over 6 000 000 lines × 10 000 columns in my gen file.

Thank you
 

 

Reply written 5.3 years ago by maroisf
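
If no existing tool fits, the reordering described in the reply could be scripted roughly as follows. This is only a sketch under stated assumptions, not a tested solution for a 130 Gb file: the function name `reorder_gen` and the order file (one ID_1 value per line, in the desired order) are hypothetical, ID_1 is assumed to be unique, fields are assumed to be space-separated, and the sample file is assumed to have exactly two header lines and no blank lines.

```shell
#!/bin/sh
# Sketch: reorder the individuals of an Oxford gen/sample pair.
# Usage: reorder_gen ORDER SAMPLE GEN OUT_SAMPLE OUT_GEN
#   ORDER is a hypothetical text file with one ID_1 value per line.
reorder_gen() {
    order=$1 sample=$2 gen=$3 out_sample=$4 out_gen=$5

    # 1) Sample file: keep both header lines, then print the rows in the
    #    requested order. Stop if an ID is not present in the sample file.
    awk '
        FILENAME == ARGV[1] { if (NF) want[++n] = $1; next }
        FNR <= 2            { print; next }
                            { row[$1] = $0 }
        END {
            for (i = 1; i <= n; i++) {
                if (!(want[i] in row)) {
                    print "unknown ID: " want[i] > "/dev/stderr"
                    exit 1
                }
                print row[want[i]]
            }
        }
    ' "$order" "$sample" > "$out_sample" || return 1

    # 2) Gen file: read it once, line by line. For each SNP, print the
    #    5 leading columns, then each individual triplet in the new order.
    awk '
        FILENAME == ARGV[1] { if (NF) want[++n] = $1; next }
        FILENAME == ARGV[2] { if (FNR > 2) pos[$1] = FNR - 2; next }
        {
            printf "%s %s %s %s %s", $1, $2, $3, $4, $5
            for (i = 1; i <= n; i++) {
                c = 5 + 3 * (pos[want[i]] - 1)   # columns before this triplet
                printf " %s %s %s", $(c + 1), $(c + 2), $(c + 3)
            }
            printf "\n"
        }
    ' "$order" "$sample" "$gen" > "$out_gen"
}

# Demo on two SNPs of the example above: swap individuals 1 and 2.
tmp=$(mktemp -d)
printf '%s\n' 'ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1' \
              '0 0 0 D D C C P B' \
              '1 1 0.007 1 2 0.0019 -0.008 1.233 1' \
              '2 2 0.009 1 2 0.0022 -0.001 6.234 0' > "$tmp/old.sample"
printf '%s\n' 'SNP2 rs2 2000 G T 1 0 0 0 1 0' \
              'SNP5 rs5 5000 A G 0 1 0 0 0 1' > "$tmp/old.gen"
printf '%s\n' 2 1 > "$tmp/order.txt"
reorder_gen "$tmp/order.txt" "$tmp/old.sample" "$tmp/old.gen" \
            "$tmp/new.sample" "$tmp/new.gen"
cat "$tmp/new.gen"
# SNP2 rs2 2000 G T 0 1 0 1 0 0
# SNP5 rs5 5000 A G 0 0 1 0 1 0
```

Because awk handles one SNP line at a time, memory use depends on the number of individuals rather than on the size of the gen file, so the approach should in principle scale to very large files; the run time on 6 000 000 lines would need to be checked on a subset first (for example with `head -n 1000`).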