Question: Phased Genotypes information
paolo002 wrote:

Hi all, first of all I apologise if this has been asked in previous posts, but I have not been able to find a solution to my problem.

I have a data frame with a list of SNPs, their locations, and the Reference (REF) and Alternate (ALT) alleles. In addition, I have the phased genotypes for a number of individuals.

Example:

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    0|1  0|0  1|1
rs1111786  16   975544  T    A    0|0  0|1  1|0
rs986355   7    75987   G    T    1|1  0|1  1|1
rs2256743  21   442324  G    C    1|0  0|1  0|1

In the example I have only 4 SNPs and 3 individuals, but the real list is much larger. I would like to replace each genotype code with the corresponding alleles, based on the REF and ALT columns:

Desired output:

SNP        CHR  POS     REF  ALT  ID1  ID2  ID3
rs2754554  1    8656    A    C    A|C  A|A  C|C
rs1111786  16   975544  T    A    T|T  T|A  A|T
rs986355   7    75987   G    T    T|T  G|T  T|T
rs2256743  21   442324  G    C    C|G  G|C  G|C

The output is based on my understanding that 0 means the reference allele and 1 means the alternate allele. Any help is highly appreciated.


Hi, thanks for your solutions to the problem. Sorry, the dots are there by mistake, so please ignore them; the rest is correct. Thanks!

written 6 months ago by paolo002
Pierre Lindenbaum wrote:

Using awk (I don't understand the meaning of those dots...):

awk  '(NR==1) {print; next;}{printf("%s %s %s %s %s",$1,$2,$3,$4,$5);for(i=6;i<=NF;i++) {split($i,a,/[\|\.]/);printf("\t%s|%s",(a[1]=="0"?$4:$5),(a[2]=="0"?$4:$5));} printf("\n");}' input.txt
SNP       CHR    POS     REF  ALT  ID1  ID2 ID3
rs2754554 1 8656 A C    A|C    A|A    C|C
rs1111786 16 975544 T A    T|T    T|A    A|T
rs986355 7 75987 G T    T|T    G|T    T|T
rs2256743 21 442324 G C    C|G    G|C    G|C
finswimmer wrote:

What are these dots?

$ awk -v OFS="\t" 'NR>1 { for(i=6;i<=NF;i++) {gsub("0",$4,$i); gsub("1",$5,$i) }}1' input.txt|sed 's|\.\.|\.|g'

EDIT: Of course Pierre was faster ;)
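
For anyone who wants every column tab-delimited, here is a minimal sketch along the same lines as the two one-liners above. It is not taken from either answer: the function name decode_genotypes is made up for illustration, and it assumes whitespace-delimited input with a header line, REF in column 4, ALT in column 5, and genotypes such as 0|1 from column 6 onwards. Any code other than 0 or 1 is passed through unchanged rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration only).
# Assumption: column 4 = REF, column 5 = ALT, columns 6+ = phased genotypes.
decode_genotypes() {
    awk '
    BEGIN { OFS = "\t" }
    NR == 1 { $1 = $1; print; next }        # rebuild the header with tabs
    {
        for (i = 6; i <= NF; i++) {
            n = split($i, hap, "[|]")       # one element per haplotype
            out = ""
            for (j = 1; j <= n; j++) {
                if (hap[j] == "0")      allele = $4
                else if (hap[j] == "1") allele = $5
                else                    allele = hap[j]   # unknown code: keep as is
                out = (j > 1 ? out "|" : "") allele
            }
            $i = out                        # assigning a field re-joins the line with OFS
        }
        print
    }' "$1"
}
```

It could then be called as decode_genotypes input.txt > decoded.txt. Splitting on the pipe and comparing whole tokens, rather than substituting characters, keeps a code such as 10 from being rewritten digit by digit.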
