Question

mono-allelic to bi-allelic (a unix tools question?)

0

2.2 years ago

Zhitian Wu ▴ 60

Hi, I am using PLINK to perform quality control of my genotype data. All of the genotypes are homozygous but some are labelled as missing.

This is what my data looks like:

SNP1   A T C G C N

But PLINK requires each genotype to be bi-allelic (two allele columns per genotype), so I want it to look like this:

SNP1   A A T T C C G G C C N N

There are more than 10 million SNPs, so I am looking for the most efficient way to do this. So far I only know a little sed and regex, and this is my code:

sed -i 's/\([ATCGN]\)\>/\1\t\1/g' chr05.tped

sed Regex PLINK • 836 views

ADD COMMENT • link 2.2 years ago by Zhitian Wu ▴ 60

0

This would change the N in SNP1 too. For this example, I came up with this:

$ echo 'SNP1 A T G C N' | sed -r 's/\w/& &/5g'

SNP1 A A T T G G C C N N

What is the field separator between SNP1 and the bases? Do not use -i (in-place editing) until you are sure the command is correct.
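One way to follow that advice is a dry run on a single sample line, printing to the terminal instead of editing the file. The sketch below assumes GNU sed (for `\t`), TAB-separated fields, single-letter genotype fields, and one leading ID column; the helper name `double_alleles` is made up for illustration.

```shell
# Dry run: no -i, so nothing on disk is modified.
# Duplicates every single-letter field that follows a TAB;
# the first column has no TAB in front of it, so it is left alone.
double_alleles() {
  sed 's/\t\([ATCGN]\)/\t\1\t\1/g'
}

# Try it on one hand-made line before touching the real file.
printf 'SNP1\tA\tT\tC\tG\tC\tN\n' | double_alleles
```

Note that any other TAB-preceded column starting with A, T, C, G or N (for example a SNP ID in the second column of a real .tped) would also be hit, so check the output on a few real lines first.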

ADD REPLY • link 2.2 years ago by cpad0112 21k

0

Thanks for your reply. The content of this file is actually quite simple, so I added the end-of-word anchor \> to stop the expression from matching the first column (the SNP id).

The field separator is TAB; I typed three extra spaces between SNP1 and the bases to make it look nicer here.

I understand your expression. I think appending a TAB and the same letter would be faster than replacing the original letter. Is it possible to do this without a replacement?

ADD REPLY • link 2.2 years ago by Zhitian Wu ▴ 60

0

Do not post images of the data.

ADD REPLY • link 2.2 years ago by cpad0112 21k

score 1 · Answer 1 · 2022-03-02

1

2.2 years ago

Zhitian Wu ▴ 60

For this problem, using awk to duplicate columns is much faster than using regular expressions.

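# list is a one-line file holding the awk field list, e.g.
# $1,$1,$2,$2,...
# The double quotes let the shell paste the contents of list into the awk
# program; $(< list) is bash/ksh/zsh shorthand for $(cat list).

awk -v OFS='\t' "{print $(< list)}" genotype-file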
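With millions of SNPs and many samples, the field list is too long to type by hand, so it can be generated. This is a minimal sketch, not necessarily how the list was originally built; `make_list` and its optional `KEEP` argument (the number of leading columns to leave undoubled) are hypothetical names.

```shell
# make_list NFIELDS [KEEP]
# Prints an awk field list in which the first KEEP fields appear once
# and every later field appears twice, e.g. $1,$2,$2,$3,$3 for "3 1".
make_list() {
  awk -v n="$1" -v keep="${2:-0}" 'BEGIN {
    for (i = 1; i <= n; i++) {
      s = s (i > 1 ? "," : "") "$" i   # add the field once
      if (i > keep) s = s ",$" i       # add it again if it must be doubled
    }
    print s
  }'
}

# Small end-to-end check: one ID column plus three genotype columns.
printf 'SNP1\tA\tT\tN\n' |
  awk -F'\t' -v OFS='\t' "{print $(make_list 4 1)}"
```

For a real file, the field count could be taken from its first line, e.g. `head -n 1 genotype-file | awk -F'\t' '{print NF}'`, and the output of `make_list` redirected into `list`.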

ADD COMMENT • link 2.2 years ago by Zhitian Wu ▴ 60