mono-allelic to bi-allelic (a unix tools question?)
2.2 years ago
Zhitian Wu ▴ 60

Hi, I am using PLINK to perform quality control of my genotype data. All of the genotypes are homozygous but some are labelled as missing.

This is what my data looks like:

SNP1   A T C G C N

PLINK requires each genotype to have two alleles, so I want it to look like this:

SNP1   A A T T C C G G C C N N

There are more than 10 million SNPs, so I am looking for the most efficient way to do this. So far I only know a little sed and regex, and this is my code:

sed -i 's/\([ATCGN]\)\>/\1\t\1/g' chr05.tped
sed Regex PLINK • 836 views

This would change the N in SNP1 too. For this example, I came up with this:

$ echo 'SNP1 A T G C N' | sed -r 's/\w/& &/5g'

SNP1 A A T T G G C C N N

What is the field separator between SNP1 and the bases? Do not use -i when you are not sure of the code.
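
To make the point about -i concrete, here is a minimal sketch of previewing the substitution before committing to it. It assumes GNU sed (the \> anchor and \t in the replacement are GNU extensions), and sample.tped is a made-up two-line file standing in for the real data.

```shell
# Hypothetical two-line sample standing in for chr05.tped (TAB-separated).
workdir=$(mktemp -d)
printf 'SNP1\tA\tT\tC\tG\tC\tN\nSNP2\tG\tG\tN\tA\tT\tC\n' > "$workdir/sample.tped"

# Without -i, sed writes to stdout and leaves the input file untouched,
# so the first few lines can be inspected safely.
preview=$(sed 's/\([ATCGN]\)\>/\1\t\1/g' "$workdir/sample.tped" | head -n 1)
printf '%s\n' "$preview"

# Once the preview looks right, write to a new file rather than editing in place.
sed 's/\([ATCGN]\)\>/\1\t\1/g' "$workdir/sample.tped" > "$workdir/sample.doubled.tped"
```

If in-place editing is still wanted afterwards, GNU sed's -i.bak keeps a backup copy of the original file.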


Thanks for your reply. The content of this file is actually quite simple, so I added the end-of-word anchor \> to keep the expression from matching the first column (the SNP ID).

The field separator is TAB; I typed three extra spaces between SNP1 and the bases to make it look nicer here.

I understand your expression. I would expect that appending a TAB and the same letter is faster than replacing the original letter. Is it possible to do this without a substitution?
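
One way to avoid a regex substitution altogether is to let awk split the line into fields and print each genotype field twice. The following is only a sketch under stated assumptions: the input is TAB-separated, and the function name double_genotypes and the KEEP variable are made up for illustration. KEEP is the number of leading columns to copy once; it defaults to 1 to match the example above, while a standard PLINK .tped file has four leading columns (chromosome, variant ID, genetic position, base-pair position).

```shell
# double_genotypes: copy the first KEEP columns once, print every later column twice.
# Assumes TAB-separated input; KEEP defaults to 1 (a single ID column).
double_genotypes() {
    awk -v keep="${KEEP:-1}" 'BEGIN { FS = OFS = "\t" } {
        out = $1
        for (i = 2; i <= NF; i++)
            out = out OFS (i <= keep ? $i : $i OFS $i)
        print out
    }' "$@"
}

# Example on the line from the question (TAB-separated).
printf 'SNP1\tA\tT\tC\tG\tC\tN\n' | double_genotypes
```

For a real .tped file this would be called as KEEP=4 double_genotypes chr05.tped > chr05.doubled.tped, writing to a new file rather than editing in place.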


Do not post images of the data.

2.2 years ago
Zhitian Wu ▴ 60

For this problem, using awk to duplicate columns is much faster than using a regular expression.

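# "list" is a one-line file holding the awk field list, like this:
# $1,$1,$2,$2.....
# (each column is written twice; a column that must not be doubled is written once)

# bash expands $(< list) to the contents of "list" before awk runs,
# so awk sees {print $1,$1,$2,$2.....}
awk -v OFS='\t' "{print $(< list)}" genotype-file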
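
The answer does not say how the list file is produced. One possible way, sketched here as an assumption rather than the method actually used, is to generate it from the column count of the first line. The helper name make_field_list and its optional second argument (the number of leading columns to keep single) are hypothetical.

```shell
# make_field_list N [KEEP]: print an awk field list in which the first KEEP
# fields appear once and every later field appears twice (hypothetical helper).
make_field_list() {
    awk -v n="$1" -v keep="${2:-0}" 'BEGIN {
        for (i = 1; i <= n; i++) {
            s = s (i > 1 ? "," : "") "$" i
            if (i > keep) s = s ",$" i
        }
        print s
    }'
}

workdir=$(mktemp -d)
printf 'A\tT\tC\n' > "$workdir/genotype-file"       # made-up one-line sample

# Count the columns on the first line, then build the list from that count.
ncol=$(head -n 1 "$workdir/genotype-file" | awk -F'\t' '{ print NF }')
make_field_list "$ncol" > "$workdir/list"

# $(cat ...) is the portable spelling of bash's $(< ...).
awk -v OFS='\t' "{print $(cat "$workdir/list")}" "$workdir/genotype-file"
```

Calling make_field_list "$ncol" 1 instead would leave the first column (for example the SNP ID) undoubled.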