Delete ID only if not present in next line (lines are in pairs)
6.5 years ago
waqasnayab ▴ 250

Dear Community,

This question is a follow-up to "Copy Sample ID from VCF file to ID column".

I have a file like:

rsid1:sample1,sample2,sample3
rsid2:sample1,sample5
rsid3:sample4, sample6
rsid4:sample6

Each line is paired with the line that follows it: rsid1 and rsid2 form a pair, rsid3 and rsid4 form a pair, and so on. I want to print only those samples that are present in both lines of a pair, so the desired output would be:

rsid1:sample1
rsid2:sample1
rsid3:sample6
rsid4:sample6

Alternatively, is it possible to get this output directly from a VCF file? (Pierre answered the question linked above, but at that time I did not ask about this pairing feature.)

Any help appreciated.

Thanks,

Waqas.

SNP sequencing • 1.4k views
6.5 years ago

Using awk and sort:

cat input.txt |\
awk -F ':' '{n=split($2,a,/[, ]+/);for(i=1;i<=n;i++) {printf("%d %s %s\n",int((NR-1)/2),$1,a[i]);}}' |\
sort -t ' ' -k1,1 -k3,3   |\
awk '{if(NR>1 && i1==$1 && i3==$3 ) printf("%s:%s\n%s:%s\n",i2,$3,$2,$3);i1=$1;i2=$2;i3=$3;}'


rsid1:sample1
rsid2:sample1
rsid3:sample6
rsid4:sample6
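
For readers who want to follow the pairing logic step by step, the same idea can be sketched as a single awk pass with no sort. This is only an illustration under stated assumptions, not the method used in the answer above: the function name `shared_in_pairs` is made up for this example, and the input is assumed to have an even number of `rsid:sample,sample,...` lines.

```shell
# shared_in_pairs is a hypothetical helper name (not from the original answer).
# It treats lines 1-2, 3-4, ... as pairs and prints samples found in both lines.
shared_in_pairs() {
  awk -F ':' '
    NR % 2 == 1 {
      # First line of a pair: remember its ID and its samples.
      split("", seen)                      # portable way to clear the array
      first_id = $1
      n = split($2, s, /[, ]+/)            # tolerate "a,b" and "a, b"
      for (i = 1; i <= n; i++) if (s[i] != "") seen[s[i]] = 1
      next
    }
    {
      # Second line of a pair: report every sample also seen on the first line.
      n = split($2, s, /[, ]+/)
      for (i = 1; i <= n; i++)
        if (s[i] in seen) {
          printf("%s:%s\n%s:%s\n", first_id, s[i], $1, s[i])
          delete seen[s[i]]                # do not report a repeated sample twice
        }
    }
  ' "$@"
}
```

It can be called as `shared_in_pairs input.txt` or fed from a pipe. Because nothing is sorted, the output keeps the order of the input pairs.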

Pierre, the man who always has an awk one-liner for your case! :D


Yes, it worked really well, in fact perfectly!

Thanks Pierre

6.5 years ago
$ awk -F: '{split($2,a,",");for(i in a)print $1"\t"a[i]}' test | sort -k2  | uniq -D -f1 | sed 's/\t/:/g'



rsid1:sample1
rsid2:sample1
rsid3:sample6
rsid4:sample6

input (test):

rsid1:sample1,sample2,sample3
rsid2:sample1,sample5
rsid3:sample4,sample6
rsid4:sample6
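
One caveat on the one-liner above: it groups by sample across the whole file, so a sample shared by two lines from different pairs (say rsid1 and rsid3) would also be printed. It also splits on a bare comma, so "sample4, sample6" with a space would not match. A possible pair-aware variant is sketched below. It assumes GNU coreutils (for `uniq -D`) and IDs without blanks; the function name `shared_in_pairs_uniq` is invented for this example.

```shell
# shared_in_pairs_uniq is a hypothetical name; uniq -D is a GNU coreutils option.
shared_in_pairs_uniq() {
  tab="$(printf '\t')"
  awk -F ':' -v OFS='\t' '{
      pair = int((NR - 1) / 2)             # 0 for lines 1-2, 1 for lines 3-4, ...
      n = split($2, s, /[, ]+/)            # tolerate "a,b" and "a, b"
      for (i = 1; i <= n; i++) if (s[i] != "") print $1, pair, s[i]
  }' "$@" |
  sort -t "$tab" -k2,2n -k3,3 -k1,1 |      # group by pair, then sample, then ID
  uniq -D -f1 |                            # keep rows whose (pair, sample) repeats
  awk -F '\t' '{ print $1 ":" $3 }'        # back to rsid:sample
}
```

Usage would be `shared_in_pairs_uniq test`. The extra pair column is what stops matches between unrelated lines; `uniq -f1` skips the ID field and compares only the pair index and sample name.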