print only columns with data from every line
2.7 years ago
HL ▴ 10

Hi, I have a VCF file with about 60 000 columns. Here is an example of the first three lines:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10022-20416-17  10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18  10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18  10070-20895-17  10072-20901-17  10074-20904-17  10080-20908-17  10109-34224-18  1011-22957-18   10118
2       179391728       .       C       T       1109.77 PASS    BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1   GT:AD:DP:GQ:PL  ./.:.:.:.:.     ./.:.:.:.:.     0/1:44,47:91:99:1053,0,1069     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.
2       179391738       .       C       G       2090.77 PASS    BaseQRankSum=0.25;ClippingRankSum=0;ExcessHet=3.0103;FS=2.282;MQ=60;MQRankSum=0;QD=14.32;ReadPosRankSum=0.857;SOR=0.953;DP=370;AF=0.5;MLEAC=1;MLEAF=0.5;AN=6;AC=3       GT:AD:DP:GQ:PL  ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     0/1:88,68:156:99:2586,0,4687     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.

So the columns are many different sample numbers, and each sample column holds genotype information for only some of the variants. I would like the output to show, for every line, only the column that actually contains information, like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10025-34469-18B
2       179391728       .       C       T       1109.77 PASS    BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1   GT:AD:DP:GQ:PL    0/1:44,47:91:99:1053,0,1069

It would also be important to see, in the header, the sample number that this GT:AD:DP:GQ:PL information belongs to. I think this should be possible with awk, but I don't know how. It would be really good if this could be done with Unix tools.
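A minimal awk sketch of this idea, under some assumptions: the file is tab-delimited, the samples start at column 10, and an uncalled genotype always begins with `./.` (or `.|.`). Because a different sample can be called on each line, the header columns cannot stay aligned with the data, so this sketch prints each surviving genotype as `sample=genotype` instead. The function name `label_called` and the `SAMPLE=GENOTYPE` header label are made up for illustration.

```shell
# Hypothetical helper: reads a VCF on stdin, writes the reduced lines to stdout.
label_called() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^##/ { print; next }                        # meta lines pass through unchanged
    /^#CHROM/ {                                  # remember each sample name by column index
      for (i = 10; i <= NF; i++) name[i] = $i
      print $1, $2, $3, $4, $5, $6, $7, $8, $9, "SAMPLE=GENOTYPE"
      next
    }
    {
      out = $1
      for (i = 2; i <= 9; i++) out = out OFS $i  # the nine fixed columns are always kept
      for (i = 10; i <= NF; i++)
        if ($i !~ "^[.][/|][.]")                 # assumption: uncalled genotypes start with ./. or .|.
          out = out OFS name[i] "=" $i
      print out
    }'
}
```

It could be run as `label_called < input.vcf > called_only.tsv`, where both file names are placeholders. Note that the result is a plain table and no longer a valid VCF, since lines can end up with different numbers of columns.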

unix awk • 1.2k views

I don't understand the difference between the two examples.


At the end of every line, all the columns with empty genotype information have been removed.


I don't know if I correctly understood your question... Do you want to output lines where all your 60 000 patients have been genotyped?


No, those I can already print. For every line I would like to print just the one column that holds the genotype information, and not the remaining 60 000 empty columns. Right now the end of every line looks something like this:

./.:.:.:.:.     ./.:.:.:.:.     0/1:44,47:91:99:1053,0,1069     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.

and I want to print only the column that contains 0/1:44,47:91:99:1053,0,1069.


Hmm... this sounds like an XY problem. What do you want to do in the end?


I would like a file without these empty "./.:.:.:.:." columns, where every variant has its own genotype information printed at the end of the line, like this:

2       179391728       .       C       T       1109.77 PASS    BaseQRankSum=-2.601;ClippingRankSum=0;ExcessHet=3.0103;FS=0;MQ=60;MQRankSum=0;QD=11.81;ReadPosRankSum=0.626;SOR=0.76;DP=95;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1   GT:AD:DP:GQ:PL    0/1:44,47:91:99:1053,0,1069

At the end of the header it's okay to keep all the different samples, but it's not necessary.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10022-20416-17  10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18  10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18  10070-20895-17  10072-20901-17  10074-20904-17  10080-20908-17  10109-34224-18  1011-22957-18   10118

So basically, if a column contains "./.:.:.:.:.", it should not be printed.
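Taken literally, that rule might look like the following awk sketch. It assumes the file is tab-delimited and that every empty sample column is exactly the string `./.:.:.:.:.`, as in the lines shown here; a different FORMAT would need a different placeholder. Header lines are passed through untouched, and the function name `drop_uncalled` is hypothetical.

```shell
# Hypothetical helper: drops sample columns that equal the empty placeholder.
drop_uncalled() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }                         # header and meta lines are left as they are
    {
      line = $1
      for (i = 2; i <= NF; i++)
        if (i <= 9 || $i != "./.:.:.:.:.")       # columns 1-9 always stay; samples only if called
          line = line OFS $i
      print line
    }'
}
```

For example, `drop_uncalled < input.vcf > called_only.txt` (placeholder file names). The data lines will no longer line up with the sample names in the header, so the output is not a valid VCF for downstream tools.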


Yes, but why? You could just split the VCF per sample (see "Splitting vcf files to individual samples") and then use "bcftools view --exclude-uncalled" to keep only the called sites.


If I split the file by sample and every sample produces a new file, I would end up with over 60 000 different files, which does not sound very nice to go through.
