Extract only first individual from vcf files
2.2 years ago
User000 ▴ 690

Dear all,

I have multiple VCF files like this, each containing 3 individuals. I would like to extract only the first individual from each VCF and save it as a separate VCF file. Is this possible in batch, without specifying the individual's name, as I have many VCFs?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NAME1   NAME2   NAME3
chr1    1666251 .   G   A   139 MQ  BRF=0.46;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-143.73,0.0,-79.13:20:99:100:60 0/0:0.0,-12.67,-236.1:12:99:117:2   0/0:0.0,-16.47,-244.0:13:99:129:1
chr1    2408213 .   T   G   164 QD  BRF=0.05;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-21.02,0.0,-55.12:44:99:77:24   0/0:0.0,-13.94,-147.2:5:99:58:0 0/0:0.0,-8.73,-90.8:3:87:37:0
chr1    2408232 .   T   G   122 QD      BRF=0.04;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-16.82,0.0,-33.02:51:99:57:23   0/0:0.0,-10.38,-111.1:9:99:37:0 0/0:0.0,-5.92,-60.6:8:59:25:0

Expected output:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NAME1
chr1    1666251 .   G   A   139 MQ  BRF=0.46;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-143.73,0.0,-79.13:20:99:100:60
chr1    2408213 .   T   G   164 QD  BRF=0.05;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-21.02,0.0,-55.12:44:99:77:24
chr1    2408232 .   T   G   122 QD      BRF=0.04;FR=0.1667  GT:GL:GOF:GQ:NR:NV  1/0:-16.82,0.0,-33.02:51:99:57:23
vcf • 1.7k views

Simple, for a single file:

$ cut -f1-10 -d$'\t' input.vcf

For many files (drop --dry-run, which only prints the commands, to actually run them):

$  parallel --plus --dry-run  cut -f 1-10 -d\$\'\t\' {} ">" {.}_out.vcf ::: *.vcf
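
If GNU parallel is not installed, a plain shell loop can do the same job. This is only a minimal sketch, assuming uncompressed, tab-delimited VCFs in the current directory; the function name `cut_first_sample` and the output directory name are made up for illustration. Writing into a separate directory keeps the results from being picked up by `*.vcf` on a second run.

```shell
#!/bin/sh
# Hypothetical helper: copy every *.vcf in the current directory into
# OUTDIR, keeping the 9 fixed VCF columns plus the first sample column.
cut_first_sample() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    for f in *.vcf; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Tab is cut's default delimiter; "##" header lines normally
        # contain no tabs and are therefore passed through unchanged.
        cut -f 1-10 "$f" > "$outdir/$f"
    done
}

# Usage, from the directory that holds the VCFs:
#   cut_first_sample first_sample
```

Like the one-liner above, this only trims columns; INFO fields are left exactly as they were.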

Sure, but the input could be vcf.gz or BCF, and bcftools will also update the AC field for the remaining sample.

The user can use zcat for .gz files.
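
A sketch of what that could look like for one compressed file. The function name `first_sample_gz` is hypothetical, and the output is written with plain gzip, so it is an assumption that ordinary gzip output is acceptable for your downstream tools; anything that needs an index (tabix, bcftools index) requires bgzip-compressed files instead.

```shell
#!/bin/sh
# Hypothetical helper: first_sample_gz IN.vcf.gz OUT.vcf.gz
# Decompress, keep the 9 fixed columns plus the first sample, recompress.
first_sample_gz() {
    # "gzip -dc" is the portable spelling of zcat.
    gzip -dc "$1" | cut -f 1-10 | gzip -c > "$2"
}

# Usage:
#   first_sample_gz input.vcf.gz input_out.vcf.gz
```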

Is the first individual's ID always "NAME1"? Something like this:

bcftools view -s NAME1 filename.vcf.gz

No, that's the problem.

I see, then go with Pierre's solution.

2.2 years ago
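# Find every bgzipped VCF under dir1 and dir2, one path per line
find dir1 dir2 -type f -name "*.vcf.gz" | while IFS= read -r F
do
   # List the sample names and keep only the first one
   bcftools query -l "${F}" | head -n 1 > samples.txt
   # Subset to that sample and write a compressed VCF next to the input
   bcftools view --samples-file samples.txt -O z -o "${F}.firstsample.vcf.gz" "${F}"
done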
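
A possible variant that skips the shared samples.txt file by passing the sample name directly with `-s`. This is a sketch, not a tested replacement: the function name `first_sample_bcftools` is hypothetical, and because `-s` takes a comma-separated list it assumes the sample name contains no comma.

```shell
#!/bin/sh
# Hypothetical helper: first_sample_bcftools IN.vcf.gz OUT.vcf.gz
first_sample_bcftools() {
    # The first line of "bcftools query -l" is the first sample in the header.
    sample=$(bcftools query -l "$1" | head -n 1)
    [ -n "$sample" ] || { echo "no samples found in $1" >&2; return 1; }
    # Assumption: the sample name contains no comma (-s splits on commas).
    bcftools view -s "$sample" -O z -o "$2" "$1"
}

# Usage:
#   first_sample_bcftools input.vcf.gz input.firstsample.vcf.gz
```

Not writing a temporary file also means several copies can run at the same time without overwriting each other's sample list.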

Thanks, could you please explain the code a bit? What are dir1 and dir2?

dir1 and dir2 are the directories that contain your VCFs. The code finds the VCFs, loops through them with while, gets the first sample ID from each file, then subsets the file to that sample.
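
To make the "get the first sample ID" step concrete: in a VCF the sample names start at column 10 of the `#CHROM` header line, so for an uncompressed file the name can also be read with awk. A minimal sketch, with a hypothetical function name `first_sample_name`:

```shell
#!/bin/sh
# Hypothetical helper: print the first sample name of an uncompressed VCF.
first_sample_name() {
    # Columns 1-9 of the "#CHROM" line are fixed; samples start at column 10.
    awk -F '\t' '$1 == "#CHROM" { print $10; exit }' "$1"
}

# Usage:
#   first_sample_name input.vcf
```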
