Multisample vcf to table
4
1
Entering edit mode
5.2 years ago
Myggan ▴ 20

Hi,

I've used freebayes for calling variants on a set of bamfiles with 1 sample in each bam file. I want to somehow get a good overview over the variants that also non-bioinformaticians can look at.

The VCF file looks something like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 CONTROL
chr1       46      .       ACACC   CCACA   8.58901e-15     .       AB=0;ABP=0;AC=0;AF=0;AN=5;AO=2;CIGAR=1X3M1X;DP=137;DPB=147;DPRA=0.314961;EPP=3.0103;EPPR=17.5948;GTI=0;LEN=5;MEANALT=1;MQM=33;MQMR=54.4254;NS=5;NUMALT=1;ODDS=78.0575;PAIRED=1;PAIREDR=0.977612;PAO=0;PQA=0;PQR=506;PRO=17;QA=14;QR=4464;RO=134;RPP=7.35324;RPPR=77.9423;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=83;SRP=19.6042;SRR=51;TYPE=complex;technology.ILLUMINA_HiSeq2500=1    GT:DP:RO:QR:AO:QA:GL    0:81:80:2614:0:0:0,-10  0:6:6:218:0:0:0,-10     0:10:8:262:2:14:0,-10   0:19:19:671:0:0:0,-10   0:21:21:699:0:0:0,-10
chr1       2146    .       G       A       9.16135e-15     .       AB=0;ABP=0;AC=0;AF=0;AN=5;AO=28;CIGAR=1X;DP=251;DPB=251;DPRA=0;EPP=5.80219;EPPR=11.1996;GTI=0;LEN=1;MEANALT=1;MQM=18.9643;MQMR=57.3318;NS=5;NUMALT=1;ODDS=92.9516;PAIRED=0.964286;PAIREDR=0.977578;PAO=0;PQA=0;PQR=0;PRO=0;QA=633;QR=8052;RO=223;RPP=63.8115;RPPR=5.82445;RUN=1;SAF=11;SAP=5.80219;SAR=17;SRF=99;SRP=9.09627;SRR=124;TYPE=snp;technology.ILLUMINA_HiSeq2500=1   GT:DP:RO:QR:AO:QA:GL    0:145:136:4920:9:154:0,-10      0:27:22:788:5:129:0,-10 0:16:13:471:3:75:0,-10  0:13:9:333:4:100:0,-10  0:50:43:1540:7:175:0,-10

I've tried using both gatk and snpsifts variantotable/extractfields tools, but it will only sum all the INFO counts and not keep them separated.

My goal is to get something like this:

CHR POS EFF.FIELD1 REF ALT QUAL SAMPLE1.RO SAMPLE1.AO SAMPLE2.RO SAMPLE2.AO SAMPLE3.RO SAMPLE3.AO SAMPLE4.RO SAMPLE4.AO CONTROL.RO CONTROL.AO
chr1 46 "geneX" ACACC CCACA 8.58901e-15 80 0 6 0 8 2 19 0 21 0

But the current options would give something like:

 CHR POS EFF.FIELD1 REF ALT QUAL RO AO
 chr1 46 "geneX" ACACC CCACA 8.58901e- 134 2

My feeling is that this should be fairly common, so just want to make sure I'm not reinventing the wheel here. Anyone have any idea?

sequencing variant calling SNP • 3.8k views
ADD COMMENT
4
Entering edit mode
5.2 years ago
Lee Katz ★ 3.1k

You should use the latest version of bcftools, and you should get exactly what you want. However you would have to take time to make the -f string perfect. For example:

bcftools query --print-header -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT\t%RO\t%AO]\n' file.vcf.gz
ADD COMMENT
1
Entering edit mode

This worked very well! Thanks a lot!

ADD REPLY
0
Entering edit mode

Thanks it sounds like it could do the trick! I'll try it today, thanks!

ADD REPLY
1
Entering edit mode
3.4 years ago
kirannbishwa01 ★ 1.3k

Try vcf_simplify-v1.py https://github.com/everestial/vcf_simplify

ADD COMMENT
0
Entering edit mode
5.2 years ago
Ming Tang ★ 2.6k

Have you seen this? https://www.broadinstitute.org/gatk/blog?id=7089

ADD COMMENT
0
Entering edit mode

I did try the VariantsToTable from gatk, but it won't keep my samples separated from what I can see as it sums upp the observations for each sample. Should it be able to do that?

ADD REPLY

Login before adding your answer.

Traffic: 1550 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6