Question: vcf split by columns
0
gravatar for cristina_sabiers
12 months ago by
Spain
cristina_sabiers40 wrote:

Well I hope I can explain myself well,

The vcf files I got have in the INFO column all data together (AF, MLLD,SVTYPE, GENE...) split by "; "

when I put on excel and try to split in columns, the columns dont follow each other like it should be, because all data isnt equal.

Is any easy way to do this? so I dont need to fix it manually?

I hadnt done this file but Im wondering mostly all VCF are like that??? instead have on the head the requiered info? (precise, svtype, func....)

CHROM   POS ID  REF ALT QUAL    FILTER  INFO      FORMAT

chr1    68928   .   T   <CNV>   100.0   PASS    PRECISE=FALSE;SVTYPE=CNV;END=69412;LEN=484;NUMTILES=4;CONFIDENCE=0;PRECISION=2.04702;FUNC=[{'gene':'OR4F5'}]    GT:GQ:CN    ./.:0:2

chr1    871334  .   G   T   431.84  PASS    AF=0.488263;AO=104;DP=213;FAO=104;FDP=213;FR=.;FRO=109;FSAF=68;FSAR=36;FSRF=67;FSRR=42;FWDB=0.00178582;FXX=0;HRUN=2;LEN=1;MLLD=52.4929;QD=8.10962;RBI=0.00909038;REFB=-0.0299934;REVB=-0.00891325;RO=109;SAF=68;SAR=36;SRF=67;SRR=42;SSEN=0;SSEP=0;SSSB=0.0396347;STB=0.521838;STBP=0.536;TYPE=snp;VARB=0.0325404;OID=.;OPOS=871334;OREF=G;OALT=T;OMAPALT=T;FUNC=[{'transcript':'NM_152486.2','gene':'SAMD11','location':'intronic'}]   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:99:213:213:109:109:104:104:0.488263:36:68:67:42:36:68:67:42

Thanks

vcf • 596 views
ADD COMMENTlink modified 12 months ago by Vitis1.6k • written 12 months ago by cristina_sabiers40
1

While the split can be done using sed etc you would have the same problem (of unequal # of columns) if you try to import the data into excel :)

ADD REPLYlink written 12 months ago by genomax39k

Thanks genomax....

well..the fact is I dont want to have on excel, I thought use excel and get back my vcf file with the modify head and all INFO splited by columns

ADD REPLYlink written 12 months ago by cristina_sabiers40
2
gravatar for Pierre Lindenbaum
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum102k wrote:

try to use GATK VariantsToTable https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php to convert your VCF to a tabular format.

ADD COMMENTlink written 12 months ago by Pierre Lindenbaum102k

thanks Pierre, I will try it , hope so, my head is popping out today XD

ADD REPLYlink written 12 months ago by cristina_sabiers40
2
gravatar for Jorge Amigo
12 months ago by
Jorge Amigo10.0k
Santiago de Compostela, Spain
Jorge Amigo10.0k wrote:

here's a bcftools alternative. if you are interested in a set of variables rather than all of them, bcftools query can help you if you play around with the %INFO/TAG variable:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AC\t%INFO/AF\n' file.vcf.gz

but if you need all the variables present in the INFO column, then you'll have to read the file first to know which fields are present in it, and then you'll have to query the file using those fields previously extracted. if the vcf is well defined you can extract them out of the file headers:

FIELDS=$(bcftools view -h file.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%REF\t%ALT$FIELDS\n" file.vcf.gz

if not, you'll have to read the whole file looking for all the fields, and then query the file:

FIELDS=$(zgrep -v ^# file.vcf.gz \
| head -n 1000 \
| perl -ane '
while ($F[7] =~ /([^=;]+)=/g) { $fs{$1} = 1 }
END { for $f (sort keys %fs) { print "\\t\%INFO/$f" } }')
bcftools query -Hf "%CHROM\t%POS\t%REF\t%ALT$FIELDS\n" file.vcf.gz

the 1000 lines head is there just in case the file you're querying is too big, assuming that the first 1000 lines would contain all the present fields. you can remove it to be sure that you're working will absolutely all fields, but it can take time to process large files.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Jorge Amigo10.0k

thanks to all!

today that I have free day from work so I can try all your tips...thank you so much. Sadly yesterday my head wasn't in good shape :P

ADD REPLYlink modified 12 months ago • written 12 months ago by cristina_sabiers40

well I used Jorge'ś code, works like charm :)))

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER$FIELDS\n" result.vcf.gz > RESULTSexcel.vcf

Just one small thing, well two....In my excel table doesnt appear the field where says what is each column, would be really nice to have that info :) . And dunno why (probably I wrote wrong st) the column FORMAT dont appear.

my head: CHROM  POS  ID  REF ALT QUAL  FILTER   INFO     FORMAT

I tried also:

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS\n" V42.vcf.gz > RESULTSSSexcel

Error: no such tag defined in the VCF header: INFO/FORMAT

Thanks for all the help!!!

ADD REPLYlink written 12 months ago by cristina_sabiers40

the first one is my fault. in order to get the headers from bcftools query you have to add option -H:

FIELDS=$(bcftools view -h result.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -f "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER$FIELDS\n" result.vcf.gz > RESULTSexcel.vcf

the second one is your fault. firstly, you have to use the same vcf file name both at the $FIELDS definition and at the bcftools query; secondly, FORMAT is indeed not a column that bcftools query prints directly. it uses that column to print sample information such as genotype, depth,...

FIELDS=$(bcftools view -h V42.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -Hf "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS\n" V42.vcf.gz > RESULTSSSexcel
ADD REPLYlink written 12 months ago by Jorge Amigo10.0k

thanks for the information but then whats happen with the column that has the information of homozygous (0:0)...hetero....?

How I can get that field? I mean after INFO I have two colums more, I asume Format is the first one, whats happens with the second above?

GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   

0/1:99:165:164:83:83:81:81:0.493902:31:50:52:31:31:50:52:31

THANKS!! (y gracias para que nos entendamos....que mi ingles...)

ADD REPLYlink modified 12 months ago by genomax39k • written 12 months ago by cristina_sabiers40
1

the FORMAT column just specifies what type of data do the samples columns contain. you can query any of that FORMAT column fields by calling them into brackets (everything is perfectly described in the bcftools webpage). for instance, if you want to get the genotypes for all samples you'll just have to add [ %GT] at the end of the query:

FIELDS=$(bcftools view -h V42.vcf.gz \
| perl -ne 'if (/^##INFO=<ID=([^,]+)/) { print "\\t\%INFO/$1" }')
bcftools query -Hf "%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FORMAT\t%FILTER$FIELDS[\t%GT]\n" V42.vcf.gz > RESULTSSSexcel
ADD REPLYlink modified 12 months ago • written 12 months ago by Jorge Amigo10.0k

WELL, now I start to understand a bit more how this works, I had managed to even add a column on my own just with the gene info :)))

The only thing I dont know is how to make appear the column with 0/0 1/1 just appear the value zero (supose to be GT column name right?)

thanks

ADD REPLYlink written 12 months ago by cristina_sabiers40
1
gravatar for Claire Malley
12 months ago by
Johns Hopkins University, Baltimore MD
Claire Malley20 wrote:

I never open VCF data in Excel. Here is an R based solution using fread() that works fast on large VCF files. You just need to know how many lines the header takes up before the essential line #CHROM POS...etc.

require(data.table)
vcf.data <- fread("file.vcf", header=T, skip=10L, data.table=F, sep="\t")

To extract just the INFO column and split its elements by semicolon:

require(stringr)
info.df <- data.frame(do.call(rbind, strsplit(as.vector(vcf.data$INFO), split = ";")))
ADD COMMENTlink written 12 months ago by Claire Malley20
1

thanks to all!

today that I have free day off work can try all your tips...thank you so much :))))

ADD REPLYlink written 12 months ago by cristina_sabiers40
1
gravatar for Vitis
12 months ago by
Vitis1.6k
New York
Vitis1.6k wrote:

For relatively small queries, for example, limited number of SNP sites, I usually use PyVCF, which is the python interface to VCF. Programing with PyVCF can give you almost everything from the VCF. However, it doesn't work very well with super-large multi-sample VCF files. The nice thing about PyVCF also lies you could embed the code in your python applications.

https://github.com/jamescasbon/PyVCF

ADD COMMENTlink written 12 months ago by Vitis1.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1303 users visited in the last hour