Question: Extract vcf file header and save it in Key-Value csv
22 months ago, tinyskinn wrote:

Hi everyone,

If I have a vcf file with header like this:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA0000

How can I extract these key-value pairs (with bcftools) and save or print them to a csv file with Python? My expected output would look something like:

fileformat,VCFv4.0
fileDate,20090805

etc...

Thanks for the anticipated response

sequencing snp next-gen genome • 1.8k views
modified 22 months ago • written 22 months ago by tinyskinn

Hello tinyskinn,

What is your final goal? When working with Python on VCF files, you should use one of the existing modules, such as pysam or cyvcf2.

fin swimmer

modified 22 months ago • written 22 months ago by finswimmer (13k)

Hello Fin, my final goal is to have annotated metadata as in enter link description here. This requires me to automate the extraction of the header key-value info. I am trying to implement this in enter link description here. Thank you for the suggestions.

modified 22 months ago • written 22 months ago by tinyskinn
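
For annotated metadata, the structured header lines (INFO, FILTER, FORMAT) carry their own nested key-value pairs inside the angle brackets. The following is a minimal Python sketch of how such a value might be split into a dictionary. The function name `parse_structured_value` is hypothetical, and the regular expression assumes that only quoted fields contain commas; it does not unescape backslash sequences inside quoted strings.

```python
import re

# key=value pairs where the value is either a double-quoted string
# (which may contain commas) or a run of characters up to the next comma.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_structured_value(value):
    """Turn '<ID=NS,Number=1,...,Description="...">' into a dict.

    Returns None for plain values such as 'VCFv4.0'.
    """
    if not (value.startswith("<") and value.endswith(">")):
        return None
    fields = {}
    for key, raw in _FIELD.findall(value[1:-1]):
        if raw.startswith('"') and raw.endswith('"') and len(raw) >= 2:
            raw = raw[1:-1]  # drop the surrounding quotes
        fields[key] = raw
    return fields
```

Applied to the value part of a `##INFO=...` line, this gives a mapping with keys such as `ID`, `Number`, `Type` and `Description`, which could then be written out as separate CSV columns.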

I don't understand how your expected output is comma separated for the first line, but separated by a = for the second line.

Also, what have you tried to solve this issue? Why do you want to use bcftools and Python, exactly? Mentioning these suggests that you at least have an idea which tools or languages can be used.

written 22 months ago by WouterDeCoster (44k)

Regarding your first question, I have corrected the error. Thanks! As for your second question, I do not actually have to use bcftools or Python; they were simply the ones that came up in my search. My goal is explained in my reply to @finswimmer above.

modified 22 months ago • written 22 months ago by tinyskinn
22 months ago, Nicolas Rosewick (Belgium, Brussels) wrote:

Use classic shell tools for that (not tested, though):

cat file.vcf | grep '^##' | sed 's/^##//' | sed 's/=/,/' > header.txt
modified 22 months ago • written 22 months ago by Nicolas Rosewick (9.0k)
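
Since the question asks for Python, the same idea can be sketched with only the standard library. This is a minimal sketch, not a tested pipeline: the names `header_pairs`, `header_csv_text` and `write_header_csv` are hypothetical, and it assumes the file is an uncompressed text VCF whose meta-information lines all come before the `#CHROM` line. Each line is split at the first `=` only, so the nested `=` signs inside `##INFO=<...>` stay in the value.

```python
import csv
import io


def header_pairs(lines):
    """Yield (key, value) pairs from VCF meta-information lines ('##key=value')."""
    for line in lines:
        if not line.startswith("##"):
            # '#CHROM ...' or a data record: the meta-information block is over.
            break
        key, sep, value = line[2:].rstrip("\r\n").partition("=")
        if sep:  # skip malformed lines that contain no '='
            yield key, value


def header_csv_text(lines):
    """Return the key-value pairs as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(header_pairs(lines))
    return buf.getvalue()


def write_header_csv(vcf_path, csv_path):
    """Read an uncompressed VCF and write its header pairs to csv_path."""
    with open(vcf_path, encoding="utf-8") as vcf:
        text = header_csv_text(vcf)
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        out.write(text)
```

Using the `csv` module rather than plain string joining matters here because values such as `<ID=NS,Number=1,...>` contain commas and double quotes; the writer quotes those fields so that the file still has exactly two columns when read back.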

It returned the original header, including the '##'. What if the VCF is compressed (.vcf.gz)?

written 22 months ago by tinyskinn

I added an additional sed, but I haven't tried it.

written 22 months ago by Nicolas Rosewick (9.0k)

For .gz use gzip -d -c instead of cat

written 22 months ago by Nicolas Rosewick (9.0k)

Or just zgrep instead of gzip+grep.

written 22 months ago by finswimmer (13k)
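
In Python, the compressed case can be handled in a similar way with the standard `gzip` module. The sketch below is one possible approach, with hypothetical helper names `open_vcf` and `read_meta_lines`; it decides by the two gzip magic bytes rather than by the file extension, and it assumes that bgzip-compressed files can be read as ordinary gzip streams.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text, decided by the magic bytes."""
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open
    return opener(path, "rt", encoding="utf-8")


def read_meta_lines(path):
    """Return the '##' lines of a VCF, stopping at the first other line."""
    lines = []
    with open_vcf(path) as vcf:
        for line in vcf:
            if not line.startswith("##"):
                break  # '#CHROM' line or first record: stop reading early
            lines.append(line.rstrip("\r\n"))
    return lines
```

Because reading stops at the first line that does not start with `##`, only the beginning of a large compressed file needs to be decompressed.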

I'd use grep '^#' instead to capture the header for the table

written 12 weeks ago by timing10
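
Note that `grep '^#'` also matches the single-`#` column line (`#CHROM POS ID ...`), which is not a key-value pair and needs separate handling. A minimal Python sketch for that line follows; the function name `parse_column_line` is hypothetical, and it assumes tab-separated columns as in a real VCF file (the example in the question is shown with spaces).

```python
def parse_column_line(line):
    """Split the '#CHROM ...' line into column names and sample IDs."""
    if not line.startswith("#CHROM"):
        raise ValueError("not a VCF column header line")
    columns = line[1:].rstrip("\r\n").split("\t")
    # The first eight columns are fixed and the ninth is FORMAT;
    # anything after that is a sample ID.
    samples = columns[9:]
    return columns, samples
```

For a file without genotype data the line has only the eight fixed columns, and the sample list comes back empty.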