Question: Extracting certain columns from VCF file
6 months ago, gradstudentNew wrote:

Hello all,

I've recently been trying to extract only certain columns from an ANNOVAR-annotated VCF file with vcftools. I ran the following command: vcftools --vcf file_ANNOVAR.vcf --recode-INFO ExAC_SAS_AF --recode-INFO rs_dbSNP147 --out OUTPUT.vcf but unfortunately it isn't working. Does anyone have any tips on what else I could try? I don't know what the column number is because the file is too big to open on my computer (I'm doing everything via SSH).


If you want a GUI-based program, this is the one to use.

written 6 months ago by Chirag Parsania

@OP: please post the input VCF (with headers and a few example records) and the columns you want to extract.

written 6 months ago by cpad0112

Hey guys, I ended up using some Perl scripting to fix my issue. I realized that everything was being printed in the 9th column, i.e. Exac|gnomad|..|..|, so I split that column on "|" and then pasted/joined the pieces I needed. :) Thank you all for the help!

written 6 months ago by gradstudentNew

You're welcome dude

written 6 months ago by Kevin Blighe

6 months ago, Kevin Blighe wrote:

You need to switch from VCFtools to BCFtools, in particular, bcftools query.

It looks like you not only want certain columns but also certain key-value pairs within the primary VCF columns, which are tab-delimited.

Here are examples that will assist you from one of my own VCFs:

bcftools query -f'[%CHROM:%POS %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511 1/1
1:69761 0/1
1:752721 0/1
1:752894 1/1
1:762273 0/1

.

bcftools query -f'[%CHROM:%POS:%REF:%ALT %SAMPLE %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511:A:G 2701 1/1
1:69761:A:T 2701 0/1
1:752721:A:G 2701 0/1
1:752894:T:C 2701 1/1
1:762273:G:A 2701 0/1

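It should be fairly obvious what those are doing: each prints, per sample, the variant position (plus REF, ALT and the sample name in the second command) followed by the genotype. To extract certain values from the INFO column, which is what you appear to need, you can do the following: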

bcftools query -f'[%CHROM:%POS:%REF:%ALT %INFO/HaplotypeScore:%INFO/VQSLOD %SAMPLE %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511:A:G 0.9159:-6.231 2701 1/1
1:69761:A:T 0:-9.034 2701 0/1
1:752721:A:G 0:-1.447 2701 0/1
1:752894:T:C 0:-6.798 2701 1/1
1:762273:G:A 5.3647:-2.236 2701 0/1

Here, HaplotypeScore and VQSLOD are tags defined in my INFO field.
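If you are not sure which tags your own file defines, the header will tell you, and you never need to open the whole file to read it. Below is a minimal sketch; the helper name list_info_tags and the tiny header are made up for illustration, and the bcftools call in the comment is how I would assume you feed it a real file.

```shell
# list_info_tags (hypothetical helper): print the ID of every ##INFO
# header line read from stdin.
# Assumed usage on a real file (not run here):
#   bcftools view -h file_ANNOVAR.vcf | list_info_tags
list_info_tags() {
    grep '^##INFO=<ID=' | sed -e 's/^##INFO=<ID=//' -e 's/,.*$//'
}

# Tiny made-up header, for illustration only
printf '%s\n' \
    '##fileformat=VCFv4.2' \
    '##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="example">' \
    '##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="example">' |
    list_info_tags
```

Any ID printed this way can then be used as %INFO/ID in the bcftools query format string.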

Kevin


I'm really new to bioinformatics, so thank you so much for your help! I tried doing that and it said that the column(s) didn't exist. I'm not sure whether it's because of how my VCF file is formatted? My info header looks like this:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature

Do you think the "|" is affecting anything?

written 6 months ago by gradstudentNew (modified 6 months ago by genomax)

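Ah! In this case, the INFO key is called CSQ (%INFO/CSQ), as given by ID=CSQ in the header line; Description is only the human-readable text attached to that key. bcftools query will therefore only be able to extract the entire pipe-delimited string that contains all of your annotation.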

You can still, nevertheless, do that and then do some post filtering with cut, sed, awk, or other commands. How is your experience with these commands?

ANNOVAR can output in CSV format, by the way. That would be much easier for you, surely?
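As a rough sketch of that post-filtering with awk: the function below assumes the annotation sits under the INFO key CSQ (as ID=CSQ in the header suggests) and that Consequence and SYMBOL are the 2nd and 4th pipe-delimited sub-fields, as in the Format string shown above. The helper name pick_csq_fields, the file name input.vcf.gz and the sample record are made up for illustration.

```shell
# pick_csq_fields (hypothetical helper): read tab-separated
# CHROM, POS, REF, ALT, CSQ lines on stdin and print Consequence
# (sub-field 2) and SYMBOL (sub-field 4) for every comma-separated CSQ entry.
# Assumed upstream command on a real file (not run here):
#   bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/CSQ\n' input.vcf.gz | pick_csq_fields
pick_csq_fields() {
    awk -F'\t' -v OFS='\t' '{
        n = split($5, entries, ",")        # one entry per annotated transcript
        for (i = 1; i <= n; i++) {
            split(entries[i], f, "[|]")    # pipe-delimited sub-fields
            print $1, $2, $3, $4, f[2], f[4]
        }
    }'
}

# Made-up record, for illustration only
printf '1\t69511\tA\tG\tG|missense_variant|MODERATE|OR4F5|ENSG00000186092|Transcript|ENST00000335137\n' |
    pick_csq_fields
```

Change the sub-field numbers to match the order in your own header's Format string. Recent bcftools releases also ship a split-vep plugin aimed at this kind of annotation, which may be worth a look if it is available on your system.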

written 6 months ago by Kevin Blighe