Question

Select Certain Column From Vcf File

2

Entering edit mode

12.0 years ago

michealsmith ▴ 790

I have a zebrafish SNP vcf file at hand; in the genotype field, there are information for six strains. However, I'm only interested in TA and TUB strain. Then how can I select these two columns only?

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    AB    TLF    TUB    TUG    WKB    WKG
chr10    587    .    G    A,C    106    PASS    AF1=0.9999;CI95=0.5833,1;DP=9;DP4=0,0,1,5;FQ=-29.7;MQ=60    GT:PL:GQ    1/1:35,3,0,35,3,35:39    1/1:0,0,0,0,0,0:36    1/1:44,6,0,44,6,44:42    1/1:45,11,8,37,0,34:39    1/1:0,0,0,0,0,0:36    1/1:26,3,0,26,3,26:39

I tried to used vcf-subset ; but doesnt' work producing "multiple read header error"

Or can I simply use some bash/python to manipulate?

Also, if I wanna select against the genotype, say, to discard the SNP where genotype for both TA and TUB are "0/0". What shall I do? thx

EDIT: THank you guys. The problem is actually quite silly. The vcf files I got contains multiple header. Seems my collaborator simply concatenate several vcf files using cat instead of merging them. Then I got errors in the processing using both vcftools-subset and python.

Both python and vcftools can easily solve these problem.

So take my lesson to double-check the original vcf file before any processing

vcf • 16k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 12.0 years ago by michealsmith ▴ 790

2

Entering edit mode

Probably just going to be fastest to script it up. You can also use awk '{print $your.column}' file.vcf

ADD REPLY • link 12.0 years ago by Zev.Kronenberg 12k

1

Entering edit mode

awk '{OFS="\t";print $your.column1_number, $your.column2_number}'  >
   your_2_columns_file

 awk '{OFS="\t"; print $your.column1_number,$your.column2_number}' |egrep -v '0/0\s+0/0'  >
   your_2_columns_file_without_0/0_genotypes

Since you working with large files, you should consider to take a look at unix tutorial.

Another approach is to use Galaxy - take a look at tutorials for text parsing

ADD REPLY • link 12.0 years ago by Zhidkov ▴ 600

0

Entering edit mode

thx...but seems egrep -v '0/0s+0/0' doesn't work.....actually I don't know how to grep such complex pattern

ADD REPLY • link 12.0 years ago by michealsmith ▴ 790

1

Entering edit mode

It would probably be worth figuring out where the error is coming from with vcf-subset as that suggests you have a poorly formatted vcf file

ADD REPLY • link 12.0 years ago by Laura ★ 1.8k

score 4 · Answer 1 · 2012-04-10

You can try using the 'cut' unix command:

cut yourFile.vcf -f 12,13

That would return column 12 and 13. 'f' parameter specifies columns you want to return, 'd' specifies delimiter to cut.

I recently wrote a brief entry about using unix commands to work with tab delimited files: http://blog.nextgenetics.net/?e=28

If you want to use a script solution, you can try this in python:

import sys
for line in open(sys.argv[1],'r'):
  columns = line.strip().split('\t')
  print columns[1], columns[2]

Save as yourName.py and use by: python yourName.py yourFile.vcf

score 3 · Answer 2 · 2012-04-10

PyVCF (code, documentation) has already done all the hard work of figuring out how to parse a *.vcf file, and presents it to you in a way similar to the csv module in the Python stdlib. The code below should allow you to get started:

import vcf

vcf_reader = vcf.Reader(open(r'path\to\zebrafish.vcf', 'rb'))
for record in vcf_reader:
    for call in record.samples:
        if call.sample=='TUB' or call.sample=='TA':
            if call.gt_nums != '0/0':
                print call.gt_nums

Ram · Answer 3 · 2014-11-17

2

Entering edit mode

9.4 years ago

arronslacey ▴ 320

I just get all lines that don't contain #

sed -n '/#/!p' g1.vcf

and then pipe into the cut command:

sed -n '/#/!p' g1.vcf | cut -f 1,8,11

hope that helps.

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by arronslacey ▴ 320

score 1 · Answer 4 · 2012-04-10

A command-line one liner might be cooler, but this should work and it's not mysterious. Replace idx_first and idx_second with the index of your genotypes of interest, remembering that python arrays are 0-based.

python
f = open('foo.vcf')
fo = open('foo_stripped.vcf', 'w')
idx_first = 11
idx_first = 12
for line in f:
    if line[0]=='#':
        fo.write(line)
        continue
    a = line.rstrip('\r\n').split('\t')
    out = a[0:9]
    if not "0/0" in a[idx_first] and not "0/0" in a[idx_second]:
        out.append( a[idx_first] )  # first column with genotype
        out.append( a[idx_second] ) # second column with genotype
        fo.write( '\t'.join(out) + '\n')

fo.close()
f.close()