Question: Extracting Gene Names From VCF File
3
6.2 years ago by
User 1933340
User 1933340 wrote:

I have an annotated VCF file, and I would like to extract gene names, which do not necessarily appear in the same column on every line. Each gene name is preceded by the pattern "txGN=".

Is there a flexible parser for this purpose, preferably in R or using awk?

vcf parsing • 6.6k views
modified 6.2 years ago by Rm7.8k • written 6.2 years ago by User 1933340

Use the vcftools --get-INFO option. The tag name is case-sensitive, so it has to match the "txGN" used in your file. Your command would be:

./vcftools --vcf your_vcf_file.vcf --get-INFO txGN --out vcf_file_gene_name_info

written 3.0 years ago by das2000sidd30
3
6.2 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

a perl oneliner would do:

perl -lne 'print $1 while /txGN=([^;]+)/g' < input.vcf | sort -u

(assuming that ";" is the delimiter character for the gene names, and that you want a list with unique gene names in it)
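
For readers who would rather stay in awk, here is a rough sketch of the same idea (not from the original answer). The function name unique_txgn is made up for illustration, and the sketch assumes a gene name ends at the first ";", tab, or space, which is slightly stricter than the Perl pattern above.

```shell
# Hypothetical helper (not from the thread): print each distinct txGN value
# found on standard input, one per line, sorted.
unique_txgn() {
  awk '{
    line = $0
    # match() sets RSTART and RLENGTH; loop so several tags on one line are all caught
    while (match(line, /txGN=[^;\t ]+/)) {
      print substr(line, RSTART + 5, RLENGTH - 5)   # skip the 5 characters of "txGN="
      line = substr(line, RSTART + RLENGTH)          # continue after this match
    }
  }' | sort -u                                       # sort first so duplicates collapse
}
```

It would be called as unique_txgn < input.vcf. The sort -u at the end matters because plain uniq only collapses duplicates that sit on adjacent lines.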

modified 6.2 years ago • written 6.2 years ago by Jorge Amigo11k
1

up vote: perl favoritism.

written 6.2 years ago by Zev.Kronenberg11k

Double up vote: favoritism of perl favoritism

written 6.2 years ago by Alex Paciorkowski3.3k
2
6.2 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

it's not particularly flexible (or pretty), but you can do this in awk quite simply regardless of the column:

awk '{ n = split($0, a, "txGN="); if (n > 1) { split(a[2], b, /[;\t ]/); print b[1] } }'
written 6.2 years ago by brentp23k
2
6.2 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

Here is the awk script I wrote to extract INFO tags for each line from VCF (or VCF-like) files:

USAGE: awk -F"\t" -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#!/bin/awk -f
##extract.vcf.info.Tag.awk   
##INPUTS (pasted or) TSV or VCFs file with INFO field intact from VCF
## USAGE: awk -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#    BEGIN { FS = "\t" }  # if not using -F"\t" above

# excludes lines starting with # or ## 
(substr($1,1,1)!="#" && substr($1,2,1)!="#") {

printf "%s", $0 ; ##  Prints original line (explicit format so a "%" in the data is not treated as a format specifier)

split(TAGS,key,",") ;
split(InfoColumns,col,",") ;

n = asorti(col,copy); # To preserve the original column order (asorti is a gawk extension; indices sort as strings, so "10" comes before "2")

for(i=1;i<=n;i++){
  split($col[copy[i]],info,";");

k = asorti(key,kapy) ; # To preserve the original key order
for(j=1;j<=k;j++){
           pat1=key[kapy[j]]"=";
     if ($col[copy[i]] ~ pat1){
        for (f in info){
                if (info[f] ~ pat1){
                sub(pat1,"",info[f]);
                sub(/"/,"",info[f]);
                printf "\t%s", info[f]; # Prints extracted info tag field
              }
         }
      }
     else
      printf "\t" "."; # Prints "dot" if not present 
  }
 }
  printf "\n";
}
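
One thing to watch in the script above: the tag test uses an unanchored regex match, so a pattern such as "DP=" would also fire on a field like "ADP=5". Below is a minimal sketch of exact-key matching. It is not part of the script above; it assumes the INFO field is in column 8 only, and the function name info_tags is hypothetical.

```shell
# Hypothetical helper (not part of the script above): for each data line on
# stdin, print the values of the requested INFO tags from column 8,
# tab-separated, with "." for a tag that is absent.
info_tags() {
  awk -F'\t' -v TAGS="$1" '
    BEGIN { ntag = split(TAGS, want, ",") }   # split() numbers entries in the order given
    /^#/ { next }                             # skip header lines
    {
      split("", val)                          # portable way to empty the array per line
      n = split($8, field, ";")
      for (i = 1; i <= n; i++) {
        eq = index(field[i], "=")
        # store key -> value; flag-style entries without "=" are ignored
        if (eq > 0) val[substr(field[i], 1, eq - 1)] = substr(field[i], eq + 1)
      }
      out = ""
      for (j = 1; j <= ntag; j++) {
        v = (want[j] in val) ? val[want[j]] : "."
        out = (j == 1) ? v : out "\t" v
      }
      print out
    }'
}
```

Usage would look like info_tags "txGN,DP,DP4,CLR" < input.vcf. Because keys are compared as whole strings, "DP" and "DP4" can never be confused with each other or with longer tag names.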
modified 6.2 years ago • written 6.2 years ago by Rm7.8k

+1 for a full-on multi-line awk script!

written 6.2 years ago by brentp23k

Two minor suggestions on the awk script.

  1. Suggest adding BEGIN { FS = "\t" } for VCF files, since the delimiter is tab not the default whitespace
  2. Instead of (substr($1,1,1)!="#" && substr($1,2,1)!="#") I would suggest /^[^#]/
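
To show what the two suggestions look like together, here is a small sketch. The function name data_lines and the choice of printed columns are only for illustration.

```shell
# Hypothetical helper: print CHROM, POS and the raw INFO column of data lines.
data_lines() {
  awk '
    BEGIN { FS = "\t"; OFS = "\t" }   # VCF columns are tab-separated
    /^[^#]/ { print $1, $2, $8 }      # line has a first character and it is not "#"
  '
}
```

Note that /^[^#]/ also drops empty lines, since it needs at least one character to match. Setting FS to a tab keeps an INFO value that happens to contain a space in a single field.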
modified 6.2 years ago • written 6.2 years ago by jjc80

thanks; one of many ways to code.....

written 6.2 years ago by Rm7.8k
1
6.2 years ago by
jxchong160
Postdoc at the University of Washington
jxchong160 wrote:

Use vcf-query from vcftools: http://vcftools.sourceforge.net/perl_module.html#vcf-query

Have it output %txGN or %INFO/txGN, depending on where txGN is stored.

written 6.2 years ago by jxchong160
0
6.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I wrote a tool to extract a tag from a VCF: http://code.google.com/p/variationtoolkit/wiki/ExtractInfo

e.g.:

$ gunzip -c data.vcf.gz |\
  extractinfo -t GN -i | \
  awk -F '\t' '($11 =="NOTCH2")' |\
  cut -f 3 | grep rs

rs6685892
rs2493392
rs2493420
rs7534585
rs7534586
rs2493409
rs2453040
rs2124109
written 6.2 years ago by Pierre Lindenbaum120k