Question: Addin gene information to VCF file
1
gravatar for SemiQuant
4.9 years ago by
SemiQuant70
South Africa
SemiQuant70 wrote:

HI,

 

I've been attempting to add gene information to my VCF file.  I have a vcf with my variants and a bed file with the gene names and their start and end position.  Ive tried to add this information to the vcf using GATK VariantAnnotator, vcftools annotate, bcftools annotate, bcftools insec but to no avail.  I'm working on M. tuberculosis.  could anyone help me out?

 

thanks

snp variant next-gen annotation • 8.5k views
ADD COMMENTlink modified 2.8 years ago by duke10 • written 4.9 years ago by SemiQuant70
9
gravatar for Sean
4.2 years ago by
Sean180
United States
Sean180 wrote:

I had the exact same scenario, and BCFtools worked great for me. The following will take your gene names from a BED file and add them as an INFO tag in your VCF:

bcftools annotate \
  -a genes.bed.gz \
  -c CHROM,FROM,TO,GENE \
  -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') \
  variants.vcf.gz

A few things to note:

  • genes.bed.gz must be a standard BED file that has been zipped using bgzip (bgzip genes.bed)
  • genes.bed.gz must indexed using tabix (tabix -p bed genes.bed.gz)
  • genes.bed.gz does not need a header (we define this with the -c option instead)
  • 'CHROM', 'FROM', and 'TO' are required to be passed to the -c option, but these columns do not actually get added to your VCF (only the 'GENE' will get added)
  • Because we're not annotating from a VCF/BCF file, the -h option is required (it will not work otherwise)
  • Whatever we pass to the -h option will simply get added to the VCF meta-information
ADD COMMENTlink written 4.2 years ago by Sean180
1
gravatar for duke
2.8 years ago by
duke10
San Diego
duke10 wrote:

ScienceBuff, you probably have figured this out by now. But I just wanted to post an answer because I've been struggling with adding gene annotations to my Mtb vcf files as well. Here is what finally got it to work:

1) Download the H37Rv reference genome from https://www.ncbi.nlm.nih.gov/ (use the genome database). I used the tabular format, which has the following columns:

CHROM FROM TO feature_ID locus_tag strand na_length gene product
NC_000962.3 1 1524 51951 Rv0001 + 1524 dnaA chromosomal replication initiation protein
NC_000962.3 2052 3260 52294 Rv0002 + 1209 dnaN DNA polymerase III subunit beta
NC_000962.3 3280 4437 52224 Rv0003 + 1158 recF recombination protein F

2) Compress the file (if not in .gz format):

bgzip H37Rv-ref.tab

3) Index the compressed reference file. Since the reference is in tab format, make sure your parameters match the correct columns (e.g., for the reference file that I used, -s corresponds to CHROM (column 1), -b corresponds to FROM (column 2), -e corresponds to TO (column 3)):

tabix -s 1 -b 2 -e 3 H37Rv-ref.tab.gz

4) Create a header file by creating a new document and naming it something like "header.hdr". Populate the header file with information for each column of the reference/annotation file used (the CHROM, TO, and FROM headers are standard can be left out of this file):

##INFO=<id=feature_id,number=1,type=integer,description="feature_id">
##INFO=<id=locus_tag,number=1,type=string,description="locus">
##INFO=<id=strand,number=1,type=string,description="strand">
##INFO=<id=na_length,number=1,type=integer,description="length">
##INFO=<id=gene,number=1,type=string,description="gene">
##INFO=<id=product,number=1,type=string,description="product">

5) You are now ready to annotate your vcf file. The -c parameters include the columns that you want to include in the annotated vcf file (CHROM, FROM, and TO are included by default). The dashes (-) indicate columns to leave out. Other than the default parameters (e.g., CHROM, FROM, and TO), the names of other columns to include are preceded by "INFO/".

bcftools annotate \
-a H37Rv-ref.tab.gz \
-h header.hdr \
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \
variants.vcf > annotated.vcf

Note: make sure that the string listed under the CHROM column of the vcf file matches what is listed under the CHROM column of the reference file (e.g., NC_000962.3, in this case).


I'm not sure if these are the issues with your process, but consider:

1) Making the following changes to the column headers of your annotation file:

"#CHROM" to "CHROM" "chromStart" to "FROM" "chromEND" to "TO"

So your column headers should look like:

CHROM FROM TO name chr 1 1524 Rv0001 chr 2052 3260 Rv0002 chr 3280 4437 Rv0003

2) Your header file can then only consist of the following line:

##INFO=<id=name,number=1,type=string,description="name">

3) Make sure you tabix the gz version of your reference/annotation file using the following parameters: tabix -s 1 -b 2 -e 3 (not -p)

4) I did not have to index my variants (vcf) file using tabix.

5) Make sure the string under the CHROM column of your vcf file matches the name listed under your reference file (i.e., "chr").

6) Your annotation command will then be:

bcftools annotate \
-a annotation.bed.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/name\
variants.vcf.gz> annotated.vcf.gz

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by duke10
0
gravatar for Pierre Lindenbaum
4.9 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

I wrote a tool named VCFBed: https://github.com/lindenb/jvarkit/wiki/VCFBed

 

curl "https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf" |\
sed 's/^chr//' |\
java -jar  dist/vcfbed.jar TABIXFILE=~/ncbibiosystem.bed.gz TAG=NCBIBIOSYS  FMT='($4|$5|$6|$7)' |\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
1   69270   .   A   G   2694.18 .   AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86    GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0
1   69511   .   A   G   77777.27    .   AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261  GT:AD:DP:GQ:PL  ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by Pierre Lindenbaum123k
0
gravatar for Jorge Amigo
4.9 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

either vcftools annotate or bcftools annotate should very easily do. the only thing you have to bear in mind is that the annotation file (vcf or bed for instance) must be bgzip-compressed and tabix-indexed. a worked example would be the following:

bgzip genes.bed
tabix -p bed genes.bed.gz
bcftools annotate -a genes.bed.gz variants.vcf.gz
ADD COMMENTlink written 4.9 years ago by Jorge Amigo11k

Thanks Jorge,

Im still strugeling with this, I have crated a BED file with only the genes in my VCF to test and it does still not work.  I used the following code:

bgzip annotation.bed 

tabix -p bed annototation.bed.gz 

bgzip variants.vcf

tabix -s1 -b2 -e3 variants.vcf.gz


bcftools annotate -a annotation.bed.gz \

-h Header_file.hdr \

-c CHROM,FROM,TO,TAG,ID variants.vcf.gz

 

My VCF was generated by BWA alignment and GATK calling and the BED file is tab delimited looks like this:

#CHROM    chromStart    chromEnd    name
chr    1    1524    Rv0001
chr    2052    3260    Rv0002
chr    3280    4437    Rv0003
chr    4434    4997    Rv0004
chr    5123    7267    Rv0005
chr    7302    9818    Rv0006
chr    9914    10828    Rv0007

P.S my VCF CHROM is also called chr

Any help would be greatly appreciated.

ADD REPLYlink written 4.8 years ago by SemiQuant70
0
gravatar for SemiQuant
4.9 years ago by
SemiQuant70
South Africa
SemiQuant70 wrote:

Thanks for the responses.  I think the problem is that my BED file is incorrectly formatted.  I've tried bcftools annotate using a bgzipped, tabix index file like you mentioned and a tab delimited file indexed with tabix -s1 -b2 -e3 $annotations.   Then i tried using a header file in conjunction.  Could it be that when saving my tab limited file from excel it incorrectly saves it?

 

bgzip annotation.bed 

tabix -p bed annototation.bed.gz 

bgzip variants.vcf

tabix -s1 -b2 -e3 variants.vcf.gz

 

bcftools annotate -a annotation.bed.gz \

-h Header_file.hdr \

-c CHROM,FROM,TO,TAG,ID variants.vcf.gz

ADD COMMENTlink written 4.9 years ago by SemiQuant70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1781 users visited in the last hour