Question: Write header of contigs in a vcf file
0
gravatar for angelaparody
10 months ago by
angelaparody60
angelaparody60 wrote:

Hi,

I am having the following issue. I used STACKS to produce a .vcf file and it does not contain header for contigs, and I need this in order to edit the ID column which is all 0s at the moment (see below, well, it is a bit mixed up). This is how the vcf file looks like (header and a few more rows):

##fileformat=VCFv4.2                                                                                                        
##fileDate=20191104                                                                                                     
##source="Stacks v2.41"                                                                                                     
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total Depth for Each Allele">                                                                                                      
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">                                                                                                       
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">                                                                                                      
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">                                                                                                      
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">                                                                                                       
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">                                                                                                     
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">                                                                                                      
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">                                                                                                      
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">                                                                                                       
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">                                                                                                        
##INFO=<ID=loc_strand,Number=1,Type=Character,Description="Genomic strand the corresponding Stacks locus aligns on">                                                                                                        
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Lib1_3A Lib1_3B Lib1_3C Lib1_3D Lib1_3E Lib1_3F Lib1_4A Lib1_4B Lib1_4C Lib1_4D Lib1_4E Lib1_4F Lib1_7A Lib1_7A2    Lib1_7D Lib1_7D2    Lib1_7E Lib1_7F
scaffold_45 10  0   A   C   0

And when I use this option to edit the #IDs (I want IDs to be like scaffold_45_10, which is scaffold _CHROM_POS):

bcftools annotate --set-id +'%CHROM\_%POS' myfilename.vcf > new.vcf.

I get this error:

[W::vcf_parse] Contig 'scaffold_45' is not defined in the header. (Quick workaround: index the file with tabix.)
Encountered error, cannot proceed. Please check the error output above.

Therefore I would need to define contigs in the header (that is what the error says). To do so, I tried what is explain in the last message in this post: bcftools_Issue#766 Following exactly what it is said there I got a vcf file that looks like this:

    ##contig=<ID=scaffold_45, length=120>                                                                                                       
    ##contig=<ID=scaffold_62, length=125>                                                                                                       
    ##contig=<ID=scaffold_69, length=132>                                                                                                       
    ##contig=<ID=scaffold_98, length=172>                                                                                                       
    ##contig=<ID=scaffold_154, length=149>                                                                                                      
    ##contig=<ID=scaffold_210, length=184>                                                                                                      
  (there are more rows)                                                                                                     
    **VIM - Vi IMproved 8.0 (2016 Sep 12, compiled Jun 21 2019 04:10:35)**                                                                                                      
    scaffold_45 10  0   A   C   0   PASS

Which is wrong. The header that was there has been replaced by the list of contigs IDs and lengths (which is what I want to include in the header but without losing the header that was there). There is also one extra row (VIM - Vi IMproved 8.0 (2016 Sep 12, compiled Jun 21 2019 04:10:35) that I don't know how it appeared and should not be there.

In short, I need to be able to edit the IDs of the vcf file, and for that I need to add the contig info in the header of the vcf file. The reason I want the IDs in the vcf file is to filter SNPs by IDs with vcftools. Does someone know how to solve this issue?

Thanks in advance,

'Angela

PS: I may tray to include the header of my initial vcf file into the header.txt and see what happens...

edit header vcf • 752 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by angelaparody60
1
gravatar for angelaparody
10 months ago by
angelaparody60
angelaparody60 wrote:

I found a solution! Maybe not the best, probably it could be done in a different -and faster- way, but I am not a bioinformatician and this worked for me :)

An output of STACKS is catalog.fa.gz and I obtained the .fai doing this:

gzip -cd ./catalog.fa.gz > catalog.fa
samtools faidx catalog.fa > catalog.fa.fai

Then I took this code from Biostars: Question: Add contig lenght to VCF header in a robust way

awk '{printf("##contig=<ID=%s,length=%d>\\n",$1,$2);}' ref.fai > ToEditHeader.txt

I have edited this txt file and used it to edit what is between ** (see below).

awk '/^#CHROM/ { printf("**##contig=<ID=1,length=195471971>\n##contig=<ID=2,length=182113224>\n**");} {print;}' inputfile.vcf > contigsInHeader.vcf

Then I used this (see below) from: Biostars: Question: How edit VCF file (specifically #CHROM and ID columns), which finally set IDs on my contigsInHeader.vcf file:

bcftools annotate --set-id +'%CHROM\_%POS' contigsInHeader.vcf > SnpsWithIDs.vcf

I hope this help to anyone having the same issue!

Cheers,

'Angela

ADD COMMENTlink written 10 months ago by angelaparody60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2153 users visited in the last hour