Question: modifying the header of VCF file
0
gravatar for zengtony743
2.6 years ago by
zengtony74370
Canada
zengtony74370 wrote:

Thank you for whoever can help me!!!

1) When I ran

$ `java -jar /hpf/tools/centos6/gatk/3.6.0/GenomeAnalysisTK.jar -R genome.fa -T SelectVariants --variant Newsnp.vcf -o testSNP.vcf -sn
AKRJ -sn AJ &

`

2) it shows

ERROR variant contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, X]
ERROR sequence contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------

3) the VCF header of Newsnp.vcf is

contig=<ID=1,length=195471971> 
contig=<ID=10,length=130694993> 
contig=<ID=11,length=122082543> 
contig=<ID=12,length=120129022> 
contig=<ID=13,length=120421639> 
contig=<ID=14,length=124902244> 
contig=<ID=15,length=104043685> 
contig=<ID=16,length=98207768> 
contig=<ID=17,length=94987271> 
contig=<ID=18,length=90702639> 
contig=<ID=19,length=61431566> 
contig=<ID=2,length=182113224> 
contig=<ID=3,length=160039680> 
contig=<ID=4,length=156508116> 
contig=<ID=5,length=151834684> 
contig=<ID=6,length=149736546> 
contig=<ID=7,length=145441459> 
contig=<ID=8,length=129401213> 
contig=<ID=9,length=124595110> 
contig=<ID=X,length=171031299>

However, the sequence contigs are

chr10,
chr11, 
chr12, 
chr13, 
chr14, 
chr15, 
chr16, 
chr17, 
chr18, 
chr19, 
chr1, 
chr2, 
chr3, 
chr4, 
chr5, 
chr6, 
chr7, 
chr8, 
chr9, 
chrM, 
chrX, 
chrY

How can I change the VCF header from above to the followings to make it compatible ?

This is the header of VCF should be like after modifying

contig=<ID=chr10,length=130694993>
contig=<ID=chr11,length=122082543>
contig=<ID=chr12,length=120129022>
contig=<ID=chr13,length=120421639>
contig=<ID=chr14,length=124902244>
contig=<ID=chr15,length=104043685>
contig=<ID=chr16,length=98207768>
contig=<ID=chr17,length=94987271>
contig=<ID=chr18,length=90702639>
contig=<ID=chr19,length=61431566>
contig=<ID=chr1,length=195471971>
contig=<ID=chr2,length=182113224>
contig=<ID=chr3,length=160039680>
contig=<ID=chr4,length=156508116>
contig=<ID=chr5,length=151834684>
contig=<ID=chr6,length=149736546>
contig=<ID=chr7,length=145441459>
contig=<ID=chr8,length=129401213>
contig=<ID=chr9,length=124595110>
contig=<ID=chrX,length=171031299>
header order vcf • 1.7k views
ADD COMMENTlink modified 3 months ago by zx87546.8k • written 2.6 years ago by zengtony74370
1

It would be safer to just use the reference genome which was used for the mapping of the reads.

ADD REPLYlink written 2.6 years ago by WouterDeCoster37k
1

If the vcf file is not so big I would do it manually with a text editor. If a text editor doesn't work I would try with "sed"

    sed 's/contig=<ID=1/contig=<ID=chr1/g' old.vcf > new1.vcf
    sed 's/contig=<ID=2/contig=<ID=chr2/g' new1.vcf > new2.vcf
    sed 's/contig=<ID=3/contig=<ID=chr3/g' new3.vcf > new3.vcf

But you have to do it 23 times and maybe there are some problems with characters like "=" or "<"

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Pol60
1

Not only the header needs modification. Every variant would need modification.

ADD REPLYlink written 2.6 years ago by WouterDeCoster37k
2
gravatar for geek_y
2.6 years ago by
geek_y9.3k
Barcelona/CRG/London/Imperial
geek_y9.3k wrote:

the complaint is more about the order and representation of chr in your VCF file. Just do something like:

awk '{ if ($0 ~ /^#/) { gsub("ID=","ID=chr",$0); print } else { print "chr"$0} }'  tmp.vcf > new.vcf

and then

  1. use SortVcf to sort your vcf providing the sequence dictionary.

  2. delete the vcf index file. The picard indexes are sometimes not compatible with GATK. GATK creates its own vcf index.

As mentioned by WouterDeCoster, try to have only the chromosomes that are present in genome fasta.

Check here for some info. http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by geek_y9.3k

Thanks very much, Goutham Atla! I have changed the "ID=" to "ID=chr" perfectly using awk that you posted.

However, SortVcf does not work in this case because the order of "chr" in header lines and data lines are BOTH inconsistent with the order of reference.dict. According to the dictionary it says that SortVcf sorts the records in VCF files according to the order of the contigs in the header/sequence dictionary and then by coordinate. in my case, the "chr" order in the header of VCF is not the same as reference. unless the "chr" order of header is the same as reference, it will sort the data lines in the way i need.

in the example of "SortVcf"

java -jar picard.jar SortVcf \
      I=vcf_1.vcf \
      I=vcf_2.vcf \
      O=sorted.vcf

there is no reference file called "reference.dict". it does not work in my case.

I tried another script called vcfsorter.pl.

perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR &

you can see "genome.dict" is used as a reference to sort the chr order of both header lines and data lines.

Unfortunately, my supercomputer shows

+  Killed                  perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR

not sure why, but seems like this script has used a lot of computer resources even my account has 70GB of vmem quota that i can use to run program

any idead?

ADD REPLYlink written 2.6 years ago by zengtony74370
0
gravatar for zengtony743
2.6 years ago by
zengtony74370
Canada
zengtony74370 wrote:

Here is the problem when I ran SorterVcf

java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

it shows:

[rzeng@qlogin4 reference]$ [Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf INPUT=[new.vcf] OUTPUT=Newdbsnp.vcf SEQUENCE_DICTIONARY=genome.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sat Aug 06 18:02:21 EDT 2016] Executing as rzeng@qlogin4 on Linux 2.6.32-642.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14; Picard version: 2.5.0(2c370988aefe41f579920c8a6a678a201c5261c1_1466708365)
[Sat Aug 06 18:02:21 EDT 2016] picard.vcf.SortVcf done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2027945984
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:126)
at picard.vcf.SortVcf.doWork(SortVcf.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null) was found when SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:166)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:124)
... 4 more

[1]+ Exit 1 java -jar /hpf/tools/centos6/picard-tools/2.5.0/picard.jar SortVcf SD=genome.dict I=new.vcf o=Newdbsnp.vcf

My computer has 70GB vmem capacity. It should not be killed because of computer. Did i miss something?

ADD COMMENTlink written 2.6 years ago by zengtony74370
1

the error message is really clear:

SAM dictionaries are not the same: 

SAMSequenceRecord(name=chr1,length=195471971,dict_index=0,assembly=null)

was found when 

SAMSequenceRecord(name=chr10,length=130694993,dict_index=0,assembly=null) was expected.
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Pierre Lindenbaum118k

Thanks, Pierre. Do you have idea why

perl vcfsorter.pl genome.dict new.vcf > Newdbsnp.vcf 2> STDERR

Was killed? Thanks

ADD REPLYlink written 2.6 years ago by zengtony74370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1129 users visited in the last hour