Question

Vcf Sort According To Order Of The Reference File

Votes: 1

10.5 years ago

Tonyzeng ▴ 310

Hi, when I run GATK it fails with an error saying that the VCF variant file is not in the same order as the reference file, or is not compatible with it. I used vcfsorter.pl to sort my VCF file according to the genome file, using:

$ perl vcfsorter.pl genome.dict New.vcf > New1.vcf 2>STDERR

It generated a new file called New1.vcf. However, New1.vcf contains only the header; all the data lines were removed (the original VCF, New.vcf, includes both the header and the data lines). Besides, the ##contig=<ID=number> lines are exactly the same as in New.vcf, with no changes.

The chromosome order in my reference file (genome.dict or genome.fa) is 10,11,12,13,14,15,16,17,18,19,1,2,3,4,5,6,7,8,9,X,Y. The header of New.vcf has the chromosome order 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y. New1.vcf (the file sorted with vcfsorter.pl above) has the same chromosome order as New.vcf: 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y.

I am pretty confused about the following:

1) Does incompatibility between the VCF file and the reference file mean JUST the chromosome order, as above? Since my reference is in .fa format, does the VCF header need the same ##contig= order as the reference file?

2) Why does vcfsorter.pl generate a VCF file that keeps the header but drops the data lines?

3) When I compare the reference .dict file (genome.dict) with the sorted VCF file (New1.vcf), the header contents seem quite different. Is this normal?

[rzeng@prism reference]$ more genome.dict

@HD VN:1.4 SO:unsorted
@SQ SN:chr10 LN:130694993
@SQ SN:chr11 LN:122082543
@SQ SN:chr12 LN:120129022
@SQ SN:chr13 LN:120421639
@SQ SN:chr14 LN:124902244
@SQ SN:chr15 LN:104043685
@SQ SN:chr16 LN:98207768
@SQ SN:chr17 LN:94987271
@SQ SN:chr18 LN:90702639
@SQ SN:chr19 LN:61431566
@SQ SN:chr1 LN:195471971
@SQ SN:chr2 LN:182113224
@SQ SN:chr3 LN:160039680
@SQ SN:chr4 LN:156508116
@SQ SN:chr5 LN:151834684
@SQ SN:chr6 LN:149736546
@SQ SN:chr7 LN:145441459
@SQ SN:chr8 LN:129401213
@SQ SN:chr9 LN:124595110
@SQ SN:chrM LN:16299
@SQ SN:chrX LN:171031299
@SQ SN:chrY LN:91744698
UR:file:/raid1/rzeng/reference/genome.fa M5:7831ecda5dd6bcf838e2452ea0139eac
UR:file:/raid1/rzeng/reference/genome.fa M5:e168c7a3194813f597181f26bb1bd13f
UR:file:/raid1/rzeng/reference/genome.fa M5:671f85bb54a6e097d631e2e2dd813ad4
UR:file:/raid1/rzeng/reference/genome.fa M5:7f9b9fa3fbd9a38634107dfdc7fd8dc8
UR:file:/raid1/rzeng/reference/genome.fa M5:bf4e1efa25a8fd23b41c91f9bcb86388
UR:file:/raid1/rzeng/reference/genome.fa M5:106358dace00e5825ae337c1f1821b5e
UR:file:/raid1/rzeng/reference/genome.fa M5:5482110a6cedd3558272325eee4c5a17
UR:file:/raid1/rzeng/reference/genome.fa M5:0d21e8edbfcd8410523b2b94e6dae892
UR:file:/raid1/rzeng/reference/genome.fa M5:46fda2f7e6dbf91bff91d6703e004afb
UR:file:/raid1/rzeng/reference/genome.fa M5:7d363594531514ce41dfacfd97bc995d
UR:file:/raid1/rzeng/reference/genome.fa M5:c4ec915e7348d42648eefc1534b71c99
UR:file:/raid1/rzeng/reference/genome.fa M5:fe020a692e23f8468b376e14e304a10f
UR:file:/raid1/rzeng/reference/genome.fa M5:50f9385167e70825931231ddf1181b80
UR:file:/raid1/rzeng/reference/genome.fa M5:e7bdfb3ce7f54d2720b0718ed2ea063c
UR:file:/raid1/rzeng/reference/genome.fa M5:095f3d4ebe1f0bafff057cc9b130186d
UR:file:/raid1/rzeng/reference/genome.fa M5:62628d042ea5e01adff5b481d23def67
UR:file:/raid1/rzeng/reference/genome.fa M5:65da9ab01a76dcbcaef6f32a753585c1
UR:file:/raid1/rzeng/reference/genome.fa M5:dd2d079a37c02e8a3f95abff9e37ac69
UR:file:/raid1/rzeng/reference/genome.fa M5:ef8a85e56b750c10568656361fac7990
UR:file:/raid1/rzeng/reference/genome.fa M5:11c8af2a2528b25f2c080ab7da42edda
UR:file:/raid1/rzeng/reference/genome.fa M5:b3db6d6da78d5268688ee395c2c8cb4a
UR:file:/raid1/rzeng/reference/genome.fa M5:837a35bcca18643d030d4eec5e5b9c64

[rzeng@prism reference]$ more New1.vcf

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=FI,Number=1,Type=Integer,Description="Pass(1) or fail (0) filter">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality non-reference bases">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS mapping quality">
##INFO=<ID=MSD,Number=1,Type=Float,Description="Maximum depth across non-ref genotypes">
##INFO=<ID=PV0,Number=1,Type=Float,Description="P-value for strand bias">
##INFO=<ID=PV1,Number=1,Type=Float,Description="P-value for baseQ bias">
##INFO=<ID=PV2,Number=1,Type=Float,Description="P-value for mapQ bias">
##INFO=<ID=PV3,Number=1,Type=Float,Description="P-value for tail distance bias">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=QD,Number=1,Type=Float,Description="Quality by depth">
##INFO=<ID=SB,Number=1,Type=Float,Description="Strand bias">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant distance bias">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2

gatk • 21k views

Written 10.5 years ago by Tonyzeng • updated 8.9 years ago by Biostar
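For readers who want to check the two headers posted above, here is a minimal Python sketch (not part of GATK or vcfsorter.pl) that compares the sequence names in a .dict file with the ##contig IDs in a VCF header. The function names are hypothetical and the inputs in the demo are short excerpts from the question. It reports names that are missing from the reference separately from ordering, since a naming mismatch such as 1 versus chr1 is a different problem from contig order.

```python
import re


def dict_contigs(dict_text):
    """Sequence names (SN:) from the @SQ lines of a .dict file, in file order."""
    names = []
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            names += [f[3:] for f in line.split() if f.startswith("SN:")]
    return names


def vcf_contigs(vcf_text):
    """Contig IDs from the ##contig header lines of a VCF, in file order."""
    pattern = re.compile(r"^##contig=<ID=([^,>]+)")
    return [m.group(1) for m in map(pattern.match, vcf_text.splitlines()) if m]


def check_contigs(ref_names, vcf_names):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    ref_rank = {name: i for i, name in enumerate(ref_names)}
    unknown = [n for n in vcf_names if n not in ref_rank]
    if unknown:
        problems.append("names not in reference: " + ",".join(unknown))
    # Order is only compared for names that exist in the reference.
    known_ranks = [ref_rank[n] for n in vcf_names if n in ref_rank]
    if known_ranks != sorted(known_ranks):
        problems.append("contig order differs from reference")
    return problems


if __name__ == "__main__":
    # Excerpts from the question; a real .dict file is tab-separated.
    dict_excerpt = "@HD\tVN:1.4\tSO:unsorted\n@SQ\tSN:chr10\tLN:130694993\n@SQ\tSN:chr1\tLN:195471971\n"
    vcf_excerpt = "##fileformat=VCFv4.1\n##contig=<ID=1,length=195471971>\n##contig=<ID=10,length=130694993>\n"
    for problem in check_contigs(dict_contigs(dict_excerpt), vcf_contigs(vcf_excerpt)):
        print(problem)
```

On these excerpts the check reports that 1 and 10 are not reference names, which points at the chr prefix rather than at the order alone.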

Votes: 1

Please edit and format your question.

Reply by Pierre Lindenbaum • 10.5 years ago

Ram · Answer 1 · 2013-10-27

Votes: 5

10.5 years ago

Pierre Lindenbaum 161k

(update: ) Use Picard SortVcf

~~I wrote a tool named SortVcfOnRef to sort a VCF using an indexed reference.~~

~~Example:~~

~~cat input.vcf | java -jar dist/sortvcfonref.jar REF=ref.fa | bgzip -c > result.vcf.gz && tabix -p vcf -f result.vcf.gz~~

Written 10.5 years ago by Pierre Lindenbaum • updated 4.3 years ago by Ram
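To illustrate what sorting a VCF by reference order involves, here is a minimal in-memory Python sketch. It is not the implementation of Picard SortVcf or SortVcfOnRef; the function name is hypothetical, and it assumes the chromosome names in the VCF already match those in the reference and that the file fits in memory. Header lines, including the ##contig lines, are passed through unchanged.

```python
def sort_vcf_lines(vcf_lines, ref_order):
    """Return header lines followed by records sorted by (reference rank, POS).

    vcf_lines: iterable of VCF lines (tab-separated records).
    ref_order: list of sequence names in reference order, e.g. from a .dict file.
    """
    rank = {name: i for i, name in enumerate(ref_order)}
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if chrom not in rank:
            # A name that is absent from the reference cannot be ranked.
            raise ValueError(f"unknown chromosome {chrom!r}: not in reference")
        records.append((rank[chrom], int(pos), line))
    # Sort on rank and position only; ties keep their input order.
    records.sort(key=lambda r: (r[0], r[1]))
    return header + [r[2] for r in records]


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT",
        "chr1\t200\t.\tG\tT",
        "chr10\t50\t.\tA\tC",
        "chr1\t100\t.\tG\tT",
    ]
    print("\n".join(sort_vcf_lines(demo, ["chr10", "chr1"])))
```

Raising an error on an unrecognised name, rather than silently dropping the record, makes a naming mismatch between the VCF and the reference visible immediately.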

Votes: 0

Pierre, thank you for the information!

I opened the SortVcfOnRef link but could not find where to download sortvcfonref.jar. Did I miss something here?

Thanks again.

Reply by Tonyzeng • 10.5 years ago

Votes: 2

https://github.com/lindenb/jvarkit

Reply by Pierre Lindenbaum • 10.5 years ago

Ram · Answer 2 · 2013-11-13

Votes: 3

10.4 years ago

lh3 33k

bgzip old.vcf; tabix -pvcf old.vcf.gz
cat chr_list.txt | xargs tabix -h old.vcf.gz > new.vcf

Written 10.4 years ago by lh3 • updated 4.3 years ago by Ram
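The two commands above compress and index the VCF, then ask tabix for each chromosome in the order listed in chr_list.txt, so the records come out grouped in that order (the header is printed as-is by -h). The answer does not say how chr_list.txt is made. One possibility, sketched here in Python, is to take the names from the @SQ lines of the reference .dict file, one per line. The function name is hypothetical, and the sketch assumes the names in the .dict file are the same names used in the VCF; if they differ (for example chr1 versus 1), tabix will not find the regions.

```python
def write_chr_list(dict_path, out_path):
    """Write one sequence name per line, in .dict order, and return the names."""
    names = []
    with open(dict_path) as handle:
        for line in handle:
            if line.startswith("@SQ"):
                # SN:<name> is the sequence-name field of an @SQ line.
                names += [f[3:] for f in line.split() if f.startswith("SN:")]
    with open(out_path, "w") as out:
        out.writelines(name + "\n" for name in names)
    return names
```

For example, write_chr_list("genome.dict", "chr_list.txt") would produce the list consumed by the xargs command. Note that tabix also expects the input VCF to be sorted by position within each chromosome before indexing.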

Votes: 0

Really, helpful! Kewl trick!

Reply by kanika.151 • 5.1 years ago

Ram · Answer 3 · 2013-10-28

Votes: 0

10.5 years ago

Tonyzeng ▴ 310

Thank you, Pierre. When I ran SortVcfOnRef, it produced an empty file named result.vcf.gz. I think I missed something. Below are most of the commands and the process details. Thank you!

Here is the build.properties file as I edited it:

bigwig.dir=/raid1/rzeng/softwares/bigwig
picard.version=1.101
picard.dir=/raid1/rzeng/softwares/picard-tools-${picard.version}
picard.jar=${picard.dir}/picard-${picard.version}.jar
sam.jar=${picard.dir}/sam-${picard.version}.jar
variant.jar=${picard.dir}/variant-${picard.version}.jar
tribble.jar=${picard.dir}/tribble-${picard.version}.jar
berkeleydb.jar=/raid1/rzeng/softwares/je-5.0.34/lib/je-5.0.34.jar

Then I ran

$ cat build.properties

then

$ ant sortvcfonref

After it showed "successful", I ran

$ cat New.vcf | java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa | bgzip -c > result.vcf.gz && tabix -p vcf -f result.vcf.gz &

It printed this:

-bash: line 37: bgzip: command not found

Written 10.5 years ago by Tonyzeng • updated 4.3 years ago by Ram

Votes: 1

bgzip is not in your PATH. Install tabix, or simply run:

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf

Reply by Pierre Lindenbaum • 10.5 years ago

Votes: 1

This should be a comment, not an answer to your question

Reply by Pierre Lindenbaum • 10.5 years ago

Votes: 0

Thank you, Pierre. I ran java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf and it generated an 8.0K output.vcf that contains only the header (the data lines were removed). My original input.vcf is 1.6G and includes the data lines. Also, the order of the ##contig= lines still has not changed to match the reference sequence.

Did I miss something? Thank you very much, Pierre.

Reply by Tonyzeng • 10.5 years ago

Votes: 1

Run

 java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf  2> err.txt

What's in err.txt?

Reply by Pierre Lindenbaum • 10.5 years ago

Votes: 0

I ran it as

$ java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa < New.vcf > output.vcf 2> err.txt

and err.txt says:

[Mon Oct 28 15:35:29 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef REF=genome.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false



[Mon Oct 28 15:35:29 CDT 2013] Executing as rzeng@prism.cluster on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00; Picard version: null

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244    GT:GQ:DP:SP:PL:FI    1/1:99:17:0:216,51,0:1

at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.ref(SortVcfOnRef.java:83)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:95)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:78)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:203)
at java.util.Arrays.sort(Arrays.java:727)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:203)
at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.doWork(SortVcfOnRef.java:203)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.main(SortVcfOnRef.java:231)

[Mon Oct 28 15:35:30 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=2025979904

Reply by Tonyzeng • 10.5 years ago

Votes: 1

And is there a chromosome "1" in "genome.fa"?

Reply by Pierre Lindenbaum • 10.5 years ago

Votes: 0

I grepped for all names matching chr* in genome.fa, and it showed the following:

chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY

In my variants VCF, it shows #chrom 1, 10, ...

Maybe I should change #chrom to #chr so that it is the same as in genome.fa?

Reply by Tonyzeng • 10.5 years ago

Votes: 0

yes, you should

Reply by Pierre Lindenbaum • 10.5 years ago

Votes: 0

Thank you, Pierre.

Reply by Tonyzeng • 10.5 years ago

Votes: 0

I changed "CHROM" in the header line of input.vcf to "CHR", as follows:

#CHR    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    129P2
1    3000126    .    G    T    51.33    Qual;MinAB;EndDistBias;MinDP    AC1=1;AC=34;AF1=1;AN=36;DP4=4,0,77,11;DP=237;MDV=99;MQ=42;MSD=6;PV0=1;PV1=1;PV2=1;PV3=0.094;PV4=1,1,1,0.094;QD=0.0061;SB=0.3000;VDB=0.0038    GT:GQ:DP:SP:PL:FI    1/1:99:6:0:82,3,0:1
1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49

Then I reran sortvcfonref.jar on the modified input.vcf:

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf

It still showed the same error:

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244    GT:GQ:DP:SP:PL:FI    1/1:99:17:0:216,51,0:1

Then I went back and grepped for the line reported with the unknown chromosome 1, and it showed the following:

1       3000185 .       G       T       234.33  PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244       GT:GQ:DP:SP:PL:FI       1/1:99:17:0:216,51,0:1

I do not see anything special in this line of input.vcf, and I am confused now.

Reply by Tonyzeng • 10.5 years ago
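The exception names chromosome 1 from the first column of a data line, while genome.fa lists chr1, chr10 and so on, so it may be the chromosome names in the data lines and in the ##contig header lines that need to match the reference, rather than the #CHROM column label. A minimal Python sketch under that assumption follows. The function names are hypothetical, and the sketch only prepends chr, so a mitochondrial contig named MT (if present) would still need to be mapped to chrM by hand.

```python
CONTIG_TAG = "##contig=<ID="


def add_chr_prefix(line, prefix="chr"):
    """Prefix the chromosome name on one VCF line; other header lines pass through."""
    if line.startswith(CONTIG_TAG):
        rest = line[len(CONTIG_TAG):]
        return line if rest.startswith(prefix) else CONTIG_TAG + prefix + rest
    if line.startswith("#") or not line.strip():
        # Includes the #CHROM column header, which is left exactly as it is.
        return line
    # Data line: the chromosome name is the first tab-separated field.
    return line if line.startswith(prefix) else prefix + line


def rename_vcf(in_path, out_path, prefix="chr"):
    """Stream a VCF from in_path to out_path, renaming chromosomes line by line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(add_chr_prefix(line, prefix))
```

The renamed file would still need to be sorted into the reference order afterwards, for example with one of the tools suggested in this thread.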

score 0 · Answer 4 · 2013-11-12


10.4 years ago

kreitzman.maayan • 0

I found this tool for re-sorting a VCF according to a reference .dict file simple to use: http://code.google.com/p/vcfsorter/

Written 10.4 years ago by kreitzman.maayan