Question: Vcf Sort According To Order Of The Reference File
1
gravatar for Tonyzeng
5.5 years ago by
Tonyzeng300
Tonyzeng300 wrote:

Hi, When I run GATK with Error and showed that the variant file with VCF fomat file has not the same order with reference file or not compatible. I used VCFsorter.pl to sort my vcf file according to the genome file using

$ perl vcfsorter.pl genome.dict New.vcf > New1.vcf 2>STDERR

It generated a new file called New1.vcf. However, New1.vcf produced a file with only header in the file but removing all the data line information ( original VCF or New.vcf file include both header and data line information), besides, ##contig+<id=number is="" totally="" the="" same="" as="" the="" new.vcf="" with="" no="" any="" changes.<="" p="">

The chromosome order in my reference file (genome.dict or genome.fa) is as 10,11,12,13,14,15,16,17,18,19,1,2,3,4,5,6,7,8,9,X,Y from header of New.vcf file, it has chromosome order as 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y New1.vcf file (sorted vcf file using vcfsorter.pl program above) has the same chromosome order as New.vcf or 1,10,11,12,13,14,15,16,17,18,19,2,3,4,5,6,7,8,9,X,Y

I am pretty confused about follows:

1) in-compatiblity between VCF file and Reference file means JUST chromosome order like above? since my reference is .fa format, VCF file header requires the same ##contig=number order with the Reference files?

2) why VCFsorter.pl generate the VCF file with just leaving header but ignore the data line information?

3) when i compare the reference .dict file (genome.dict) with sorted VCF file (New1.vcf), they seems quite different in the header content, is this normal?

[rzeng@prism reference]$ more genome.dict

@HD VN:1.4 SO:unsorted @SQ SN:chr10 LN:130694993 UR:file:/raid1/rzeng/reference/genome.fa M5:7831ecda5dd6bcf838e2452ea0139eac @SQ SN:chr11 LN:122082543 UR:file:/raid1/rzeng/reference/genome.fa M5:e168c7a3194813f597181f26bb1bd13f @SQ SN:chr12 LN:120129022 UR:file:/raid1/rzeng/reference/genome.fa M5:671f85bb54a6e097d631e2e2dd813ad4 @SQ SN:chr13 LN:120421639 UR:file:/raid1/rzeng/reference/genome.fa M5:7f9b9fa3fbd9a38634107dfdc7fd8dc8 @SQ SN:chr14 LN:124902244 UR:file:/raid1/rzeng/reference/genome.fa M5:bf4e1efa25a8fd23b41c91f9bcb86388 @SQ SN:chr15 LN:104043685 UR:file:/raid1/rzeng/reference/genome.fa M5:106358dace00e5825ae337c1f1821b5e @SQ SN:chr16 LN:98207768 UR:file:/raid1/rzeng/reference/genome.fa M5:5482110a6cedd3558272325eee4c5a17 @SQ SN:chr17 LN:94987271 UR:file:/raid1/rzeng/reference/genome.fa M5:0d21e8edbfcd8410523b2b94e6dae892 @SQ SN:chr18 LN:90702639 UR:file:/raid1/rzeng/reference/genome.fa M5:46fda2f7e6dbf91bff91d6703e004afb @SQ SN:chr19 LN:61431566 UR:file:/raid1/rzeng/reference/genome.fa M5:7d363594531514ce41dfacfd97bc995d @SQ SN:chr1 LN:195471971 UR:file:/raid1/rzeng/reference/genome.fa M5:c4ec915e7348d42648eefc1534b71c99 @SQ SN:chr2 LN:182113224 UR:file:/raid1/rzeng/reference/genome.fa M5:fe020a692e23f8468b376e14e304a10f @SQ SN:chr3 LN:160039680 UR:file:/raid1/rzeng/reference/genome.fa M5:50f9385167e70825931231ddf1181b80 @SQ SN:chr4 LN:156508116 UR:file:/raid1/rzeng/reference/genome.fa M5:e7bdfb3ce7f54d2720b0718ed2ea063c @SQ SN:chr5 LN:151834684 UR:file:/raid1/rzeng/reference/genome.fa M5:095f3d4ebe1f0bafff057cc9b130186d @SQ SN:chr6 LN:149736546 UR:file:/raid1/rzeng/reference/genome.fa M5:62628d042ea5e01adff5b481d23def67 @SQ SN:chr7 LN:145441459 UR:file:/raid1/rzeng/reference/genome.fa M5:65da9ab01a76dcbcaef6f32a753585c1 @SQ SN:chr8 LN:129401213 UR:file:/raid1/rzeng/reference/genome.fa M5:dd2d079a37c02e8a3f95abff9e37ac69 @SQ SN:chr9 LN:124595110 UR:file:/raid1/rzeng/reference/genome.fa M5:ef8a85e56b750c10568656361fac7990 @SQ SN:chrM LN:16299 UR:file:/raid1/rzeng/reference/genome.fa M5:11c8af2a2528b25f2c080ab7da42edda @SQ SN:chrX LN:171031299 UR:file:/raid1/rzeng/reference/genome.fa M5:b3db6d6da78d5268688ee395c2c8cb4a @SQ SN:chrY LN:91744698 UR:file:/raid1/rzeng/reference/genome.fa M5:837a35bcca18643d030d4eec5e5b9c64

[rzeng@prism reference]$ more New1.vcf

##fileformat=VCFv4.1 ##samtoolsVersion=0.1.18-r572 ##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa ##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ) ##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus) ##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2) ##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5) ##contig=<id=1,length=195471971> ##contig=<id=10,length=130694993> ##contig=<id=11,length=122082543> ##contig=<id=12,length=120129022> ##contig=<id=13,length=120421639> ##contig=<id=14,length=124902244> ##contig=<id=15,length=104043685> ##contig=<id=16,length=98207768> ##contig=<id=17,length=94987271> ##contig=<id=18,length=90702639> ##contig=<id=19,length=61431566> ##contig=<id=2,length=182113224> ##contig=<id=3,length=160039680> ##contig=<id=4,length=156508116> ##contig=<id=5,length=151834684> ##contig=<id=6,length=149736546> ##contig=<id=7,length=145441459> ##contig=<id=8,length=129401213> ##contig=<id=9,length=124595110> ##contig=<id=x,length=171031299> ##FILTER=<id=basequalbias,description="min p-value="" for="" baseq="" bias="" (info="" pv4)="" [0]"=""> ##FILTER=<id=enddistbias,description="min p-value="" for="" end="" distance="" bias="" (info="" pv4)="" [0.0001]"=""> ##FILTER=<id=gapwin,description="window size="" for="" filtering="" adjacent="" gaps="" [3]"=""> ##FILTER=<id=het,description="genotype call="" is="" heterozygous="" (low="" quality)="" []"=""> ##FILTER=<id=mapqualbias,description="min p-value="" for="" mapq="" bias="" (info="" pv4)="" [0]"=""> ##FILTER=<id=maxdp,description="maximum read="" depth="" (info="" dp="" or="" info="" dp4)="" [200]"=""> ##FILTER=<id=minab,description="minimum number="" of="" alternate="" bases="" (info="" dp4)="" [5]"=""> ##FILTER=<id=mindp,description="minimum read="" depth="" (info="" dp="" or="" info="" dp4)="" [5]"=""> ##FILTER=<id=minmq,description="minimum rms="" mapping="" quality="" for="" snps="" (info="" mq)="" [20]"=""> ##FILTER=<id=qual,description="minimum value="" of="" the="" qual="" field="" [10]"=""> ##FILTER=<id=refn,description="reference base="" is="" n="" []"=""> ##FILTER=<id=snpgap,description="snp within="" int="" bp="" around="" a="" gap="" to="" be="" filtered="" [2]"=""> ##FILTER=<id=strandbias,description="min p-value="" for="" strand="" bias="" (info="" pv4)="" [0.0001]"=""> ##FILTER=<id=vdb,description="minimum variant="" distance="" bias="" (info="" vdb)="" [0]"=""> ##FORMAT=<id=dp,number=1,type=integer,description="# high-quality="" bases"=""> ##FORMAT=<id=gl,number=3,type=float,description="likelihoods for="" rr,ra,aa="" genotypes="" (r="ref,A=alt)""> ##FORMAT=<id=gq,number=1,type=integer,description="genotype quality"=""> ##FORMAT=<id=gt,number=1,type=string,description="genotype"> ##FORMAT=<id=pl,number=g,type=integer,description="list of="" phred-scaled="" genotype="" likelihoods"=""> ##FORMAT=<id=sp,number=1,type=integer,description="phred-scaled strand="" bias="" p-value"=""> ##FORMAT=<id=fi,number=1,type=integer,description="pass(1) or="" fail="" (0)="" filter"=""> ##INFO=<id=ac1,number=1,type=float,description="max-likelihood estimate="" of="" the="" first="" alt="" allele="" count="" (no="" hwe="" assumption)"=""> ##INFO=<id=af1,number=1,type=float,description="max-likelihood estimate="" of="" the="" first="" alt="" allele="" frequency="" (assuming="" hwe)"=""> ##INFO=<id=dp,number=1,type=integer,description="raw read="" depth"=""> ##INFO=<id=dp4,number=4,type=integer,description="# high-quality="" ref-forward="" bases,="" ref-reverse,="" alt-forward="" and="" alt-reverse="" bases"=""> ##INFO=<id=indel,number=0,type=flag,description="indicates that="" the="" variant="" is="" an="" indel."=""> ##INFO=<id=mdv,number=1,type=integer,description="maximum number="" of="" high-quality="" non-reference="" bases"=""> ##INFO=<id=mq,number=1,type=float,description="rms mapping="" quality"=""> ##INFO=<id=msd,number=1,type=float,description="maximum depth="" across="" non-ref="" genotypes"=""> ##INFO=<id=pv0,number=1,type=float,description="p-value for="" strand="" bias"=""> ##INFO=<id=pv1,number=1,type=float,description="p-value for="" baseq="" bias"=""> ##INFO=<id=pv2,number=1,type=float,description="p-value for="" mapq="" bias"=""> ##INFO=<id=pv3,number=1,type=float,description="p-value for="" tail="" distance="" bias"=""> ##INFO=<id=pv4,number=4,type=float,description="p-values for="" strand="" bias,="" baseq="" bias,="" mapq="" bias="" and="" tail="" distance="" bias"=""> ##INFO=<id=qd,number=1,type=float,description="quality by="" depth"=""> ##INFO=<id=sb,number=1,type=float,description="strand bias"=""> ##INFO=<id=vdb,number=1,type=float,description="variant distance="" bias"=""> ##INFO=<id=ac,number=.,type=integer,description="allele count="" in="" genotypes"=""> ##INFO=<id=an,number=1,type=integer,description="total number="" of="" alleles="" in="" called="" genotypes"=""> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 129P2

gatk • 14k views
ADD COMMENTlink modified 3.9 years ago by Biostar ♦♦ 20 • written 5.5 years ago by Tonyzeng300
1

please, edit and format your question.

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k
5
gravatar for Pierre Lindenbaum
5.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

(update: ) Use picard sortvcf http://broadinstitute.github.io/picard/command-line-overview.html#SortVcf

 

 

 

I wrote a tool named SortVcfOnRef ( https://github.com/lindenb/jvarkit/wiki/SortVCFOnRef ) to sort a VCF using an indexed reference.

Example:


cat input.vcf |\
   java -jar dist/sortvcfonref.jar  REF=ref.fa |\
   bgzip -c > result.vcf.gz && \
   tabix -p vcf -f result.vcf.gz
ADD COMMENTlink modified 3.9 years ago • written 5.5 years ago by Pierre Lindenbaum119k

Pierre, thank you for the information!

I opened the linker of SortVcfOnRef but could not find where to download sortvcfonref.jar script, did I miss something here?

Thanks again

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Tonyzeng300
2

https://github.com/lindenb/jvarkit

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k
2
gravatar for lh3
5.4 years ago by
lh331k
United States
lh331k wrote:
bgzip old.vcf; tabix -pvcf old.vcf.gz
cat chr_list.txt | xargs tabix -h old.vcf.gz > new.vcf
ADD COMMENTlink written 5.4 years ago by lh331k

Really, helpful! Kewl trick!

ADD REPLYlink written 29 days ago by kanika.15160
0
gravatar for Tonyzeng
5.5 years ago by
Tonyzeng300
Tonyzeng300 wrote:

Thank you Pirre, when I run the SortVcfOnRef, it produced an empty file named result.vcf.gz. I think I missed something there. Now I copy most of the commands and process details, thank you!

Here is build.properties file I have edited as

bigwig.dir=/raid1/rzeng/softwares/bigwig
picard.version=1.101
picard.dir=/raid1/rzeng/softwares/picard-tools-${picard.version}
picard.jar=${picard.dir}/picard-${picard.version}.jar
sam.jar=${picard.dir}/sam-${picard.version}.jar`
variant.jar=${picard.dir}/variant-${picard.version}.jar
tribble.jar=${picard.dir}/tribble-${picard.version}.jar
berkeleydb.jar=/raid1/rzeng/softwares/je-5.0.34/lib/je-5.0.34.jar

Then I did

  $ cat build.properties

then,

$ ant sortvcfonref

after it showed " successful"

I run

$ cat New.vcf | java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa | bgzip -c > result.vcf.gz && tabix -p vcf -f result.vcf.gz &

It showed like this

 -bash: line 37: bgzip: command not found
ADD COMMENTlink written 5.5 years ago by Tonyzeng300
1

bgzip is not in your PATH , install tabix or you can simply run

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Pierre Lindenbaum119k
1

This should be a comment, not an answer to your question

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k

thank you, Pierre, I run java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf and generate a 8.0k file output.vcf (removing data line information). My original input.vcf is 1.6G with data line information. also the order of ##contig= still not change according to reference seq.

Did i miss something? thank you very much, Pierre

ADD REPLYlink written 5.5 years ago by Tonyzeng300
1

Run

 java -jar dist/sortvcfonref.jar REF=ref.fa < input.vcf > output.vcf  2> err.txt

what 's in err.txt ?

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k

I run it as $ java -jar /raid1/rzeng/jvarkit/dist/sortvcfonref.jar REF=genome.fa < New.vcf > output.vcf 2> err.txt

then err.txt say

Mon Oct 28 15:35:29 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef REF=genome.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVE L=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false



[Mon Oct 28 15:35:29 CDT 2013] Executing as rzeng@prism.cluster on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00; Picard version: null

[Mon Oct 28 15:35:29 CDT 2013] Executing as rzeng@prism.cluster on Linux 2.6.32-358.18.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00; Picard version: null

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV

1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244 GT:GQ:DP:SP:PL:FI 1/1:99:17:0:216,51,0:1

at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.ref(SortVcfOnRef.java:83)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:95)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef$VariantComparator.compare(SortVcfOnRef.java:78)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:203)
at java.util.Arrays.sort(Arrays.java:727)
at net.sf.samtools.util.SortingCollection.spillToDisk(SortingCollection.java:203)
at net.sf.samtools.util.SortingCollection.add(SortingCollection.java:150)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.doWork(SortVcfOnRef.java:203)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
at net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:119)
at com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef.main(SortVcfOnRef.java:231)

[Mon Oct 28 15:35:30 CDT 2013] com.github.lindenb.jvarkit.tools.sortvcfonref.SortVcfOnRef done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=2025979904

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Tonyzeng300
1

and there is a chromosome "1" in "genome.fa " ???

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k

I grep all the words with chr* in genome.fa, it showed as following

chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY

in my variants VCF, it shows #chrom 1, 10 , ....

Maybe, I should change #chrom to #chr as the same as genome.fa?

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Tonyzeng300

yes, you should

ADD REPLYlink written 5.5 years ago by Pierre Lindenbaum119k

thank you Pirrer

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Tonyzeng300

I have changed "chrom" of input.vcf to "chr" as follows

#CHR    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    129P2
1    3000126    .    G    T    51.33    Qual;MinAB;EndDistBias;MinDP    
AC1=1;AC=34;AF1=1;AN=36;DP4=4,0,77,11;DP=237;MDV=99;MQ=42;MSD=6;PV0=1;PV1=1;PV2=
1;PV3=0.094;PV4=1,1,1,0.094;QD=0.0061;SB=0.3000;VDB=0.0038    GT:GQ:DP:SP:PL:F
I    1/1:99:6:0:82,3,0:1
1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;
DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49

Then, I rerun sortvcfonref.jar using input.vcf that has been changed

java -jar dist/sortvcfonref.jar  REF=ref.fa < input.vcf > output.vcf

It still showed the same information as,

java.lang.RuntimeException: unknown chromosome 1 in 1    3000185    .    G    T    234.33    PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244    GT:GQ:DP:SP:PL:FI    1/1:99:17:0:216,51,0:1

Then, I go back to grep the line which has unknown chromosome 1 and it showed as follows

1       3000185 .       G       T       234.33  PASS    AC1=1;AC=22;AF1=1;AN=36;DP4=134,13,186,26;DP=377;MDV=0;MQ=54;MSD=0;PV0=1;PV1=0.49;PV2=1;PV3=1;PV4=1,0.49,1,1;SB=0.4821;VDB=0.0244       GT:GQ:DP:SP:PL:FI       1/1:99:17:0:216,51,0:1

Do not find anything special here in the line of input.vcf and feel confused now.

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by Tonyzeng300
0
gravatar for kreitzman.maayan
5.5 years ago by
kreitzman.maayan0 wrote:

I found this tool, for resorting a vcf according to a reference .dict file, simple to use: http://code.google.com/p/vcfsorter/

ADD COMMENTlink written 5.5 years ago by kreitzman.maayan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1381 users visited in the last hour