Question: vcf-merge is giving me a few SNPs in the merged vcf file
1
gravatar for shinken123
3.1 years ago by
shinken12380
México
shinken12380 wrote:

Hi

I am using vcf-merge to join to vcf files, one contain the data of 56 individuals and the other of only one. One of my files have 524754 SNPs and the other 105878 however when I join them I obtain just a few SNPs ~10000 SNPs and I obtain the next message error:

Wrong number of values in LANMLR17B064:250550525/PL at 1:21709201 .. nAlleles=4, nValues=3.
Expected 10 values for diploid genotypes or 4 for haploid genotypes.

at /LUSTRE/storage/data/software/vcftools/lib/perl5/site_perl/Vcf.pm line 172, <$__ANONIO__> line 10372.
Vcf::throw(Vcf4_0=HASH(0x7b12a8), "Wrong number of values in LANMLR17B064:250550525/PL at 1:2170"...) called at /LUSTRE/storage/data/software/vcftools/lib/perl5/site_perl/Vcf.pm line 1767
VcfReader::parse_AGtags(Vcf4_0=HASH(0x7b12a8), HASH(0xb4d080)) called at /LUSTRE/storage/data/software/vcftools/bin/vcf-merge line 464
main::merge_vcf_files(HASH(0xa4ea10)) called at /LUSTRE/storage/data/software/vcftools/bin/vcf-merge line 12

Do you know what is happening here? Why I obtain just a few SNPs in the merged file?

snp • 1.5k views
ADD COMMENTlink modified 23 months ago by olavur100 • written 3.1 years ago by shinken12380

Posting the lines containing '1:21709201' (for example, search for it using grep) could help - sounds like that SNP has some weird alleles in your two files

ADD REPLYlink written 3.1 years ago by Philipp Bayer6.5k

Ok thanks, I removed some individuals that I suspect have sequencing errors but the problem is the same, here are the first lines of my two files that I am merging:

 zcat PTxB73BC1S5RIL_0.1_0.9_56ind_V3_sorted.vcf.gz | grep -v "#" | head -n 2

1 10045 S1_10045 G C . PASS DP=36 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 0/0:2,0:2:79:0,6,72 ./.:0,0:0 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:3,0:3:88:0,9,108 0/0:2,0:2:79:0,6,72 1/1:0,2:2:79:72,6,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 1/1:0,1:1:66:36,3,0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 ./.:0,0:0

1 157465 S1_157465 C T . PASS DP=42 GT:AD:DP:GQ:PL ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 1/1:0,3:3:88:108,9,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 1/1:0,1:1:66:36,3,0 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 ./.:0,0:0 1/1:0,3:3:88:108,9,0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0

 zcat PTxB73F1_filtrado_sorted.vcf_bgzip.gz | grep -v "#" | head -n 2

1 2064 . C CTTT 622.73 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.347;ClippingRankSum=-1.807;DP=94;ExcessHet=3.0103;FS=4.769;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.619;QD=10.04;ReadPosRankSum=1.882;SOR=0.450 GT:AD:DP:GQ:PL 0/1:35,27:62:99:660,0,1325

1 2179 . T C 2109.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.590;ClippingRankSum=-1.191;DP=119;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.548;QD=17.73;ReadPosRankSum=3.280;SOR=0.633 GT:AD:DP:GQ:PL 0/1:49,70:119:99:2138,0,1187

ADD REPLYlink written 3.1 years ago by shinken12380

Do you still have the same error involving that particular SNP at 21709201? What do the files print there?

ADD REPLYlink written 3.1 years ago by Philipp Bayer6.5k

Ok, I got it,

I have this:

zcat PTxB73RIL_0.1_0.9_el_bueno_V3_sorted.vcf.gz | grep -v "#" | grep "21709201" | head

1 21709201 S1_21710020 CA NT,NG,N . PASS DP=341 GT:AD:DP:GQ:PL 0/0:5,0,0,0:5:96:0,15,180 0/1:3,2,0,0:5:99:57,0,93 0/0:4,0,0,0:4:94:0,12,144 0/1:2,2,0,0:4:99:60,0,60 0/0:9,0,0,0:9:99:0,27,255 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/0:1,0,0,0:1:66:0,3,36 0/0:3,0,0,0:3:88:0,9,108 0/1:5,3,0,0:8:99:84,0,156 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/1:3,1,0,0:4:99:24,0,96 0/0:5,0,0,0:5:96:0,15,180 0/0:6,0,0,0:6:98:0,18,216 0/0:4,0,0,0:4:94:0,12,144 0/0:7,0,0,0:7:99:0,21,252 0/0:4,0,0,0:4:94:0,12,144 0/0:2,0,0,0:2:79:0,6,72 0/0:3,0,0,0:3:88:0,9,1080/0:2,0,0,0:2:79:0,6,72 0/0:2,0,0,0:2:79:0,6,72 0/0:6,0,0,1:7:98:0,18,216 0/0:4,0,0,0:4:94:0,12,144 0/1:3,1,0,0:4:99:24,0,960/0:3,0,0,0:3:88:0,9,108 0/0:4,0,0,0:4:94:0,12,144 0/3:3,0,0,1:4:88:0,9,108 ./.:0,0,0,0:0 1/1:0,4,0,0:4:94:144,12,0 0/0:1,0,0,0:1:66:0,3,36 0/0:3,0,0,0:3:88:0,9,108 0/0:2,0,0,0:2:79:0,6,72 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 1/0:3,6,0,0:9:99:189,0,81 0/3:1,0,0,1:2:66:0,3,36 1/0:1,2,0,0:3:99:63,0,27 ./.:0,0,0,0:0 0/0:5,0,0,0:5:96:0,15,180 0/0:3,0,0,0:3:88:0,9,108 0/1:1,1,0,0:2:99:30,0,30 0/0:1,0,0,0:1:66:0,3,36 0/0:5,0,0,0:5:96:0,15,180 0/1:3,1,0,0:4:99:24,0,96 1/1:1,6,0,0:7:96:195,0,15 ./.:0,0,0,0:0 1/0:1,2,0,0:3:99:63,0,27 ./.:0,0,0,0:0 2/2:0,0,1,0:1:33:0,0,0 0/0:6,0,0,0:6:98:0,18,216 0/1:1,1,0,0:2:99:30,0,30 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 1/1:0,1,0,0:1:66:36,3,0 0/0:3,0,0,0:3:88:0,9,108 1/2:0,3,2,0:5:88:108,9,0 0/0:2,0,0,0:2:79:0,6,72 0/0:2,0,0,0:2:79:0,6,720/0:6,0,0,0:6:98:0,18,216 0/0:2,0,0,0:2:79:0,6,72 0/1:8,4,0,0:12:99:108,0,252 0/0:3,0,0,0:3:88:0,9,108 0/0:5,0,0,0:5:96:0,15,180 0/0:2,0,0,0:2:79:0,6,72 1/0:2,3,0,0:5:99:93,0,57 0/0:2,0,0,0:2:79:0,6,72 0/3:2,0,0,2:4:79:0,6,72 0/0:9,0,0,0:9:99:0,27,255 0/0:3,0,0,0:3:88:0,9,108 0/0:9,0,0,0:9:99:0,27,255 0/0:5,0,0,0:5:96:0,15,180 1/0:1,5,0,0:6:98:162,0,18 1/0:1,2,0,0:3:99:63,0,27 0/0:8,0,0,0:8:99:0,24,255 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 0/0:10,0,0,0:10:99:0,30,255 0/0:3,0,0,0:3:88:0,9,108 0/0:7,0,0,1:8:99:0,21,252 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/1:2,2,0,0:4:99:60,0,60 0/0:5,0,0,0:5:96:0,15,180 0/0:5,0,0,0:5:96:0,15,180 0/0:3,0,0,0:3:88:0,9,108 0/0:3,0,0,0:3:88:0,9,108 0/0:3,0,0,0:3:88:0,9,108

and for LANMLR17B064:250550525 in this line I have

zcat PTxB73RIL_0.1_0.9_el_bueno_V3_sorted.vcf.gz | grep -v "#" | grep "21709201" | head | awk '{print $46}'

0/3:1,0,0,1:2:66:0,3,36

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by shinken12380

Also this problems could be because my vcf files have different versions one is VCFv4.0 and the other VCFv4.2. What do you think? I also know that in the GT filed we could have 0/0, 0/1 and 1/1 but in this case we have 0/3 that is weird right?

ADD REPLYlink written 3.1 years ago by shinken12380

Is the 0/3 in the second file for a different alternative allele? In the first file, 0/3 means CA/N, what does it mean in the second file?

ADD REPLYlink written 3.1 years ago by Philipp Bayer6.5k

Ok, thanks.

In V4.0:

GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid).

And in v4.2 the definition is the same:

GT : genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field),  1 for the first allele listed in ALT, 2 for the second allele list in ALT and so  on.   For  diploid  calls  examples  could  be  0/1,  1|0,  or  1/2,  etc.   For  haploid  calls,  e.g.   on  Y,  male  non-pseudoautosomal  X,  or mitochondrion,  only  one  allele  value  should  be  given;  a  triploid  call  might  look  like 0/0/1.  If a call cannot be made for a sample at a given locus, '.'  should be specified for each missing allele in the GT field (for example './.' for a diploid genotype and '.'  for haploid genotype).

So I guess that I am not going to have problems with that.

At the end I use bcftools and I removed the PL and GQ fields. For the rest It looks that the data is the same and can be combined. What do you think?

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by shinken12380
1
gravatar for Claire Malley
3.0 years ago by
National Institutes of Health, Rockville, MD
Claire Malley40 wrote:

Hi Shinken,

I just encountered this error as well with more than 2 alternate alleles for some variants. I found that bcftools merge was able to run fine, and gave the entire length of the vcf files, whereas the vcftools vcf-merge truncates the file. See: https://samtools.github.io/bcftools/bcftools.html#merge

ADD COMMENTlink written 3.0 years ago by Claire Malley40

Thank you, yes bcftools is working fine.

ADD REPLYlink written 3.0 years ago by shinken12380

This didn't work for me, with bcftools v1.6. I did find another solution, however, which I added an answer for.

ADD REPLYlink written 23 months ago by olavur100
0
gravatar for olavur
23 months ago by
olavur100
Tórshavn, Faroe Islands
olavur100 wrote:

One possible solution to this problem is to exclude multiallelic sites.

In my case, the problem was caused by a site that had 3 possible ALT alleles for a single individual, which means there are 10 (n (n + 1) / 2) possible genotypes, but the GL (genotype likelihood) field only contained 3 values. Such a site will probably always be filtered out, if nothing else then due to allele fraction, but I wanted to remove such sites more deliberately.

So my solution to this problem is this. Simply filter out multiallelic sites, which using bcftools looks like this:

bcftools view --max-alleles 2 sample.vcf.gz | bgzip -c > filtered.vcf.gz

Do this for all your samples, and then merge them.

ADD COMMENTlink written 23 months ago by olavur100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1656 users visited in the last hour