joint callset and vcf sorting, unknown TAG issue
20 days ago
Matteo Ungaro ▴ 100

Hi there, I'm working on a joint call-set of 47 VCFs which I will be merging with GLNexus. I've done this before but, for some reason, since I added 2 extra samples to the original 45 (47 in total), there have been a few issues.

The original 45 samples are from the SGDP, called with UnifiedCaller; the 2 extra are the archaic Neanderthal and Denisova, which were called with the same pipeline on hs37d5. I happened to have the 45 samples sorted since I needed this for another task, so I thought I would sort the archaic ones as well, just to speed up the merging.

Unfortunately, upon attempting to sort I've been presented with the following:

Writing to 3.sgdp_hg19/arch_001/tempa2tnzI -> this is Neanderthal
[W::bcf_hrec_check] Invalid tag name: "1000gALT"
[W::vcf_parse_info] INFO '.' is not defined in the header, assuming Type=String
[W::bcf_hrec_check] Invalid tag name: "."
Error encountered while parsing the input at 1:121387974
Cleaning

Writing to 3.sgdp_hg19/arch_002/tempx8KiuF -> this is Denisova
[W::bcf_hrec_check] Invalid tag name: "1000gALT"
[W::vcf_parse_info] INFO '.' is not defined in the header, assuming Type=String
[W::bcf_hrec_check] Invalid tag name: "."
Error encountered while parsing the input at 1:2590169
Cleaning

First, I double-checked the 1000gALT tag: it appears to be present both in the header and in the body's entries, which leaves me perplexed as to why bcftools raises that issue in the first place.
Second, I don't understand what INFO '.' is not defined in the header and Invalid tag name: "." refer to... I checked the input lines 1:121387974 and 1:2590169 but they look fine to me...

Is there any way to prevent these issues, which stop me from sorting the two files, so that I can then merge the 47 samples? I was looking into bcftools annotate, but I tested it on Neanderthal and got this:

[W::bcf_hrec_check] Invalid tag name: "1000gALT"
[W::bcf_hrec_check] Invalid tag name: "1000gALT"
Warning: The tag "." not defined in the header
[W::vcf_parse_info] INFO '.' is not defined in the header, assuming Type=String
[W::bcf_hrec_check] Invalid tag name: "."
Encountered an error, cannot proceed. Please check the error output above.
If feeling adventurous, use the --force option. (At your own risk!)

which always goes back to the same problem. Note that these two archaic VCFs already had an issue in the FORMAT field, which I had to fix by reheadering them; however, these two problems are beyond my understanding. If anyone has any clue, I'll be very happy to try and figure out what's going on, thanks in advance!

P. S. simply merging the files seems pointless, as the process aborts for the same exact issues

sort bcftools GLNexus merge VCF • 456 views

Show us the content of the INFO column at the problematic positions.


My bad, I mixed up the positions between the two files when I looked the first time. They do indeed look abnormal, but how can I fix them? Below is the Neanderthal position:
1 121387974 . C . . . .;CpG GT:A:C:G:T:IR ./.:0,0:0,0:0,0:1,0:0

and the Denisova position of interest:
1 2590169 . C . . . .;RM;TS=HPOM;CAnc=C;OAnc=-;rMac=-;mSC=0.300;Map20=0.25 GT:A:C:G:T:IR ./.:0,0:0,1:0,0:0,0:0

Also, this issue could be pervasive... how can I fix it file-wise? Thanks in advance @Pierre Lindenbaum
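To get a sense of how pervasive the problem is before changing anything, a short scan over the uncompressed VCF text can list every record whose INFO column mixes the `.` placeholder with real entries. This is only a minimal sketch: the function names are made up for illustration, the header line and the second data line in `sample` are hypothetical, and the records are assumed to be tab-separated as in a standard VCF.

```python
def has_stray_dot(info: str) -> bool:
    """True when INFO mixes the missing-value placeholder '.' with real entries."""
    entries = info.split(";")
    return len(entries) > 1 and "." in entries


def find_stray_dot_records(lines):
    """Yield (chrom, pos, info) for data lines whose INFO column has a stray '.'."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if has_stray_dot(fields[7]):
            yield fields[0], fields[1], fields[7]


# First data line is the Neanderthal record from the thread;
# the header line and the second data line are hypothetical.
sample = [
    "##fileformat=VCFv4.1\n",
    "1\t121387974\t.\tC\t.\t.\t.\t.;CpG\tGT:A:C:G:T:IR\t./.:0,0:0,0:0,0:1,0:0\n",
    "1\t121387975\t.\tC\t.\t.\t.\t.\tGT:A:C:G:T:IR\t./.:0,0:0,0:0,0:1,0:0\n",
]
hits = list(find_stray_dot_records(sample))
print(hits)
```

A lone `.` in INFO is a legitimate "no information" value, so only records where it appears alongside other entries are reported.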


@Pierre Lindenbaum as I don't know the meaning of .; and I've seen there are many in both files, I simply removed them with grep -v.
I believe there may be a way to do this with bcftools but I'm not sure how; also, I can't find any relevant information in the GATK guide about VCF files that would let me add this detail to my headers, as I did in the previous instance.
Still, I can't understand the Invalid tag name: "1000gALT" line; if I missed something please let me know, thanks!


Still, I can't understand the Invalid tag name: "1000gALT"

the following line is missing in the VCF header:

##INFO=<ID=1000gALT,Number=0,Type=Flag,Description="xxx">

and/or

the syntax of the tag is wrong (a tag starting with a number?)
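The "starting with a number" hypothesis can be checked directly. As far as I recall, the VCF 4.3 specification restricts INFO keys to the pattern used below (with `1000G` as the single allowed exception); treat the exact regular expression as an assumption and verify it against the specification before relying on it. The helper name is made up for illustration.

```python
import re

# Pattern quoted from memory of the VCF 4.3 specification (an assumption):
# a key starts with a letter or underscore, with "1000G" as a special case.
INFO_KEY_RE = re.compile(r"([A-Za-z_][0-9A-Za-z_.]*|1000G)")


def is_valid_info_key(key: str) -> bool:
    """Return True if the whole key matches the assumed INFO key pattern."""
    return INFO_KEY_RE.fullmatch(key) is not None


# Keys taken from the warnings and records quoted in this thread
for key in ["1000gALT", ".", "CpG", "Map20"]:
    print(key, is_valid_info_key(key))
```

Under this pattern both `1000gALT` and `.` are rejected, which would be consistent with the two `Invalid tag name` warnings, whether or not the header defines the tag.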


@Pierre Lindenbaum I see. I looked up and, although present, the TAG goes like this in both files: ##INFO=<ID=1000gALT,Number=0,Type=Flag,Description="Alternative allele referred to by 1000G">

Looking around a bit I think the key is to set the Number= to 0 instead of 1; sorry but until now I wasn't aware of the difference between the two. EDIT: probably I should also change the Type to Flag


set the Number= to 0 instead of 1;

I'm not sure it's a problem.

probably I should also change the Type to Flag

ah yes, my bad ! (fixed)

20 days ago

INFO at 121387974 is

 .;CpG

This is an invalid INFO column. There is an invalid attribute named '.' followed by the attribute CpG, while I think you just wanted 'CpG'. The problem comes from the way this CpG tag was added.

Same for INFO:

 .;RM;....

How to fix this? The best option is to check and fix the tool used to set those attributes. Otherwise, you could use sed to remove those dots.
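As an alternative to a sed one-liner, here is a minimal Python sketch of the same clean-up. It touches only the INFO column (the eighth tab-separated field), so the `.` placeholders in ID, ALT, QUAL and FILTER are left alone, and the record itself is kept rather than dropped. The function names are made up for illustration, and the input is assumed to be uncompressed, tab-separated VCF text.

```python
def clean_info(info: str) -> str:
    """Drop stray '.' entries from an INFO string; keep '.' if nothing else remains."""
    entries = [e for e in info.split(";") if e not in (".", "")]
    return ";".join(entries) if entries else "."


def clean_vcf_line(line: str) -> str:
    """Return the line with its INFO column cleaned; header lines pass through."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 8:
        fields[7] = clean_info(fields[7])
    return "\t".join(fields) + newline


# The Neanderthal record quoted in the thread, with tabs restored
record = "1\t121387974\t.\tC\t.\t.\t.\t.;CpG\tGT:A:C:G:T:IR\t./.:0,0:0,0:0,0:1,0:0\n"
print(clean_vcf_line(record), end="")
```

Applied line by line to the decompressed file, this rewrites `.;CpG` as `CpG` and leaves a lone `.` untouched. The result would still need to be recompressed and checked with bcftools, and it does not address the separate `1000gALT` warning.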


@Pierre Lindenbaum much appreciated, thanks for the explanation and possible approaches to solve this.
I'm not in a position to run UnifiedCaller on these samples, given the time and resource restrictions on my end while working on a busy HPC cluster; these files were, in fact, downloaded as VCFs from the repository of the Max Planck Institute, which generated them. I'm simply surprised they didn't do a sanity check before putting them out...
