Can't liftover vcf file from hg19 to hg38
1
1
Entering edit mode
14 months ago
Victor ▴ 10

Hello everyone

I have a vcf file that I'm trying to convert from hg19 to hg38. For that I'm using bcftools +liftover command from here. I previously tried to use picard VCF but the memory cost was too much from the amount of data.

Whatever I do I end up running into a different error, if I try to run with hg19ToHg38.over.chain as the chain I get the error "The INFO tag "AF" is not the correct AGR tag".

If I try to use GRCh37_to_GRCh38.chain instead I get the error "Could not parse integer 167417 50000 80249 in the chain file: convert19-38/GRCh37_to_GRCh38.chain" instead.

I already tried using different references for both hg19 and hg38. for hg19 I used both goldenpath's hg19.fa and human_g1k_v37.fasta. for hg38 I tried using hg38.fa, GRCh38_full_analysis_set_plus_decoy_hla.fa, Homo_sapiens.GRCh38.dna.primary_assembly.fa, hg38Patch11.fa and GCA_000001405.15_GRCh38_no_alt_analysis_set.fna with both chain files

Here is the command I'm trying to run:

bcftools +liftover -Oz -o 510k_hg38.vcf output_510k/concat.dose.vcf.gz -- -s ref37/human_g1k_v37.fasta -f ref38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -c convert19-38/hg19ToHg38.over.chain

As mentioned, I already tried different combinations of chain files, ref37 and ref38. The vcf file is already inputed to hrc 1.1 with michigan inputation server if that makes a difference.

bcftools • 2.2k views
ADD COMMENT
0
Entering edit mode

The first error is probably related to VCF versioning. What version is your VCF file?

ADD REPLY
0
Entering edit mode

it says fileformat=VCFv4.1 on the vcf file

ADD REPLY
0
Entering edit mode

Weird. That error doesn't seem like a known thing either. Can you try a method from this thread: Lift-over on a VCF

The error looks like the result from a length check (https://github.com/freeseek/score/blob/1fcbc3e0d6dd600d871ebe1a1ffdf34672ad1e1d/liftover.c#L460), can you figure out which record gives you that error? The full error message should say something more specific about the record.

ADD REPLY
0
Entering edit mode

Sorry, I'm not sure how to find more about the record. that is the whole error message, it always happens on that same postion "167417 50000 80249" no matter what references I use, I will try the other methods this weekend and tell how it goes, thanks for the suggestion!

ADD REPLY
0
Entering edit mode

Have you tried going back to first principles, and creating a bed file from your VCF using the ID field (or ${chr}_${start}_${stop}_${ref}_${alt}) as names; lifting the bed over using ucsc liftOver; and then re-mapping the positions? It's slightly more involved will not break in the same way.

ADD REPLY
0
Entering edit mode

I haven't, I will try that and post the results. Probably should've asked for help earlier because I won't be able to finish it today, but thanks for the suggestion!

ADD REPLY
5
Entering edit mode
14 months ago

The error message from bcftools +liftover needs to be improved but this is an issue with the header of the VCF. You should have something like:

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">

But maybe you have something like:

##INFO=<ID=AF,Number=1,Type=Float,Description="Allele frequency">

So maybe try the following to see if it resolves the issue:

zcat output_510k/concat.dose.vcf.gz | sed 's/^\(##INFO=<ID=AF,Number=\)./\1A/' | \
bcftools +liftover -Oz -o 510k_hg38.vcf.gz -- \
  -s ref37/human_g1k_v37.fasta \
  -f ref38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -c convert19-38/hg19ToHg38.over.chain

I will make sure the next version of bcftools +liftover gives a better error message. How did you create the output_510k/concat.dose.vcf.gz VCF file?

ADD COMMENT
0
Entering edit mode

So you need VCFv4.2?

ADD REPLY
1
Entering edit mode

Number=A was introduced in VCFv4.1 and VCFv4.3 requires the INFO/AF field to be a Number=A field. But it does not really matter what version is the VCF as BCFtools will not check whether a VCF is consistent with the specification version. In this case you just need to fix the length type of INFO/AF or alternatively (but not encouraged) prevent the INFO/AF field from being updated with option --af-tags .

ADD REPLY
0
Entering edit mode

I wonder how a VCF with such inconsistent headers is generated in the first place.

ADD REPLY
1
Entering edit mode

Technically the VCF is not inconsistent as version VCFv4.1 does not mandate the INFO/AF field as a Number=A field. What is going on here is that the imputed VCF only contains bi-allelic variants and for those type of variants there is no difference between a Number=1 field and a Number=A field. However it would be nice to see the imputation server adopting the latest specifications rules

ADD REPLY
0
Entering edit mode

we took the data from a study, inputed according to the michigan inputation server data preparation, then joined the chromossome files with bcftools concat

I think I'm starting to see the problem, you were spot on on how the header was. using the command you sent indeed solved the error, this time I got "The INFO tag "DS" is not the correct AGR tag", so I will try doing a similar thing on other problematic parts. thank you so very much for the help

in case you are wondering, this is how that part of my header is:

ADD REPLY
1
Entering edit mode

Then try the following for now:

zcat output_510k/concat.dose.vcf.gz | \
  sed 's/^\(##INFO=<ID=AF,Number=\)./\1A/;s/^\(##FORMAT=<ID=DS,Number=\)./\1A/' | \
bcftools +liftover -Oz -o 510k_hg38.vcf.gz -- \
  -s ref37/human_g1k_v37.fasta \
  -f ref38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  -c convert19-38/hg19ToHg38.over.chain

I will make sure the next version prevents these cases from being an issue by just throwing a warning instead [edited to change INFO to FORMAT for DS]

ADD REPLY
0
Entering edit mode

For some reason, the DS field keeps getting the same error, I tried to subtitute 1 for a dot "." instead of A but also to no avail.

also, correcting my mistake from earlier, the error I got wasn't

The INFO tag "DS" is not the correct AGR tag

but rather:

The FORMAT tag "DS" is not the correct AGR tag
ADD REPLY
0
Entering edit mode

that worked wonders, although right now I have another problem, the command runs for a couple seconds then gives segmentation fault with no additional message or information and stops. it also happens exactly when the output file is at 10 megabytes every time. I checked and everything is up to date. also tried to upgrade the virtual machine and multithread in case it was because of memory, this just made it reach the 10 mb and give the error faster. I also tried to increase the soft limit of the files open to no avail.

Since your solution worked I'll mark it as solved, maybe you have a clue about the second error? otherwise I'll make another thread specifically to it. thank you very much for the help!

edit: seems it was just a problem with the reference file. I used other reference for the hg38 and it worked fine, will update later if it worked

edit2: changing the reference indeed worked perfectly, weirdly enough using nohup also caused problems there where it stopped the process in a certain point, but without it everything went perfectly

ADD REPLY
0
Entering edit mode

Victor it would be nice to be able to reproduce your issue as it something that should not happen. Is there any chance you could send me something to reproduce it or run the command that gave you the segmentation fault with valgrind, that is, instead of running bcftools +liftover run valgrind bcftools +liftover and send me the output via email (which you can find here)?

ADD REPLY

Login before adding your answer.

Traffic: 2758 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6