Question

Sanger Imputation Service - The input file sanity check failed : faidx_fetch_seq failed at JH636052.4:4111085

5.6 years ago

sandKings ▴ 30

I am trying to impute a genotype dataset using the Sanger Imputation Service. The VCF file was created with the GATK RNA-seq variant calling workflow. I have worked through several errors, but now I'm stuck at:

--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: [E::faidx_fetch_seq] The sequence "JH636052.4" not found

faidx_fetch_seq failed at JH636052.4:4111085

My RNA-seq data is mapped to GRCh37, and the VCF files are compressed and indexed. When I run +fixref on the VCF file,

$ bcftools +fixref input.vcf.gz -- -f /GRCh37.p13.genome.fa

I get the following report:

# SC, guessed strand convention
SC  TOP-compatible  0
SC  BOT-compatible  0
# ST, substitution types
ST  A>C 601 3.2%
ST  A>G 3641    19.2%
ST  A>T 432 2.3%
ST  C>A 620 3.3%
ST  C>G 833 4.4%
ST  C>T 3340    17.6%
ST  G>A 3361    17.7%
ST  G>C 871 4.6%
ST  G>T 643 3.4%
ST  T>A 423 2.2%
ST  T>C 3560    18.8%
ST  T>G 611 3.2%
# NS, Number of sites:
NS  total           19864
NS  ref match       18936   100.0%
NS  ref mismatch    0   0.0%
NS  skipped         928
NS  non-ACGT        0
NS  non-SNP         925
NS  non-biallelic   3

RNASeq variant calling • 1.9k views

score 1 · Answer 1 · 2018-09-05

This looks like a genome version mismatch. In these situations it helps to look up the sequence in GenBank: https://www.ncbi.nlm.nih.gov/nuccore/JH636052.4?report=genbank

So it looks like the genome used in one part of your pipeline contains the patch sequence JH636052.4, while the reference used by the other tool contains a different version of it or does not contain it at all. Simply run grep -e 'JH636052.4' on _your_ FASTA file to check. Even if you are using GRCh37 in both places, the two copies may well be at different patch levels. Another possibility is that one assembly contains only the chromosome-level sequences, without the unplaced ones.
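To see every contig that is affected, not just the first one the service complains about, the grep check can be extended into a comparison of the contig names in the VCF against the FASTA headers. The following is a minimal sketch using only awk and gzip; the function name missing_contigs and the file names in the usage note are placeholders, and it assumes that the first word of each ">" header line is the contig name.

```shell
# Print each contig name that appears in the VCF records (column 1)
# but has no matching ">" header in the FASTA file.
# Arguments: $1 = VCF (plain or .gz), $2 = FASTA (uncompressed)
missing_contigs() {
    mc_vcf="$1"
    mc_fasta="$2"
    # bgzip output is gzip-compatible, so gzip -dc can read .vcf.gz files.
    case "$mc_vcf" in
        *.gz) gzip -dc "$mc_vcf" ;;
        *)    cat "$mc_vcf" ;;
    esac | awk -F'\t' -v fasta="$mc_fasta" '
        BEGIN {
            # Collect contig names from the FASTA headers,
            # keeping only the first word after ">".
            while ((getline line < fasta) > 0) {
                if (line ~ /^>/) {
                    name = substr(line, 2)
                    sub(/[ \t].*/, "", name)
                    known[name] = 1
                }
            }
            close(fasta)
        }
        /^#/ { next }                                   # skip VCF header lines
        !($1 in known) && !seen[$1]++ { print $1 }      # report each unknown contig once
    '
}
```

Calling it as missing_contigs input.vcf.gz reference.fa prints nothing when every contig in the VCF exists in that FASTA. Run it against the reference the imputation service expects, if you can obtain it, rather than only against your own.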

Btw: any reason why you are not using the latest genome-build?

Solution: update your pipeline to the latest genome build that is supported by all tools. Suboptimal solution: first make sure you are using the same genome build everywhere, otherwise your results will be unreliable anyway. If all tools use the same reference version but some of them exclude some of the unplaced sequences, reduce _your_ reference to the intersection of the sequences available to all tools, or to the chromosomes only (not that good).
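If you go the "intersection" route, the VCF has to be restricted to the same set of contigs as the reduced reference. Below is a rough sketch of such a filter in plain awk, assuming a text file with one contig name to keep per line; the function name filter_vcf_contigs and the file names in the usage note are hypothetical. bcftools can do the same kind of subsetting, which is preferable if it is available; this version only illustrates the idea.

```shell
# Keep only VCF records (and ##contig header lines) whose contig
# is listed in the keep file. Reads an uncompressed VCF on stdin
# and writes the filtered VCF to stdout.
# Argument: $1 = text file with one contig name per line
filter_vcf_contigs() {
    awk -F'\t' -v keep="$1" '
        BEGIN {
            # Load the contig names to keep.
            while ((getline line < keep) > 0) {
                if (line != "") wanted[line] = 1
            }
            close(keep)
        }
        /^##contig=<ID=/ {
            # Extract the ID from lines such as ##contig=<ID=1,length=249250621>
            id = $0
            sub(/^##contig=<ID=/, "", id)
            sub(/[,>].*/, "", id)
            if (id in wanted) print
            next
        }
        /^#/ { print; next }        # pass all other header lines through
        ($1 in wanted) { print }    # keep records on wanted contigs only
    '
}
```

For example: gzip -dc input.vcf.gz | filter_vcf_contigs keep.txt > filtered.vcf, followed by compressing and indexing the result again before upload. The sites on the dropped contigs are lost, which is why this remains the suboptimal option.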