Sanger Imputation Service - The input file sanity check failed : faidx_fetch_seq failed at JH636052.4:4111085
5.8 years ago
sandKings ▴ 40

I am trying to impute a genotype dataset using the Sanger Imputation Service. The VCF file was created with the GATK RNA-seq variant calling workflow. I've worked through a number of errors, but now I'm stuck at:

--- Aborted Job --- The input file sanity check failed, "bcftools norm -ce" exited with the following message: [E::faidx_fetch_seq] The sequence "JH636052.4" not found

faidx_fetch_seq failed at JH636052.4:4111085

My RNASeq data is mapped to GRCh37 and the vcf files are zipped and indexed. When I run +fixref on the vcf file,

$ bcftools +fixref input.vcf.gz -- -f /GRCh37.p13.genome.fa

I get the following report:

# SC, guessed strand convention
SC  TOP-compatible  0
SC  BOT-compatible  0
# ST, substitution types
ST  A>C 601 3.2%
ST  A>G 3641    19.2%
ST  A>T 432 2.3%
ST  C>A 620 3.3%
ST  C>G 833 4.4%
ST  C>T 3340    17.6%
ST  G>A 3361    17.7%
ST  G>C 871 4.6%
ST  G>T 643 3.4%
ST  T>A 423 2.2%
ST  T>C 3560    18.8%
ST  T>G 611 3.2%
# NS, Number of sites:
NS  total           19864
NS  ref match       18936   100.0%
NS  ref mismatch    0   0.0%
NS  skipped         928
NS  non-ACGT        0
NS  non-SNP         925
NS  non-biallelic   3
5.8 years ago
Michael 54k

This looks like a genome version mismatch. In these situations it helps to look up the sequence in GenBank. It looks like one version of the genome in your pipeline contains the patch sequence JH636052.4, while the reference used by the other tool contains a different version of it or does not contain it at all. Simply grep -e 'JH636052.4' on _your_ FASTA file. Even if you used GRCh37 in both pipelines, the two references may still be at different patch levels. Another possibility is that one assembly contains only the reference sequences assigned to chromosomes.
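
A rough sketch of that check, extended to every contig in the VCF rather than just JH636052.4. The function names are made up for illustration, and it assumes the sequence name is the first word of each FASTA header line and that the VCF is fed in uncompressed on stdin:

```shell
# Print the sequence names of a FASTA file: header text up to the first whitespace.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u
}

# Print the distinct CHROM values of the VCF records read from stdin.
vcf_contigs() {
    grep -v '^#' | cut -f1 | sort -u
}

# Print VCF contigs (stdin) that have no identically named sequence in the FASTA ($1).
contigs_missing_from_fasta() {
    names=$(mktemp)
    fasta_names "$1" > "$names"
    # -x: whole-line match, -F: fixed strings, -v: keep names NOT found in the FASTA
    vcf_contigs | grep -vxF -f "$names"
    rm -f "$names"
}
```

With the files from the question this would be run as `gzip -cd input.vcf.gz | contigs_missing_from_fasta /GRCh37.p13.genome.fa`. An empty result only says that this particular FASTA has all the contigs; the same check would have to be repeated against whatever reference the other tool uses, if you can get hold of it.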

Btw: any reason why you are not using the latest genome-build?

Solution: update your pipeline to the latest genome build that is supported by all tools. Suboptimal solution: first make sure you are using the same genome build everywhere, otherwise your results will be nuts anyway. If all tools use the same reference version but some of them exclude part of the unplaced sequences, reduce _your_ reference sequence to the intersection of the sequences available in all of them, or to the chromosomes only (not that good).
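
For the "only chromosomes" fallback, the same restriction can be applied to the VCF before upload. A minimal sketch with plain awk, where the function name is hypothetical and the kept set (1-22 and X, with or without a chr prefix) is an assumption, not something the service is confirmed to require:

```shell
# Keep header lines and records on chromosomes 1-22 and X; drop patch and unplaced contigs.
# Reads an uncompressed VCF on stdin and writes the filtered VCF to stdout.
keep_primary_chromosomes() {
    awk -F'\t' '
        # True if the contig name, ignoring an optional "chr" prefix, is 1-22 or X.
        function primary(name) {
            sub(/^chr/, "", name)
            return name ~ /^([1-9]|1[0-9]|2[0-2]|X)$/
        }
        # Keep ##contig header lines only for the retained chromosomes.
        /^##contig=/ {
            if (match($0, /ID=[^,>]+/) && primary(substr($0, RSTART + 3, RLENGTH - 3))) print
            next
        }
        # All other header lines pass through unchanged.
        /^#/ { print; next }
        # Records: keep those whose CHROM column is a primary chromosome.
        primary($1) { print }
    '
}
```

Something like `gzip -cd input.vcf.gz | keep_primary_chromosomes > primary.vcf` would then give a file without JH636052.4, which still has to be compressed and indexed again. If bcftools is at hand, `bcftools view -t` with a chromosome list is probably the more usual way to do this. Note that the variants on the dropped contigs are lost, which is why this is the less good option.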


Hi Michael, thank you so much for replying. I am using GRCh37 because the Sanger Imputation Service requires coordinates on GRCh37. I had everything mapped to GRCh38, but that would have meant lifting the coordinates over, which I didn't want to do.

Where can I find the final build for GRCh37? Would you recommend that I use the GRCh37 primary assembly and the comprehensive gene annotation from GENCODE?

