Question: Is there a FASTA file containing 1000 Genomes variant information to check reference alleles in VCF files?
gravatar for rodd
18 months ago by
London, United Kingdom
rodd130 wrote:

Hi folks!

I've imputed genotype data using the Michigan Imputation Server (MIS), using the 1000 Genomes Phase 1 panel (not many errors were found, according to MIS). After (and before) imputation, I wanted to perform a sanity check by running [], to make sure the ref/alt alleles in my data were consistent with 1000G Phase1 data. This analysis revealed several inconsistent reference sites, when comparing to this fasta file from 1000G Phase1. Upon close inspection, I noticed that the reference alleles for several SNPs which were "supposedly inconsistent" in my vcf were actually consistent with the data in UCSC Browser, suggesting me that I was using the wrong fasta file as reference for I saw that this person also had a similar issue, but I could not find an answer regarding which fasta file I should use as reference for (or for other tools, like "bcftools norm --check-ref").

I found this link which says that 1000 Genomes doesn't provide fasta files containing variant information, so the file I used as reference for was not right in the first place. Any clue anyone?

ADD COMMENTlink modified 18 months ago by chrchang5237.4k • written 18 months ago by rodd130
gravatar for chrchang523
18 months ago by
United States
chrchang5237.4k wrote:

I've had good results following the guidance in . I'd guess that either of the first two fasta files listed there should work; let me know if that isn't the case.

ADD COMMENTlink written 18 months ago by chrchang5237.4k

@rodd I am also facing same issue, as per @ chrchang523 reply I had tried using reference from "" but still the problem persist.

I have some doubts in understanding the report, could anyone please anyone please help me the understand what exactly going wrong.

Things I had done using

Files used

  1. Reference file: GRCh38_full_analysis_set_plus_decoy_hla.fa (This same reference file used for BWA mem alignment)
  2. Sample1_chr22.vcf.gz (vcf version: VCFv4.2)

Command used:

---1st command--
script was not working in python3
python2.7 ../ -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Samples_datatstore_chr22.vcf

--------------- ACTION ITEM ---------------
Please use the following command to clean your VCF file and then re-run
(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file

---2nd command--

grep ^"#" Samples_datatstore_chr22.vcf; grep -v ^"#" Samples_datatstore_chr22.vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n >Modified_chr22.vcf

Note: VCF header and the columns of a VCF (header, and is tab separated into 8 mandatory columns and sample name) was absent in Modified_chr22.vcf generated.

---3nd command--

python2.7 ../ -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Modified_chr22.vcf


Line [ 58801 ] does not have correct column number, exiting!
Line [ 58802 ] does not have correct column number, exiting!
Line [ 58803 ] does not have correct column number, exiting!
Line [ 58804 ] does not have correct column number, exiting!

Line [ 265558 ] does not have correct column number, exiting!

--------------- REPORT ---------------
Total [ 265558 ] lines processed
01-09 13:56 Examine [ 0 ] VCF header lines, [ 265558 ] variant sites, [ 0 ] samples
01-09 13:56 [ 0 ] duplicated sites
01-09 13:56 [ 50050 ] NonSNP site are outputted to [ chr22/chr22_test.check.nonSnp ]
01-09 13:56 [ 215508 ] Inconsistent reference sites are outputted to [ chr22/chr22_test.check.ref ]
01-09 13:56 [ 0 ] Variant sites with invalid genotypes are outputted to [ chr22/chr22_test.check.geno ]
01-09 13:56 [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ chr22/ ]
01-09 13:56 [ 0 ] Monomorphic sites are outputted to [ chr22/chr22_test.check.mono ]
01-09 13:56 --------------- ACTION ITEM ---------------
01-09 13:56 * Read chr22/chr22_test.check.ref, for autosomal sites, make sure the you are using the forward strand 01-09 13:56 * Upload these files to the ftp: chr22/chr22_test.check.log chr22/chr22_test.check.dup chr22/chr22_test.check.noSnp chr22/chr22_test.check.ref chr22/chr22_test.check.geno chr22/ chr22/chr22_test.check.mono


  1. what is this Inconsistent mean? I understand that, Inconsistent mean the sites do not match to reference site. If so, according to report only 50050 reference sites are matched to query set?
  2. According to report, if the samples is aligned only very small part to reference then why the samtools flagstat showed "mapped (96.72%)" for particular sample?

Thank you for your time.

ADD REPLYlink modified 10 months ago • written 10 months ago by Nitha10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2352 users visited in the last hour