Question: Is there a FASTA file containing 1000 Genomes variant information to check reference alleles in VCF files?
1
gravatar for rodd
12 months ago by
rodd100
London, United Kingdom
rodd100 wrote:

Hi folks!

I've imputed genotype data using the Michigan Imputation Server (MIS), using the 1000 Genomes Phase 1 panel (not many errors were found, according to MIS). After (and before) imputation, I wanted to perform a sanity check by running checkVCF.py [https://github.com/zhanxw/checkVCF], to make sure the ref/alt alleles in my data were consistent with 1000G Phase1 data. This analysis revealed several inconsistent reference sites, when comparing to this fasta file from 1000G Phase1. Upon close inspection, I noticed that the reference alleles for several SNPs which were "supposedly inconsistent" in my vcf were actually consistent with the data in UCSC Browser, suggesting me that I was using the wrong fasta file as reference for checkVCF.py. I saw that this person also had a similar issue, but I could not find an answer regarding which fasta file I should use as reference for checkVCF.py (or for other tools, like "bcftools norm --check-ref").

I found this link which says that 1000 Genomes doesn't provide fasta files containing variant information, so the file I used as reference for checkVCF.py was not right in the first place. Any clue anyone?

ADD COMMENTlink modified 11 months ago by chrchang5236.9k • written 12 months ago by rodd100
1
gravatar for chrchang523
11 months ago by
chrchang5236.9k
United States
chrchang5236.9k wrote:

I've had good results following the guidance in http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use . I'd guess that either of the first two fasta files listed there should work; let me know if that isn't the case.

ADD COMMENTlink written 11 months ago by chrchang5236.9k

@rodd I am also facing same issue, as per @ chrchang523 reply I had tried using reference from "http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use" but still the problem persist.

I have some doubts in understanding the checkVCF.py report, could anyone please anyone please help me the understand what exactly going wrong.

Things I had done using checkVCF.py

Files used

  1. Reference file: GRCh38_full_analysis_set_plus_decoy_hla.fa (This same reference file used for BWA mem alignment)
  2. Sample1_chr22.vcf.gz (vcf version: VCFv4.2)

Command used:

---1st command--
script was not working in python3
python2.7 ../checkVCF.py -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Samples_datatstore_chr22.vcf

--------------- ACTION ITEM ---------------
Please use the following command to clean your VCF file and then re-run checkVCF.py
(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file

---2nd command--

grep ^"#" Samples_datatstore_chr22.vcf; grep -v ^"#" Samples_datatstore_chr22.vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n >Modified_chr22.vcf

Note: VCF header and the columns of a VCF (header, and is tab separated into 8 mandatory columns and sample name) was absent in Modified_chr22.vcf generated.

---3nd command--

python2.7 ../checkVCF.py -r ../GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -o chry ../../Modified_chr22.vcf

Output:


Line [ 58801 ] does not have correct column number, exiting!
Line [ 58802 ] does not have correct column number, exiting!
Line [ 58803 ] does not have correct column number, exiting!
Line [ 58804 ] does not have correct column number, exiting!

Line [ 265558 ] does not have correct column number, exiting!

--------------- REPORT ---------------
Total [ 265558 ] lines processed
01-09 13:56 Examine [ 0 ] VCF header lines, [ 265558 ] variant sites, [ 0 ] samples
01-09 13:56 [ 0 ] duplicated sites
01-09 13:56 [ 50050 ] NonSNP site are outputted to [ chr22/chr22_test.check.nonSnp ]
01-09 13:56 [ 215508 ] Inconsistent reference sites are outputted to [ chr22/chr22_test.check.ref ]
01-09 13:56 [ 0 ] Variant sites with invalid genotypes are outputted to [ chr22/chr22_test.check.geno ]
01-09 13:56 [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ chr22/chr22_test.check.af ]
01-09 13:56 [ 0 ] Monomorphic sites are outputted to [ chr22/chr22_test.check.mono ]
01-09 13:56 --------------- ACTION ITEM ---------------
01-09 13:56 * Read chr22/chr22_test.check.ref, for autosomal sites, make sure the you are using the forward strand 01-09 13:56 * Upload these files to the ftp: chr22/chr22_test.check.log chr22/chr22_test.check.dup chr22/chr22_test.check.noSnp chr22/chr22_test.check.ref chr22/chr22_test.check.geno chr22/chr22_test.check.af chr22/chr22_test.check.mono

Doubts:

  1. what is this Inconsistent mean? I understand that, Inconsistent mean the sites do not match to reference site. If so, according to checkVCF.py report only 50050 reference sites are matched to query set?
  2. According to report, if the samples is aligned only very small part to reference then why the samtools flagstat showed "mapped (96.72%)" for particular sample?

Thank you for your time.

ADD REPLYlink modified 4 months ago • written 4 months ago by Nitha10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1831 users visited in the last hour