Getting a VCF file from the 1000 Genomes Project with rsIDs
13 months ago
Decedious • 0

Hello,

I am quite new to programming and bioinformatics. I am trying to access some VCF files from the 1000 Genomes Project so I can follow along with a YouTube tutorial (OMGenomics) and do some analysis. I would also like to learn how to use the Ensembl API later on, so I would like the VCF file to include rsIDs. The problem is that the current build is GRCh38, and I was not able to find a VCF file with both sample genotype data and rsIDs for the latest build.

I couldn't really check the high-coverage data (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/) because of its massive size.

I thought maybe I could use the GRCh37 VCF files, but with v5b (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) they removed the rsID values and said the mapping was available in Ensembl. So I looked at the Ensembl files (http://ftp.ensembl.org/pub/grch37/release-105/variation/vcf/homo_sapiens/). They seem to contain more SNPs than the v5b VCF file I got from the 1000 Genomes website, so I am not sure whether I have the right data or whether it has to be mapped differently. I could merge the two so that the positions and alleles match and then add the IDs, but I am not sure whether there would be conflicts. I did manage to find a v5a release (https://ftp.ncbi.nih.gov/1000genomes/ftp/release/20130502/) on the NCBI site, but when I tried to look up some of its rsIDs in the browser, I got either no matches or mismatched information such as a different position, which I am guessing is due to updates.
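As a rough illustration of the merge described above, here is a minimal Python sketch that copies IDs from an ID-bearing VCF onto genotype records by matching chromosome, position, REF, and ALT. The function names `build_id_lookup` and `add_ids` are hypothetical, and the sketch assumes both files use the same build, the same chromosome naming, and the same allele representation (indels in particular may be written differently in the two sources). To sidestep conflicts, it leaves the ID as `.` whenever the matching alleles point to more than one rsID.

```python
def build_id_lookup(annotation_lines):
    """Map (chrom, pos, ref, alt) -> ID from the lines of an ID-bearing VCF."""
    lookup = {}
    for line in annotation_lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        chrom, pos, vid, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            # keep the first ID seen for each allele-level key
            lookup.setdefault((chrom, pos, ref, alt), vid)
    return lookup


def add_ids(genotype_lines, lookup):
    """Yield genotype VCF lines, filling a missing ID when the match is unambiguous."""
    for line in genotype_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, vid, ref, alts = fields[:5]
        if vid == ".":
            keys = [(chrom, pos, ref, alt) for alt in alts.split(",")]
            ids = {lookup[k] for k in keys if k in lookup}
            if len(ids) == 1:  # no match or conflicting IDs: leave "."
                fields[2] = ids.pop()
        yield "\t".join(fields) + "\n"
```

Both functions take any iterable of text lines, so they can be fed from files opened with `gzip.open(path, "rt")`. Holding a whole chromosome of annotations in a dictionary is fine for chr21 but may not scale to the full genome; dedicated tools such as `bcftools annotate` are designed for this kind of task.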

What I ideally want is a VCF file for chr21 (since it is the smallest) with sample genotypes and rsIDs, preferably on GRCh38 or else the latest GRCh37 release that matches Ensembl.

If I am making any mistakes when choosing the files, please let me know.

Thanks for the help

1000 vcf genome grch38 grch37 rsid • 664 views

The file 1000GENOMES-phase_3.vcf.gz, which contains the variants from 1000 Genomes, is available from Ensembl at this link: https://ftp.ensembl.org/pub/release-105/variation/vcf/homo_sapiens/

However, if you are looking for variants from a particular chromosome, note that the per-chromosome VCF files contain variants from multiple sources, including 1000 Genomes. Since you are only interested in variants from 1000 Genomes, you could filter those files to keep only the lines that contain E_1000G in the INFO column.
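A minimal Python sketch of that filter, assuming `E_1000G` appears as a standalone flag among the semicolon-separated entries of the INFO column. The function names `filter_1000g` and `filter_vcf` and any file paths are hypothetical, not part of any Ensembl tooling.

```python
import gzip


def filter_1000g(lines, flag="E_1000G"):
    """Yield header lines plus the records whose INFO column carries `flag`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta and header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # INFO is the 8th VCF column; its entries are separated by ';'
        if len(fields) >= 8 and flag in fields[7].split(";"):
            yield line


def filter_vcf(in_path, out_path, flag="E_1000G"):
    """Stream a gzipped VCF and write only the matching records."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_1000g(src, flag))
```

Note that the output is ordinary gzip rather than bgzip, so it would need recompressing before it can be indexed with tabix.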


Thank you,

Does the 1000GENOMES-phase_3.vcf.gz file contain the sample genotype data as well?

I tried the individual chromosome VCF file, and unfortunately it does not contain any sample data. I don't quite understand the versions of the different files, so I am not sure how to map it to v5b from the FTP site (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/), which does contain sample genotype data but is on reference build 37.