Question: Missing SNPs for HBA1/HBA2 in 1000 Genomes data?
Tarbem wrote:

Hey guys,

I wanted to extract the SNPs called for the HBA1 and HBA2 genes in the 1000 Genomes Project. However, these two genes appear to have no SNPs at all: neither missense nor synonymous (same-sense) ones.

I cross-checked on Ensembl (the 1000 Genomes browser), and dbSNP reports about 500 non-synonymous SNPs in the exonic region of HBA1.

What are the odds that 629 people in the 1000 Genomes Project happen to have exactly the same coding sequence for HBA1? Very low, I would guess.

Am I missing something obvious?

P.S. Laura, from the 1000 Genomes Project, provided some explanation on a similar issue regarding missing genotypes here: http://biostar.stackexchange.com/questions/9550/why-so-many-missing-genotypes-in-1000-genomes-data

But I think the situation in this post is quite different, and I would appreciate any ideas about why the SNPs are missing.

Thanks

genome dbsnp snp • 2.2k views
modified 8.6 years ago by lh3 • written 9.1 years ago by Tarbem
lh3 wrote:

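Because HBA1 and HBA2 are nearly identical in their coding regions, short reads from those regions align equally well to both genes and cannot be placed with confidence. You cannot do much about that with short reads.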
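To make the point concrete, here is a minimal sketch in Python. The two sequences are made-up toy strings, not the real HBA1/HBA2 sequences, and `unique_read_fraction` is a hypothetical helper name. It counts how many read-length windows of one "gene" cannot also be found in the other; only those reads could be assigned unambiguously.

```python
def unique_read_fraction(seq_a: str, seq_b: str, read_len: int) -> float:
    """Fraction of read-length windows of seq_a that occur nowhere in seq_b."""
    windows_b = {seq_b[i:i + read_len] for i in range(len(seq_b) - read_len + 1)}
    windows_a = [seq_a[i:i + read_len] for i in range(len(seq_a) - read_len + 1)]
    unique = sum(1 for w in windows_a if w not in windows_b)
    return unique / len(windows_a)


# Toy paralogs (assumption: invented sequences that differ at a single base).
gene_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCTAGGATCCGATT"
alt_base = "A" if gene_a[20] != "A" else "C"
gene_b = gene_a[:20] + alt_base + gene_a[21:]

frac = unique_read_fraction(gene_a, gene_b, read_len=10)
print(f"Reads from gene_a that are distinguishable from gene_b: {frac:.0%}")
```

Only reads that overlap the single differing base are informative; every other read matches both copies exactly, so an aligner has no basis for choosing between them.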

written 9.1 years ago by lh3
Jorge Amigo wrote:

I must say that I first tried to reproduce your problem in the 1000 Genomes browser by searching for HBA1, and indeed I didn't see the expected variation. Then I realized that the track I was looking at corresponded to the 20100804 data, which is what Laura described, and not the latest release. In fact, this is the note on the browser's welcome page:

The 1000 Genomes Browser

Ensembl-based browser provides early access to 1000genomes data

In order to facilitate immediate analysis of the 1000genomes data by the whole scientific community, this browser (based on Ensembl) integrates the SNP calls from the August 2010 release. This data will be submitted to dbSNP, and once rsid's have been allocated, will be absorbed into the UCSC and Ensembl browsers according to their respective release cycles. Until that point any non rs SNP id's on this site are temporary and will NOT be maintained.

I really can't give any advice other than to look on the 1000 Genomes website for this information. Since I haven't found a way to search for it there, I can only suggest digesting their raw data, as we did. To save time, you may want to have a look at the raw genotypes we processed from this latest release. (An interesting note for any BioStar reader: there are only bi-allelic markers because their genotype caller limits it. We have asked the project to include a note in the readme file to clarify this.)

If you go to our ENGINES tool, search for HBA1 and HBA2, and select all 14 available populations, you will end up looking at 26 variants: 20 of them are also in dbSNP132 and 6 are new, and most of them have very low MAF values (19 of them are below 0.1). Although this is not as many as the 500 sites you were expecting, I really hope this result helps in some way.
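For anyone who wants to check MAF values like these on their own, this is a rough sketch of how a minor allele frequency can be computed from bi-allelic genotype strings. It is not the ENGINES code; `minor_allele_frequency` is a hypothetical helper, and the genotypes below are invented for illustration.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for a bi-allelic site given VCF-style GT strings.

    Accepts phased ('0|1') or unphased ('0/1') genotypes; missing
    alleles ('.') are skipped. Returns None if no allele was called.
    """
    ref_count = 0
    alt_count = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref_count += 1
            elif allele == "1":
                alt_count += 1
    total = ref_count + alt_count
    if total == 0:
        return None
    return min(ref_count, alt_count) / total


# Invented genotypes for four samples at one site.
example = ["0|0", "0|0", "0|1", "0|0"]
print(minor_allele_frequency(example))
```

Because the minor allele is simply the rarer of the two, the function takes the smaller of the two counts rather than assuming the alternate allele is always the minor one.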

written 9.1 years ago by Jorge Amigo

Hey Jorge,

Thanks for your reply, it was very helpful. (I did not know about the bi-allelic markers.)

I parsed the following file: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat/ds_flat_ch16.flat.gz (HBA1 resides on chr16)

... and did not hit any variation at genomic regions corresponding to exons for HBA1.

How exactly did you recover those 26 variants? Did you guys parse a different file?

written 9.1 years ago by Tarbem

Indeed we did, Tarbem. I thought you were referring to the 1000 Genomes data, so the files I understood you were interested in were those on the project's FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/ By the way, although they have been placed in the 20101123 folder, they are from the May 2011 release. A little bit confusing, I guess.
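In case it helps, here is a minimal sketch of pulling the variants for a gene region out of VCF text. This is not the exact pipeline we used. `variants_in_region` is a hypothetical helper, the toy records and the region 200-500 are invented, and the real gene coordinates must be looked up for the genome build the release uses.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Yield (chrom, pos, id, ref, alt) for VCF records within [start, end].

    Coordinates are 1-based and inclusive, as in the VCF POS column.
    Header lines (starting with '#') and malformed lines are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        rec_chrom, pos, var_id, ref, alt = fields[:5]
        if rec_chrom == chrom and start <= int(pos) <= end:
            yield rec_chrom, int(pos), var_id, ref, alt


# Toy VCF content (assumption: invented records, not 1000 Genomes data).
toy_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "16\t100\trsA\tG\tA\t.\tPASS\t.\n",
    "16\t250\t.\tC\tT\t.\tPASS\t.\n",
    "16\t900\trsB\tT\tC\t.\tPASS\t.\n",
    "17\t250\trsC\tA\tG\t.\tPASS\t.\n",
]

hits = list(variants_in_region(toy_vcf, "16", 200, 500))
print(hits)
```

For a real gzipped VCF, the same function should work on a file handle opened with `gzip.open(path, "rt")`. Check whether the file names the chromosome `16` or `chr16` before filtering.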

written 9.1 years ago by Jorge Amigo

The release directories are named for the sequence release the data is based on rather than the date they are released on.

You can see SNP tracks coloured by consequence from VCF files using the "attach remote file" option from "manage your data", so you can attach the VCF files from the 20101123 release.

written 9.1 years ago by Laura