Question

Integrate Data From Affy6.0 And Whole Genome Sequence

1

12.9 years ago

Lds ▴ 450

I have three SNP data sets: hapmapCHBb37.txt, from SNP calling on the original HapMap .CEL files; kgCHBb37.txt, from the 1000 Genomes VCF files; and sampleCHBb37.txt, from our own whole-genome sequencing data (15x, 50 samples). There is some discordance among the three. Taking the HapMap data as the reference, kgCHBb37.txt contains some allele errors (A/G vs. C/G) and some reversed alleles (A/G vs. C/T). So I extracted the samples shared by hapmapCHBb37.txt and kgCHBb37.txt and corrected the alleles (for A/T and C/G SNPs I had to use the allele frequency), which gave me an allele correction file, correct.snp. I checked correct.snp against the original hapmapCHBb37.txt and kgCHBb37.txt, and the result is good. But when I use correct.snp to correct the alleles in sampleCHB_b37.txt, the result is very bad. I'm wondering:

1> What's the best strategy to integrate SNPs from whole-genome sequencing and Affy6.0?

2> Maybe the quality control of sampleCHB_b37 is not good. I'm using Bowtie (-k 2 -v 2) to get SAM files from the FASTQ files (reference genome: hg19.fasta), then converting SAM to BAM, sorting the BAM, converting the BAM to BCF (D2000), and converting the BCF to VCF with samtools. Is this workflow reliable for calling SNPs from FASTQ?

exome sequencing bowtie samtools • 2.6k views

updated 12.7 years ago by Liyf ▴ 300 • written 12.9 years ago by Lds ▴ 450

0

Each person has their own SNPs, so the results from different samples' sequencing data will differ. I have a lot of genotype data and, compared with HapMap, there is some discordance there too. So do not worry.

12.5 years ago by Liyf ▴ 300

score 1 · Answer 1 · 2011-08-02

It is generally thought that everything stored in dbSNP, HapMap, or even the recent 1000 Genomes is not only useful but also completely trustworthy. You should never forget that all the publicly available data sets, although very valuable, are just references for your experiments. You can never take them as dogma, only as useful (very useful, in fact) templates for interpreting your own data.

There are several reasons why your own experiment may not end up with exactly the same results: obvious laboratory issues, different populations used, population stratification in your samples, a particular trait being overrepresented, ... It is not rare at all to find that the allele frequencies found in your lab do not correlate properly with those of reference repositories such as HapMap, but once you know the possible causes you can start deciding whether or not to trust your results.

Regarding inverted alleles, this is not as rare as you may think. Both 1000 Genomes and your experiment use sequencing, but HapMap used genotyping. Genotyping techniques depend on probe design, which sometimes has to target the complementary strand (for instance, because the binding affinities are compromised). So you may be able to merge the data easily if you pay attention to the strand sign in HapMap's data, since almost certainly all the variants you are testing against are reported on the forward strand. Although this is not the only reason, it can be very helpful for bulk comparisons.
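As a rough illustration of the strand check described above, here is a minimal Python sketch that classifies how a pair of alleles from one data set relates to the reference pair for the same SNP. The function name `classify_alleles` and the category labels are made up for this example and do not come from any existing tool. A/T and C/G SNPs are reported as ambiguous, because they read the same on both strands and the alleles alone cannot tell you anything.

```python
# Complement of each base, used to express an allele on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(ref_pair, other_pair):
    """Classify how other_pair relates to ref_pair.

    Both arguments are 2-tuples of bases such as ("A", "G").
    This helper and its labels are hypothetical, for illustration only.
    """
    ref = tuple(a.upper() for a in ref_pair)
    other = tuple(a.upper() for a in other_pair)
    flipped = tuple(COMPLEMENT[a] for a in other)

    # A/T and C/G SNPs look identical on both strands, so the alleles
    # alone cannot distinguish a strand flip from a true match.
    if set(ref) == {COMPLEMENT[a] for a in ref}:
        return "ambiguous" if set(other) == set(ref) else "mismatch"
    if other == ref:
        return "match"
    if other == ref[::-1]:
        return "swapped"              # same strand, allele order reversed
    if flipped == ref:
        return "strand_flip"          # reported on the complementary strand
    if flipped == ref[::-1]:
        return "strand_flip_swapped"  # complementary strand, order reversed
    return "mismatch"                 # e.g. A/G vs. C/G: needs manual review
```

For example, `classify_alleles(("A", "G"), ("C", "T"))` falls into the strand-flip family, while `classify_alleles(("A", "G"), ("C", "G"))` is a mismatch that no flip can explain.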

Regarding non-matching alleles, I could start talking here about statistical power and sample sizes, but in general let me just say that it is not unbelievable to find alleles in your own experiments that weren't detected previously, especially if the MAF values you are dealing with are very low, since the variant sites may not be biallelic across the entire human population. It is true that it would be very strange for a C/T site with 0.4/0.6 frequencies in HapMap to be detected in your experiment as a G/T site with 0.35/0.65 frequencies, for instance, but it could happen because of the population constraints mentioned above. It is quite complicated to build a programmatic strategy to deal with those cases, so I would suggest first detecting them, then studying them by frequency, and ultimately deciding whether to trust your own experiment's results or the reference's.
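The "detect them, then study them by frequency" step could be sketched along these lines in Python. Everything here is an assumption made for illustration: the function names, the dictionary layout (SNP id mapped to the frequency of the same allele in each data set), and the thresholds `max_diff=0.2` and `min_distance=0.1`, which would need tuning for a sample of 50 individuals. The second helper covers the A/T and C/G case, where only the frequency can hint at a strand flip, and it refuses to guess when either frequency is close to 0.5.

```python
def flag_discordant(ref_freqs, own_freqs, max_diff=0.2):
    """Return (snp_id, difference) for shared sites whose allele
    frequency differs by more than max_diff, largest difference first.

    ref_freqs / own_freqs: dict of SNP id -> frequency of the SAME allele.
    The threshold is an arbitrary illustrative default.
    """
    flagged = []
    for snp_id in sorted(ref_freqs.keys() & own_freqs.keys()):
        diff = abs(ref_freqs[snp_id] - own_freqs[snp_id])
        if diff > max_diff:
            flagged.append((snp_id, round(diff, 3)))
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged


def resolve_palindromic(ref_freq, other_freq, min_distance=0.1):
    """Guess whether an A/T or C/G SNP needs a strand flip.

    Compares the reference frequency with the other data set's frequency
    as given and as (1 - frequency), i.e. assuming the strand is flipped.
    """
    # Near 0.5 the two orientations are indistinguishable.
    if abs(ref_freq - 0.5) < min_distance or abs(other_freq - 0.5) < min_distance:
        return "unresolved"
    same = abs(ref_freq - other_freq)
    flipped = abs(ref_freq - (1.0 - other_freq))
    return "keep" if same <= flipped else "flip"
```

Sites returned by `flag_discordant` are candidates for manual inspection rather than automatic correction, in line with the point above that a fully programmatic strategy is hard to get right.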

score 0 · Answer 2 · 2011-10-13

0

12.5 years ago

Liyf ▴ 300

I have indeed found some reversed alleles in the CHB+JPT 1000 Genomes data. Do you still have the correct.snp file? I'd like to save myself the work. Thanks.
