Question: Missing SNPs from BAM file
3 months ago, imbiling wrote:

First post here, complete newbie. I'm not sure if this is the relevant place to ask this, so please excuse me if I'm wrong.

I have had WGS done at Dante Labs and imported the BAM file (100 GB) into Sequencing. The import into Promethease is still pending. I would like to be able to explore "my own" configuration of all SNPs known to date. For example, if I'm reading an article that mentions rs10994415, rs1006737 and other SNPs, I expect to be able to paste an ID in, select the "Variant" column, and see my genotype. However, some SNPs are present and others aren't.

What am I missing? Initially I had a .VCF.GZ file, which was supposed to be the "difference" between my genome and the reference genome. My assumption was that the missing SNPs were missing because they were the same as the reference genome and hence not shown in Sequencing/Promethease. That's why I requested the BAM file, expecting it to be "my own genome aligned to the reference genome", so that all possible SNPs would be included.

Please excuse my lack of knowledge. I'm trying to deal with something outside my domain of expertise, and for weeks I have been unable to understand what's going on.

11 weeks ago, Amar wrote:

SNPs, as single nucleotide variants, are by definition differences between the sample and the reference. I don't know how that site calls variants such as SNPs, but chances are that if the SNP is missing from the variant section on that site, then that variant is not present in your genome (your genome matches the reference used at that position).

The BAM file contains all the information about your DNA, aligned to a reference. Which variants get called or inspected, and how they are represented, is what matters. If you know the location of the SNP (chromosome and position in that particular reference), you can view the data manually using software such as IGV, but you need the same reference that was used for the alignment.

I would recommend you pay someone to analyse your data properly.

4 days ago, imbiling wrote:

Thank you. Let me elaborate.

Importing the BAM file in Sequencing/Promethease (Promethease has since deprecated BAM file support) didn't yield any additional SNPs. For example "rs1006737" was missing. I am not sure a missing SNP means that my genotype matches the reference genome. This doesn't make sense. If I'm not mistaken, the reference genome could have "bad SNPs" at some positions as well. Isn't this right? It is not a "perfect human being", but a random one.

Further research revealed that SNPs are not readily available from the BAM file. They should be obtained from a process, called "variant calling" which is quite time consuming. In theory Dante Labs's two VCFs (SNPs and INDEL) combined should feed all the data one online service could possibly need to show all possible SNPs. Is this correct?

This post roughly follows my train of thought.

The proposed solution was to do "variant calling" from the BAM file and to produce a single VCF with both SNPs and INDELs combined which fed to any online service would yield all possible results. Am I missing something?

So I started the process. The BAM file is aligned to GRCh37/hg19 and sorted. I started variant calling using GATK. It will take a couple of days on my machine. It is supposed to generate a VCF file with all SNPs and INDELs. In theory when I upload this file to sequencing or promethease and I search for example for "rs1006737" (take a look in SNPedia) I should be able to see if my genotype is (G;G) or (A;A). Right now this particular SNP (for example) is missing. How should I know my genotype for it?
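To make the lookup described above concrete, here is a minimal Python sketch of what "search the VCF for rs1006737 and show the genotype" amounts to. It assumes a single-sample, tab-separated VCF; the function name `genotype_for_rsid`, the example records, and their positions are invented for illustration and are not real coordinates or anyone's real data. One caveat it exposes: the search only works if the ID column actually holds rsIDs, and a freshly called VCF may have `.` there unless it was annotated against dbSNP.

```python
# Hypothetical single-sample VCF body; positions and genotypes are made up.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "12\t1000\trs1006737\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30",
    "10\t2000\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:25",  # variant without an rsID
]


def genotype_for_rsid(vcf_lines, rsid):
    """Return the sample's alleles for `rsid`, or None if the ID is not listed."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        ids, ref, alt = fields[2], fields[3], fields[4]
        fmt, sample = fields[8], fields[9]
        if rsid not in ids.split(";"):
            continue
        # GT holds indexes into [REF, ALT1, ALT2, ...]; "." means no call.
        alleles = [ref] + alt.split(",")
        gt = sample.split(":")[fmt.split(":").index("GT")]
        calls = gt.replace("|", "/").split("/")
        return tuple("." if c == "." else alleles[int(c)] for c in calls)
    return None


print(genotype_for_rsid(EXAMPLE_VCF, "rs1006737"))   # ('G', 'A')
print(genotype_for_rsid(EXAMPLE_VCF, "rs10994415"))  # None
```

Note that `None` here is ambiguous: it can mean the site matches the reference, the site was never called, or the record exists but carries no rsID. A variants-only VCF cannot tell these cases apart by itself.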

In SNPedia its location (chromosome/position) is given relative to GRCh38/hg38, which will differ from the coordinates I see when I look at my BAM file in IGV, because the BAM was aligned to GRCh37/hg19.
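Because the same SNP has different coordinates in different builds, it helps to confirm which build a file uses before comparing positions. One rough way, sketched below in Python, is to read the `##contig` header lines of a VCF and compare the stated length of chromosome 1 with the known lengths for GRCh37 (249,250,621 bp) and GRCh38 (248,956,422 bp). The function name `infer_build` is invented here, and the sketch assumes the header actually lists contig lengths, which not every file does.

```python
import re

# Length of chromosome 1 in each assembly.
CHR1_LENGTHS = {
    249250621: "GRCh37/hg19",
    248956422: "GRCh38/hg38",
}


def infer_build(header_lines):
    """Guess the reference build from the chr1 length in ##contig header lines."""
    for line in header_lines:
        if not line.startswith("##contig=<"):
            continue
        id_match = re.search(r"[<,]ID=([^,>]+)", line)
        len_match = re.search(r"[<,]length=(\d+)", line)
        if not (id_match and len_match):
            continue
        # Chromosome 1 is named "1" or "chr1" depending on the reference flavour.
        if id_match.group(1) in ("1", "chr1"):
            return CHR1_LENGTHS.get(int(len_match.group(1)), "unknown")
    return "unknown"


header = ["##fileformat=VCFv4.2", "##contig=<ID=1,length=249250621>"]
print(infer_build(header))  # GRCh37/hg19
```

If the build of the file and the build of the published coordinates disagree, the position has to be converted with a liftover tool (or the SNP looked up by rsID instead) before it can be inspected in IGV.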

Please excuse my lack of knowledge and context. I'm spending countless hours trying to make sense of this, and it is really hard.

Thank you for suggesting that I pay someone, but I need to understand the whole concept myself. I plan to analyze this data a lot in the future, and I need to know how it works. Any support is welcome.


Seems like you should be asking the company these questions.

written 4 days ago by swbarnes2

That's why I am writing a post about how to analyse your genome. I expect hundreds of such posts to appear here. I think even the Dante Labs website explains the difference between VCF and gVCF. So, you are right: either 1) the variant at the position of rs1006737 was not called because of low depth, low mappability, or other quality filters (less likely), or 2) your genome does not differ from the reference at this position, so this SNV is not included in the VCF but should be present in the gVCF.
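The two cases above can be told apart in a gVCF, because it also records stretches that match the reference. A minimal Python sketch follows, assuming GATK-style reference blocks (ALT of `<NON_REF>` and an `END=` entry in INFO) and a single sample; the function name `call_at` and the records are invented for illustration, with made-up positions.

```python
# Hypothetical gVCF body: one reference block, one variant site.
EXAMPLE_GVCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "12\t900\t.\tG\t<NON_REF>\t.\t.\tEND=1100\tGT:DP:GQ\t0/0:30:90",
    "12\t1200\t.\tG\tA,<NON_REF>\t50\t.\t.\tGT:DP\t0/1:28",
]


def call_at(gvcf_lines, chrom, pos):
    """Describe what the gVCF says about one position (1-based)."""
    for line in gvcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[0] != chrom:
            continue
        start = end = int(f[1])
        # A reference block spans POS..END; ordinary records cover POS only here.
        for item in f[7].split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        if not (start <= pos <= end):
            continue
        alleles = [f[3]] + f[4].split(",")
        gt = f[9].split(":")[f[8].split(":").index("GT")]
        idx = gt.replace("|", "/").split("/")
        if "." in idx:
            return "no call"
        if all(i == "0" for i in idx):
            return "matches reference"
        return "variant: " + "/".join(alleles[int(i)] for i in idx)
    return "not covered"


print(call_at(EXAMPLE_GVCF, "12", 1000))  # matches reference
print(call_at(EXAMPLE_GVCF, "12", 1200))  # variant: G/A
print(call_at(EXAMPLE_GVCF, "12", 5000))  # not covered
```

A real check would also look at DP and GQ in the reference block, since a block with very low depth or quality is closer to case 1 (not confidently called) than to a trustworthy reference match.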

written 4 days ago by German.M.Demidov

Hundreds of posts about what to expect in the output of a commercial company? Nope.

written 4 days ago by swbarnes2

That's actually the case. This is just one example: "C: Any user friendly way to find rare mutations in whole genome raw?". There was one more post for sure (I answered there too), but I think it was removed by the author.

modified 4 days ago • written 4 days ago by German.M.Demidov