Question

Noob question about reading SNP results from BAM file visually

5.2 years ago

jack.kingsman ▴ 10

I'm a self-taught beginner, so please be patient with any misunderstandings.

When looking at SNPedia (e.g. https://www.snpedia.com/index.php/Rs4988235), SNPs are identified at a single site as hetero/homozygous. Assumption 1: these sites are from the two copies of the chromosome, from mother and father.

When I'm looking at a BAM file in a visual viewer, how can I identify the results of a site as it pertains to SNPs? As far as I understand, (assumption 2) BAM files' reads at a site should, in an ideal world, converge to one single read at a point. Or have I misunderstood, and a heterozygous call will instead show up as a consistently mixed set of reads (i.e. roughly half and half)?

I totally understand that the usual workflow is to generate a VCF with SNPs called, but I'm curious how to do it without this step (I'm trying to understand what my tools are doing more holistically).

Thanks for your teaching and patience! If you're local to the San Jose CA or Peninsula area, I'd love to buy you a beer and learn! 🙂

SNP sequencing • 2.1k views

updated 5.2 years ago by manuel.belmadani ★ 1.3k • written 5.2 years ago by jack.kingsman ▴ 10

score 1 · Answer 1 · 2019-01-26

Hello and welcome to Biostars, jack.kingsman.

Assumption 1: these sites are from the two copies of the chromosome, from mother and father.

Yes, a variant is located on one (heterozygous) or both (homozygous) copies of the chromosome. And as long as the variant is not the result of a new mutation, it was inherited from the parents.

As far as I understand, (assumption 2) BAM files' reads at a site should, in an ideal world, converge to one single read at a point.

I don't completely understand this sentence. The most widely used NGS technique is called "short-read sequencing". For this you take your DNA sample, which was obtained from many white blood cells, and break the DNA into smaller pieces. These pieces get sequenced. Afterwards each sequenced read is assigned to its genomic location in a process called mapping and alignment.

Because we have many DNA copies in our sample and we break them (more or less) by chance into many pieces, each site of the sequenced region is covered by many reads that might overlap.
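To make the idea of overlapping reads concrete, here is a minimal Python sketch of what a viewer does at one position: it stacks the aligned reads and counts which base each read shows there. The toy reads and the `pileup` helper are made up for illustration, and the sketch assumes gapless alignments (no indels or clipping), which real BAM records do not guarantee.

```python
from collections import Counter

# Toy reads: (0-based start position on the reference, read sequence).
# Hypothetical data; assumes gapless alignments, unlike real BAM records.
reads = [
    (0, "ACGTAC"),
    (2, "GTACGT"),
    (3, "TGCGTA"),
    (4, "ACGTAA"),
]

def pileup(reads, pos):
    """Count the bases that the overlapping reads show at reference position `pos`."""
    counts = Counter()
    for start, seq in reads:
        offset = pos - start
        # Only reads that actually cover `pos` contribute a base.
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    return counts

site = pileup(reads, 4)
print(dict(site), "depth:", sum(site.values()))
```

Here position 4 is covered by all four reads, three showing A and one showing G. That column of bases is what you are looking at when you zoom in on a site in a viewer.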

In an ideal world, you can identify variants visually. If all reads at a given site show the variant, it is present on both chromosomes. If it is present in about half of the reads, the variant is located on one chromosome only.
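The ideal-world rule can be written down as a few lines of Python. This is only a sketch: the function name and the 0.3/0.7 cut-offs are arbitrary choices for illustration, not values used by any real variant caller.

```python
def naive_genotype(ref_count, alt_count, het_low=0.3, het_high=0.7):
    """Guess a genotype from read counts alone (idealized, ignores all errors).

    het_low / het_high are arbitrary illustrative thresholds.
    """
    total = ref_count + alt_count
    if total == 0:
        return "no coverage"
    alt_fraction = alt_count / total
    if alt_fraction < het_low:
        return "homozygous reference"
    if alt_fraction <= het_high:
        return "heterozygous"
    return "homozygous alternate"

print(naive_genotype(20, 0))   # all reads match the reference
print(naive_genotype(11, 9))   # roughly half and half
print(naive_genotype(0, 20))   # all reads show the variant
```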

Unfortunately we don't live in an ideal world. We introduce errors while preparing the sample for sequencing. The sequencing process introduces more errors. And mapping and alignment are not always easy. This is why we need the variant calling step, which results in a VCF file: it tries to take all these possible errors into account.
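As a rough illustration of how a caller can take errors into account, the sketch below compares three diploid genotypes with a simple binomial model. Everything here is a simplifying assumption: reads are treated as independent, only one reference and one alternate base are considered, and the flat 1% per-read error rate is a made-up value. Real variant callers also weigh things like base and mapping qualities.

```python
from math import comb

def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Binomial likelihood of the observed counts under each diploid genotype.

    error_rate is an assumed flat per-read error probability (illustrative only).
    """
    n = ref_count + alt_count
    # Probability that a single read shows the alternate base under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    return {
        genotype: comb(n, alt_count) * p**alt_count * (1 - p)**ref_count
        for genotype, p in p_alt.items()
    }

def most_likely_genotype(ref_count, alt_count, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(ref_count, alt_count, error_rate)
    return max(likelihoods, key=likelihoods.get)

print(most_likely_genotype(10, 10))   # balanced counts
print(most_likely_genotype(160, 9))   # a few stray alternate reads at high depth
```

Under these assumptions a handful of alternate reads among many reference reads is better explained by errors than by a heterozygous site, which is the kind of judgement the variant calling step automates.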

fin swimmer

score 1 · Answer 2 · 2019-01-26

I agree with @finswimmer's post; this is really just more about how I look at these in practice. The answer to your question might depend on the software you're using, but what you're describing sounds like IGV (Integrative Genomics Viewer).

[Image: IGV screenshot]

You would load in a BAM file, and select an assembly (basically a reference genome for humans). Most of it will match the reference and is just those grey bands; each row is a read. In some cases, you have alternative alleles to the reference (in this file, 9 Cs, 4 Ts, 3 Gs and 160 As which are mostly truncated from the image.)

So you have multiple reads stacked, and sometimes there's noise, bad calls, or uneven allele coverage. Some real heterozygous calls could have a 50-50 split, or a 70-30 split, or something messier like that example. That little bar stack at the top shows mostly green (the 160 As) with a sliver of blue (the 9 Cs). In general I don't put too much value in calls from regions that have a lot of conflicting calls (say, if every 2-3 bases looked like that example call); the sequencing is likely poor in those regions. That being said, while you're right that you typically want to generate a VCF, I would argue that you should inspect anything really interesting (e.g. variants you want to report for a clinical sequencing study) in IGV or something similar, to make sure the call looks clean.
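For the counts quoted above (160 As, 9 Cs, 4 Ts, 3 Gs), the fractions behind that coloured bar can be worked out in a few lines of Python. This is just arithmetic on the numbers from the example; the variable names are my own.

```python
# Base counts at the example site, as quoted in the answer.
counts = {"A": 160, "C": 9, "T": 4, "G": 3}

depth = sum(counts.values())
fractions = {base: n / depth for base, n in counts.items()}

# Print bases from most to least frequent.
for base, frac in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{base}: {counts[base]:>3} reads ({frac:.1%})")
```

With a depth of 176, A accounts for about 91% of the reads and no other base reaches 10%, which is far from the roughly even split you would expect at a clean heterozygous site.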