Gene location of both inherited genes
8 months ago
bowman • 0

Hello, I am a beginner in bioinformatics, so sorry for the following question, but perhaps someone can help me:

I had a WGS of my genome done at Dante Labs, and both services give me the raw data. I learned to understand e.g. the VCF format, and also how chromosome changes and other mutations are expressed within a VCF, and I wrote some small VCF parsers in Java.

In classical genetics we learned that most genes exist twice: one copy from mom and one from dad.

But I absolutely do not understand how I can find the two genes within a SAM or VCF file. Take the LPL gene, for example: how can I find the LPL gene I got from my mother, and where can I find the LPL gene I got from my father?

If I look at "", it gives me one starting position of LPL on GRCh38.p14: 19939253

But if we have two copies of LPL (or of other genes), do I not need two starting positions?

Thank you for your help!!

inherited positions starting gene • 580 views
8 months ago
cmdcolin ★ 3.2k

the human reference genome is "haploid" meaning that, for example GRCh38.p14 has only one copy of chr1-22,X,Y. you would have two copies of the gene in each diploid cell in your body, but for bioinformatics, we only "map to a haploid reference". this answer is a little insufficient, so i'll try to explain where mom and dad might come in:

let's say we "sequenced your genome". if you sequenced your genome, you would get "reads" from the sequencer machine, which are like millions of 150bp sequences from illumina, in a FASTQ file.

what do you do to analyze your sequence data? well, you would map your reads versus a reference genome, the reference genome being like GRCh38.p14 there.

this mapping program, like bwa, or bowtie, or many others, produces a BAM or CRAM or SAM file. during mapping it can tell that a read belongs at some particular position on the reference even if there is, say, a 1 letter difference. such a difference could be a "SNP".
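
To make the "1 letter difference" concrete, here is a minimal sketch. It is not what bwa or bowtie do internally (real aligners search the whole genome and handle gaps); it only compares a read that has already been placed against the reference. The sequences, the position and the function name are made up for illustration.

```python
def find_mismatches(reference, read, start):
    """Compare a read placed at 1-based position `start` on the reference.

    Returns a list of (position, reference_base, read_base) tuples.
    Assumes an ungapped alignment (no insertions or deletions).
    """
    mismatches = []
    for offset, read_base in enumerate(read):
        ref_base = reference[start - 1 + offset]
        if read_base != ref_base:
            mismatches.append((start + offset, ref_base, read_base))
    return mismatches


# Toy data, invented for this example
reference = "ACGTTGCAAGGCTA"
read = "TTGCTAGG"  # placed at position 4 of the reference

print(find_mismatches(reference, read, 4))
```

A single mismatch in one read could also just be a sequencing error, which is why the next step looks at many reads covering the same position.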

then, maybe, only half the reads at a particular position have a SNP, and the other half don't. since approximately half the reads come from the chromosome copy you got from your mom and the other half from the copy you got from your dad, that means maybe you inherited a SNP from your mom (which is shown on half the reads) but didn't inherit a SNP from your dad, or the other way round. the reads alone do not tell you which parent it was.
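
As a rough illustration of the "half the reads" idea, the sketch below counts the bases that the reads show at one position and labels the site. The 0.2 and 0.8 cutoffs and the function name are arbitrary assumptions for this example; real variant callers use statistical models that also account for base quality and sequencing error.

```python
from collections import Counter


def classify_site(ref_base, read_bases, low=0.2, high=0.8):
    """Label a position from the bases observed in the reads covering it.

    `low` and `high` are illustrative cutoffs on the fraction of reads
    that differ from the reference, not values from any real caller.
    """
    counts = Counter(read_bases)
    depth = sum(counts.values())
    if depth == 0:
        return "no coverage"
    alt_fraction = 1 - counts[ref_base] / depth
    if alt_fraction < low:
        return "homozygous reference"
    if alt_fraction > high:
        return "homozygous alternate"
    return "heterozygous"


# Ten reads cover a position where the reference has an A:
print(classify_site("A", "AAAAGGGGAG"))  # half A, half G
print(classify_site("A", "GGGGGGGGGG"))  # every read shows G
```

A "heterozygous" result corresponds to the case described above, where only one of your two chromosome copies carries the SNP.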

this shows how we can get information about your diploid genome, even when we are comparing to a haploid reference. you do ask, well, shouldn't it be at two different start positions? well, we assume it only has one start position when we compare to the reference genome, and we only look at mutations to see if those mutations affect the gene of interest.
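
In practice, "looking at the gene" therefore means pulling out the VCF records that fall inside the gene's single coordinate range on the reference. A minimal sketch, assuming an uncompressed, tab-separated VCF and chromosome names written as "chr8" (some files use "8"). The start position is the one quoted in the question; the end position is an assumed value for illustration and should be checked against a gene annotation.

```python
def variants_in_region(vcf_lines, chrom, start, end):
    """Return the VCF records (as field lists) with start <= POS <= end."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            hits.append(fields)
    return hits


LPL_CHROM = "chr8"
LPL_START = 19939253  # from the question (GRCh38.p14)
LPL_END = 19967259    # assumed for illustration, verify before use

# Toy records, invented for this example
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr8\t19939000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
    "chr8\t19950000\t.\tC\tT\t50\tPASS\t.\tGT\t0|1",
    "chr9\t19950000\t.\tG\tA\t50\tPASS\t.\tGT\t1/1",
]

for record in variants_in_region(toy_vcf, LPL_CHROM, LPL_START, LPL_END):
    print(record[0], record[1], record[3], record[4], record[9])
```

Both of your copies of the gene are described by the same range; what differs between them shows up in the genotype of each record.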

in the future, "graph based genomes" may be used instead of a single reference genome. mapping to a graph based genome is kind of like mapping to multiple reference genomes at once, but it is a somewhat advanced topic that is still evolving, so comparing to a single reference genome is what is commonly done today.

hope that helps

note that, in VCF format, the data from mom and dad is listed in the per-sample "genotype" column (the GT field, e.g. 0/1 or 0|1). the genotypes can be phased (written with "|", which allows you to interpret the genotypes from multiple lines in a VCF as all coming from the same parent, mom or dad) or unphased (written with "/", meaning it is not known which chromosome copy each allele is on). even with phased blocks, you wouldn't know which copy came from mom and which from dad unless you genotyped mom and dad (e.g. sequenced them too) or did some ancestry analysis on the SNPs to estimate the likelihood of e.g. the haplotypes belonging to their ancestry. small explainer: What Are Phased And Unphased Genotypes?
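
Since the question mentions writing VCF parsers, here is a small sketch of reading the GT value and telling phased from unphased. It assumes a single-sample, tab-separated record; the helper names and the toy record are made up for illustration.

```python
def genotype_from_record(line, sample_index=0):
    """Pull the GT value for one sample out of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")                  # FORMAT column
    sample_values = fields[9 + sample_index].split(":")  # sample column
    return sample_values[format_keys.index("GT")]


def describe_genotype(gt):
    """Return (phased, alleles, zygosity) for a GT string like '0|1'."""
    phased = "|" in gt  # '|' separates phased alleles, '/' unphased ones
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        zygosity = "missing"
    elif len(set(alleles)) == 1:
        zygosity = "homozygous"
    else:
        zygosity = "heterozygous"
    return phased, alleles, zygosity


# Toy record, invented for this example
record = "chr8\t19950000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0|1:30"
gt = genotype_from_record(record)
print(gt, describe_genotype(gt))
```

In a phased block, the allele before the "|" is on one chromosome copy across all records of that block and the allele after it is on the other, but the file by itself does not say which copy is maternal.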

Thank you very much for the clear explanation! Now I understand the basic idea behind this topic.

