Question

learning human genetic predictions using genome vs. ClinVar

0

4.7 years ago

gnmcsbnfrmtcsclb ▴ 70

We are a group of about 20 rising sophomores and juniors who are interested in learning new concepts in genetics and genomics together. We would appreciate your help with the following questions about analyzing human genome sequences.

A little context: one student in our group has a Senegalese father and a Japanese mother. They had their genomes sequenced (shotgun, 30X coverage) and very generously shared their data with us as the following file types: FASTQ, CRAM, CRAI, VCF and TBI.

Our questions are:

1. Is there a detailed tutorial you would recommend that we can use to predict disease states by comparing the VCF file (given to us) against the ClinVar database? Is it possible to do this with locally installed software and database(s)?
2. Does having parents with different ethnicities complicate the use and/or interpretation of the ClinVar database?
3. Is there a detailed tutorial on how to convert a CRAM file to a genome sequence? This would require us to know which reference the reads were aligned to, in order to convert the alignments back to sequences, right?
4. For human NGS (Illumina-based FASTQ sequences), is there a standard pipeline for de novo genome assembly without a reference? If yes, then please share link(s) and tutorials. Thank you.
5. For any given assembled human genome, is there a standard pipeline for genome annotation? If yes, then please share link(s) and tutorials. Thanks again.

Through some postdocs we know, we have access to HPCC accounts, so we can run jobs on more than 10 CPUs at a time, with more than 100 GB of memory.

Thanks in advance for your advice, suggestions and sharing relevant links to software and tutorials.

ClinVar genotype genome RNA-Seq • 1.5k views

ADD COMMENT • link updated 4.7 years ago by JC 13k • written 4.7 years ago by gnmcsbnfrmtcsclb ▴ 70

1

I don't think 30X is good enough coverage to make clinically accurate determinations. Also, anything even remotely accurate needs to be vetted by doctors and clinical genetics counselors, even for simple single-gene disorders, as no genotype is associated with a fixed phenotype to a "set in stone" level. We learn new information every day, and ClinVar doesn't really measure up to a clinically usable database.

The questions you ask above need a team of full-time experts to consult and explain; that's not something you can expect from an online forum of volunteers.

ADD REPLY • link 4.7 years ago by Ram 45k

0

Thank you for your response. Is there a scientific consensus about the minimum acceptable fold coverage for sequencing in order to draw clinically related conclusions? And is there an open source database like ClinVar that folks use and prefer over ClinVar? Thanks again.
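
For readers following along, here is a minimal sketch of what "fold coverage" means arithmetically. It does not answer the consensus question; it only shows how mean coverage is computed and, under an idealized Poisson model of read depth, how often a site would fall below a chosen depth. The read count, read length, genome size and the depth threshold of 10 are illustrative assumptions, not values from this thread, and real sequencing depth is less uniform than a Poisson model suggests.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def prob_depth_below(mean_depth: float, min_depth: int) -> float:
    """P(depth < min_depth) at one site, assuming Poisson-distributed depth."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )


# Illustrative numbers only (assumptions): 600 million reads of 150 bp
# over a genome of roughly 3.1 billion bases.
coverage = mean_coverage(600_000_000, 150, 3_100_000_000)
print(f"mean coverage: {coverage:.1f}X")

# Chance that a given site has fewer than 10 reads when the mean is 30X,
# under the idealized Poisson assumption.
print(f"P(depth < 10 | mean 30X) = {prob_depth_below(30, 10):.2e}")
```

The Poisson figure is a lower bound on how patchy coverage can be; GC bias and repetitive regions make low-depth sites more common in practice.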

ADD REPLY • link 4.7 years ago by gnmcsbnfrmtcsclb ▴ 70

1

You could try HGMD (which is manually curated with information taken from publications), which is IMO a tad better than ClinVar, but I doubt that will make a difference. I'm not sure of the preferred coverage for clinical-level accuracy, but mutation data alone cannot predict very many diseases.

In any case, you may want to restrict yourself to pathogenic entries from ClinVar - ideally, only those that do not have conflicting evidence, where every piece of evidence points to the mutation being pathogenic.
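
A rough sketch of that restriction, using only the Python standard library, might look like the following. It assumes the ClinVar VCF release stores the clinical significance in an INFO key named `CLNSIG` with values such as `Pathogenic` and `Conflicting_interpretations_of_pathogenicity`; check the header of the file you actually download, since key names and value spellings can change between releases. The example records are made up for illustration.

```python
def parse_info(info: str) -> dict:
    """Turn a VCF INFO string such as 'A=1;B=x;FLAG' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# Assumption: only records whose CLNSIG is exactly one of these are kept.
ACCEPTED_CLNSIG = {"Pathogenic"}


def pathogenic_records(lines, accepted=ACCEPTED_CLNSIG):
    """Yield VCF data lines whose CLNSIG value is in `accepted`."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if parse_info(cols[7]).get("CLNSIG", "") in accepted:
            yield line


# Made-up records, for illustration only.
EXAMPLE_LINES = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t111\tA\tG\t.\t.\tCLNSIG=Pathogenic",
    "1\t2000\t222\tC\tT\t.\t.\tCLNSIG=Conflicting_interpretations_of_pathogenicity",
    "2\t3000\t333\tG\tA\t.\t.\tCLNSIG=Benign",
]

for record in pathogenic_records(EXAMPLE_LINES):
    print(record.split("\t")[2])  # prints the ID column of each kept record
```

Matching on the exact string is deliberately strict: combined or conflicting labels are dropped rather than guessed at.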

ADD REPLY • link 4.7 years ago by Ram 45k

0

Thank you. We are going to use the recommendations from you and JC to learn new concepts. It may take us at least a few weeks of learning from tutorials, with some small and simple test cases, before we can even start the analysis we envision. At that point, we will post any follow-up questions or doubts. Also, we think it may be better for us to start with some higher-coverage data (~100X), rather than get stuck with a genome assembly or VCF file that becomes a hurdle to learning these analyses. So if you have any suggestions for such a test genome that is openly available for download and use, please share. Thanks again.

ADD REPLY • link 4.7 years ago by gnmcsbnfrmtcsclb ▴ 70

1

You can search SRA for datasets at that level of coverage, but I am not sure if you'll find any clinical grade dataset.

ADD REPLY • link 4.7 years ago by Ram 45k

score 5 · Accepted Answer · 2020-10-15

5

4.7 years ago

JC 13k

I agree with what RamRS pointed out: your coverage is too low, and no finding is definitive unless you have more data and expert evaluation by geneticists and doctors. If you want the data just to showcase what personal genomics is:

1. Annotate your VCFs with VEP, and also use GEMINI for easy filtering.
2. Not really, but you will find that many variants have not been annotated in any database, so there is no allele frequency evidence for them.
3. You can convert the CRAM to BAM and create a consensus, and yes, you need the reference used in the mapping.
4. De novo assembly will require more than 30X coverage and a lot of memory.
5. What do you mean by "genome annotation"? The human genome is known.
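
To make the underlying idea of "comparing a VCF against a database" concrete, here is a small standard-library sketch. It is not what VEP or GEMINI do internally; it only matches variants on the key (chromosome, position, REF, ALT). It assumes both files use the same genome build and that indels are represented the same way in both, and it strips a leading `chr` so that `chr1` and `1` compare equal. The records are made up for illustration.

```python
def variant_keys(lines):
    """Yield (chrom, pos, ref, alt) for every ALT allele in VCF text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # harmonize 'chr1' and '1'
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            yield (chrom, int(pos), ref, alt)


def shared_variants(sample_lines, database_lines):
    """Return sample variants whose key also appears in the database."""
    database = set(variant_keys(database_lines))
    return [key for key in variant_keys(sample_lines) if key in database]


# Made-up records, for illustration only.
SAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1000\t.\tA\tG",
    "chr2\t500\t.\tC\tT,G",
    "chr3\t42\t.\tG\tA",
]
DATABASE = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t111\tA\tG",
    "2\t500\t222\tC\tG",
]

for key in shared_variants(SAMPLE, DATABASE):
    print(key)
```

Variants in the sample with no database match (like the third record here) simply have no entry, which is the situation described in point 2.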

ADD COMMENT • link 4.7 years ago by JC 13k

0

Thank you so much for your systematic replies. Our goal is to learn, so we thought that data shared with us for free would be an interesting way to help us learn. We will try out those tools as a learning exercise in both theoretical concepts and practical analyses.

Usually, for a human genome with only 30X coverage, if the assembly is reference-guided rather than de novo (as was done here by a sequencing company), wouldn't regions that show presence/absence variation between the reference and target genomes cause the assembly to not reflect the actual sequence? We are thinking this way because of a famous paper that compared an African pan-genome to the European reference genome - link. And because, as we mentioned in our OP, the parents are African and Asian; neither is European.

By genome annotation, we mean predicting genes, pseudogenes, transposons etc. Is the VCF generated agnostic of, or even without, annotation of the new assembly? Is the VCF file generated based just on mapping, or by comparing annotated gene sequences that are homologous and at the same syntenic loci? Sorry if our jargon is a little confused or confusing, but hopefully we've explained our questions clearly enough. Thanks in advance.
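
One way to picture the distinction being asked about, as a sketch under stated assumptions: in a typical workflow a VCF record is just a coordinate on the reference plus the REF and ALT alleles, and gene-level annotation is a separate step that overlaps those coordinates with feature intervals. How the VCF in this thread was actually produced is not known. The feature names and coordinates below are hypothetical.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def overlapping_features(chrom: str, pos: int, features: List[Feature]) -> List[str]:
    """Return names of features whose interval contains the variant position."""
    return [
        f.name
        for f in features
        if f.chrom == chrom and f.start <= pos <= f.end
    ]


# Hypothetical annotation intervals, for illustration only.
FEATURES = [
    Feature("1", 900, 1500, "GENE_A"),
    Feature("1", 1400, 2000, "PSEUDOGENE_B"),
    Feature("2", 100, 300, "TRANSPOSON_C"),
]

# A variant is a position on the reference; annotation is looked up afterwards.
print(overlapping_features("1", 1450, FEATURES))
print(overlapping_features("2", 500, FEATURES))
```

A linear scan like this is fine for a handful of features; real annotation sets are large enough that tools use indexed interval lookups instead.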

ADD REPLY • link 4.7 years ago by gnmcsbnfrmtcsclb ▴ 70